💻A small collection of awesome LLM inference [Papers|Blogs|Docs] with code, covering TensorRT-LLM, streaming-llm, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.


📒Introduction

Awesome-LLM-Inference: a small collection of awesome LLM inference [Papers|Blogs|Tech Reports|Docs] with code. Please check 📙Awesome LLM Inference Papers with Codes below for more details.

🎉Download PDFs

Awesome-LLM-Inference-v0.3.pdf: 500 pages, covering ByteTransformer, FastServe, FlashAttention 1/2, FlexGen, FP8, LLM.int8(), Tensor Cores, PagedAttention, RoPE, SmoothQuant, SpecInfer, WINT8/4, Continuous Batching, ZeroQuant 1/2/FP, AWQ, FlashDecoding, FlashDecoding++, FP8-LM, LLM-FP4, StreamingLLM, etc.

📙Awesome LLM Inference Papers with Codes

| Date | Title | Paper | Code | Recommend |
|:---|:---|:---|:---|:---|
| 2018.03 | [Tensor Core] NVIDIA Tensor Core Programmability, Performance & Precision | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2018.05 | [Online Softmax] Online normalizer calculation for softmax | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2020.05 | 🔥🔥[Megatron-LM] Training Multi-Billion Parameter Language Models Using Model Parallelism | [arxiv][pdf] | [GitHub][Megatron-LM] | ⭐️⭐️⭐️⭐️⭐️ |
| 2021.04 | [RoPE] ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING | [arxiv][pdf] | [GitHub][transformers] | ⭐️⭐️⭐️ |
| 2022.05 | 🔥🔥[FlashAttention] Fast and Memory-Efficient Exact Attention with IO-Awareness | [arxiv][pdf] | [GitHub][flash-attention] | ⭐️⭐️⭐️⭐️⭐️ |
| 2022.06 | 🔥🔥[ZeroQuant] Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | [arxiv][pdf] | [GitHub][DeepSpeed] | ⭐️⭐️⭐️⭐️⭐️ |
| 2022.07 | 🔥🔥[Continuous Batching] Orca: A Distributed Serving System for Transformer-Based Generative Models | [osdi22-yu][pdf] | ⚠️ | ⭐️⭐️⭐️⭐️⭐️ |
| 2022.08 | [FP8-Quantization] FP8 Quantization: The Power of the Exponent | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2022.08 | [LLM.int8()] 8-bit Matrix Multiplication for Transformers at Scale | [arxiv][pdf] | [GitHub][bitsandbytes] | ⭐️⭐️⭐️ |
| 2022.09 | [FP8] FP8 FORMATS FOR DEEP LEARNING | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2022.10 | [Online Softmax] SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2022.10 | [ByteTransformer] A High-Performance Transformer Boosted for Variable-Length Inputs | [arxiv][pdf] | [GitHub][ByteTransformer] | ⭐️⭐️⭐️ |
| 2022.11 | 🔥🔥[WINT8/4] Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production | [arxiv][pdf] | [GitHub][FasterTransformer] | ⭐️⭐️⭐️⭐️⭐️ |
| 2022.11 | 🔥🔥[SmoothQuant] Accurate and Efficient Post-Training Quantization for Large Language Models | [arxiv][pdf] | [GitHub][smoothquant] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU | [arxiv][pdf] | [GitHub][FlexGen] | ⭐️⭐️⭐️ |
| 2023.03 | [ZeroQuant-V2] Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | [arxiv][pdf] | [GitHub][DeepSpeed] | ⭐️⭐️⭐️ |
| 2023.05 | [SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification | [arxiv][pdf] | [GitHub][FlexFlow] | ⭐️⭐️⭐️ |
| 2023.05 | [FastServe] Fast Distributed Inference Serving for Large Language Models | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.05 | [FlashAttention] From Online Softmax to FlashAttention | [cse599m][flashattn.pdf] | ⚠️ | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.05 | [FLOP, I/O] Dissecting Batching Effects in GPT Inference | [blog en/cn] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.06 | [Sparse FlashAttention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention | [arxiv][pdf] | [GitHub][dynamic-sparse-flash-attention] | ⭐️⭐️⭐️ |
| 2023.06 | 🔥🔥[AWQ] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | [arxiv][pdf] | [GitHub][llm-awq] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.06 | [SpQR] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | [arxiv][pdf] | [GitHub][SpQR] | ⭐️⭐️⭐️ |
| 2023.06 | [SqueezeLLM] SQUEEZELLM: DENSE-AND-SPARSE QUANTIZATION | [arxiv][pdf] | [GitHub][SqueezeLLM] | ⭐️⭐️⭐️ |
| 2023.07 | [ZeroQuant-FP] A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats | [arxiv][pdf] | [GitHub][DeepSpeed] | ⭐️⭐️⭐️ |
| 2023.07 | 🔥🔥[FlashAttention-2] Faster Attention with Better Parallelism and Work Partitioning | [arxiv][pdf] | [GitHub][flash-attention] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.08 | [Tensor Cores] Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library | [arxiv][pdf] | [GitHub][wmma_extension] | ⭐️⭐️⭐️ |
| 2023.09 | 🔥🔥[PagedAttention] Efficient Memory Management for Large Language Model Serving with PagedAttention | [arxiv][pdf] | [GitHub][vllm] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.09 | [StreamingLLM] EFFICIENT STREAMING LANGUAGE MODELS WITH ATTENTION SINKS | [arxiv][pdf] | [GitHub][streaming-llm] | ⭐️⭐️⭐️ |
| 2023.09 | [KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization | [ZhiHu Tech Blog] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.09 | [Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads | [blog] | [GitHub][Medusa] | ⭐️⭐️⭐️ |
| 2023.10 | 🔥🔥[Flash-Decoding] Flash-Decoding for long-context inference | [tech report] | [GitHub][flash-attention] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.10 | 🔥🔥[TensorRT-LLM] NVIDIA TensorRT LLM | [TensorRT-LLM’s Docs] | [GitHub][TensorRT-LLM] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.10 | [FP8-LM] FP8-LM: Training FP8 Large Language Models | [arxiv][pdf] | [GitHub][MS-AMP] | ⭐️⭐️⭐️ |
| 2023.10 | [LLM-Shearing] SHEARED LLAMA: ACCELERATING LANGUAGE MODEL PRE-TRAINING VIA STRUCTURED PRUNING | [arxiv][pdf] | [GitHub][LLM-Shearing] | ⭐️⭐️⭐️ |
| 2023.11 | [Flash-Decoding++] FLASHDECODING++: FASTER LARGE LANGUAGE MODEL INFERENCE ON GPUS | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.11 | [LLM CPU Inference] Efficient LLM Inference on CPUs | [arxiv][pdf] | [GitHub][intel-extension-for-transformers] | ⭐️⭐️⭐️ |
| 2023.10 | [LLM-FP4] LLM-FP4: 4-Bit Floating-Point Quantized Transformers | [arxiv][pdf] | [GitHub][LLM-FP4] | ⭐️⭐️⭐️ |
| 2023.01 | [SparseGPT] SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot | [arxiv][pdf] | [GitHub][sparsegpt] | ⭐️⭐️⭐️ |
| 2023.11 | 🔥🔥[DeepSpeed-FastGen 2x vLLM] DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference | [github][blog] | [GitHub][deepspeed-fastgen] | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.11 | 🔥🔥[HyperAttention] HyperAttention: Long-context Attention in Near-Linear Time | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️⭐️⭐️ |
| 2023.11 | [Splitwise] Splitwise: Efficient Generative LLM Inference Using Phase Splitting | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
| 2023.11 | [2-bit LLM] Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization | [arxiv][pdf] | ⚠️ | ⭐️⭐️⭐️ |
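
Several entries above (the two Online Softmax papers and the cse599m notes "From Online Softmax to FlashAttention") rest on the same trick: the softmax normalizer can be accumulated in a single streaming pass by tracking a running maximum and rescaling the running sum whenever that maximum grows, which is what lets FlashAttention and Flash-Decoding tile attention without materializing the full score matrix. Below is a minimal, illustrative NumPy sketch of that accumulation; it is not taken from any of the linked codebases, and the names are my own.

```python
import numpy as np

def online_softmax(x):
    # Accumulate the softmax statistics in one streaming pass:
    # m is the running maximum, d the running sum of exp(x_i - m).
    # Whenever m grows, the existing sum is rescaled by exp(m_old - m_new).
    m, d = -np.inf, 0.0
    for xi in x:
        m_new = max(m, xi)
        d = d * np.exp(m - m_new) + np.exp(xi - m_new)
        m = m_new
    # Final normalization reuses the streamed statistics (m, d).
    return np.exp(x - m) / d

# Agrees with the usual two-pass, max-subtracted softmax.
np.random.seed(0)
x = np.random.randn(16)
ref = np.exp(x - x.max())
ref /= ref.sum()
assert np.allclose(online_softmax(x), ref)
```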
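
Likewise, the core idea behind the SmoothQuant entry fits in a few lines: migrate quantization difficulty from activations to weights with a per-input-channel scale s_j = max|X[:, j]|^α / max|W[j, :]|^(1-α), since (X · diag(s)⁻¹)(diag(s) · W) = XW while the rescaled activations have much smaller outliers. The NumPy sketch below only demonstrates that equivalence (α = 0.5 as a typical smoothing factor); the function and variable names are illustrative and not from the smoothquant repo.

```python
import numpy as np

def smooth_scales(X, W, alpha=0.5):
    # Per-input-channel smoothing scales:
    #   s_j = max|X[:, j]|^alpha / max|W[j, :]|^(1 - alpha)
    act_max = np.abs(X).max(axis=0)    # activation range per input channel
    w_max = np.abs(W).max(axis=1)      # weight range per input channel
    return act_max ** alpha / (w_max ** (1.0 - alpha) + 1e-8)  # eps only to avoid /0

np.random.seed(0)
X = np.random.randn(16, 64)
X[:, 3] *= 50.0                        # inject an activation outlier channel
W = np.random.randn(64, 32)

s = smooth_scales(X, W)
X_s, W_s = X / s, W * s[:, None]       # scale activations down, weights up
assert np.allclose(X @ W, X_s @ W_s)   # the matmul result is unchanged
assert np.abs(X_s).max() < np.abs(X).max()  # activation outliers are flattened
```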

©️License

GNU General Public License v3.0
