vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Paper | Discord |

Figure: performance comparison

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Optimized CUDA kernels

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server (a minimal client sketch follows this list)
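
Once the OpenAI-compatible server has been started, it can be queried with any HTTP client that speaks the OpenAI completions API. A minimal sketch, assuming a server is already running locally on port 8000 and serving facebook/opt-125m (the host, port, prompt, and model name are illustrative):

import requests

# Query a locally running vLLM OpenAI-compatible server (assumed to be
# started separately; endpoint follows the OpenAI /v1/completions schema).
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 32,
        "temperature": 0.7,
    },
)
print(response.json()["choices"][0]["text"])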

Install

Install vLLM with pip or from source:

pip install vllm
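
After installation, models can be run offline through the LLM and SamplingParams classes. A minimal quickstart sketch (the prompts and sampling settings are illustrative):

from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")  # model weights are fetched from Hugging Face
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)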

Quantization

Previous versions of vLLM already support AWQ and SqueezeLLM quantization. This submission adds an int8 quantization mode, which uses bitsandbytes kernels for 8-bit operations, as well as a new 4-bit quantization implementation: 4-bit groupwise round-to-nearest (RTN) quantization for linear layers. The smoothed model is loaded directly into vLLM, which completes the 4-bit weight quantization automatically. For the quantized linear layers, we implemented an efficient W4A16 CUDA kernel, optimized from lmdeploy, which further improves the speedup.

We will soon submit the SmoothQuant+ algorithm, which smooths the model per channel, to a separate repository. With SmoothQuant+, Code Llama-34B can be quantized and deployed on a single 40 GB A100 GPU with lossless accuracy, achieving 1.9-4.0x higher throughput than the FP16 model deployed on two 40 GB A100 GPUs; the per-token latency is only 68% of that FP16 baseline. To the best of our knowledge, this is the state of the art in 4-bit weight quantization.

Figures: single-sequence generation with Code Llama-34B on ShareGPT; comparison of inference throughput with Code Llama-34B

Configuring int8 or int4 is simple: set auto_quant_mode to llm_int8 or weight_int4.

from vllm import LLM

llm = LLM(model="facebook/opt-125m",
          trust_remote_code=True,
          dtype='float16',
          # auto_quant_mode="llm_int8",  # 8-bit weights via bitsandbytes
          auto_quant_mode="weight_int4")  # 4-bit groupwise (RTN) weights
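
The quantized model is then used like any other vLLM model. A minimal generation sketch, reusing the llm object from the snippet above (the prompt and sampling settings are illustrative):

from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["def quicksort(arr):"], sampling_params)
print(outputs[0].outputs[0].text)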

Acknowledgements

We would like to thank the following projects for their excellent work:
