vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Paper | Discord |

Figure: performance comparison

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Optimized CUDA kernels

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server (a minimal client sketch follows this list)
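
Once the OpenAI-compatible server has been started, it can be queried with any HTTP client that speaks the OpenAI completions API. A minimal sketch, assuming a server is already running locally on port 8000 and serving facebook/opt-125m (the host, port, prompt, and model name are illustrative):

import requests

# Query a locally running vLLM OpenAI-compatible server (assumed to be
# started separately; endpoint follows the OpenAI /v1/completions schema).
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 32,
        "temperature": 0.7,
    },
)
print(response.json()["choices"][0]["text"])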

Install

Install vLLM with pip or from source:

pip install vllm
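
After installation, models can be run offline through the LLM and SamplingParams classes. A minimal quickstart sketch (the prompts and sampling settings are illustrative):

from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")  # model weights are fetched from Hugging Face
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)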

Quantization

Previous versions of vLLM already support AWQ and SqueezeLLM quantization. This submission adds an int8 quantization mode, which uses bitsandbytes kernels for 8-bit operations, as well as a new 4-bit quantization implementation: 4-bit groupwise round-to-nearest (RTN) quantization for linear layers. The smoothed model is loaded directly into vLLM, which completes the 4-bit weight quantization automatically. For the quantized linear layers, we implemented an efficient W4A16 CUDA kernel, optimized from lmdeploy, which further improves the speedup.

We will soon submit the SmoothQuant+ algorithm, which smooths the model per channel, to a separate repository. With SmoothQuant+, Code Llama-34B can be quantized and deployed on a single 40 GB A100 GPU with lossless accuracy, achieving 1.9-4.0x higher throughput than the FP16 model deployed on two 40 GB A100 GPUs; the per-token latency is only 68% of that FP16 baseline. To the best of our knowledge, this is the state of the art in 4-bit weight quantization.

Figures: single-sequence generation with Code Llama-34B on ShareGPT; comparison of inference throughput with Code Llama-34B

Configuring int8 or int4 is simple: set auto_quant_mode to llm_int8 or weight_int4.

from vllm import LLM

llm = LLM(model="facebook/opt-125m",
          trust_remote_code=True,
          dtype='float16',
          # auto_quant_mode="llm_int8",  # 8-bit weights via bitsandbytes
          auto_quant_mode="weight_int4")  # 4-bit groupwise (RTN) weights
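
The quantized model is then used like any other vLLM model. A minimal generation sketch, reusing the llm object from the snippet above (the prompt and sampling settings are illustrative):

from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["def quicksort(arr):"], sampling_params)
print(outputs[0].outputs[0].text)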

Acknowledgements

We would like to thank the following projects for their excellent work:
