Releases: vllm-project/vllm
v0.4.2
Highlights
Features
- Chunked prefill is ready for testing! It improves inter-token latency in high-load scenarios by chunking prompt processing and prioritizing decode (#4580)
- Speculative decoding functionalities: logprobs (#4378), ngram (#4237)
- Support FlashInfer as attention backend (#4353)
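The chunked prefill idea described above can be illustrated with a toy scheduler. This is a minimal sketch, not vLLM's scheduler: the `Request` class, `schedule_step`, and the `token_budget` parameter are hypothetical names chosen for illustration. Each step has a fixed token budget; running decodes are served first (one token each), and the remaining budget is spent on a chunk of a pending prompt.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_left: int        # prompt tokens still waiting to be prefilled
    decoding: bool = False  # True once the whole prompt has been processed


def schedule_step(requests, token_budget):
    """Return {request name: tokens scheduled} for one engine step (toy model)."""
    plan = {}
    # 1) Decodes first: every running decode costs exactly one token.
    for r in requests:
        if r.decoding and token_budget > 0:
            plan[r.name] = 1
            token_budget -= 1
    # 2) Spend whatever budget is left on prompt chunks.
    for r in requests:
        if not r.decoding and r.prompt_left > 0 and token_budget > 0:
            chunk = min(r.prompt_left, token_budget)
            plan[r.name] = chunk
            r.prompt_left -= chunk
            token_budget -= chunk
            if r.prompt_left == 0:
                r.decoding = True  # starts decoding on the next step
    return plan


# Request "a" is already decoding; request "b" arrives with a 10-token prompt.
reqs = [Request("a", 0, decoding=True), Request("b", 10)]
steps = [schedule_step(reqs, token_budget=5) for _ in range(4)]
for i, plan in enumerate(steps, start=1):
    print(f"step {i}: {plan}")
```

In this toy run, request "a" emits a token on every step while the prompt of "b" is processed in chunks of at most four tokens, which is the latency effect the feature aims for.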
Models and Enhancements
- Add support for Phi-3-mini (#4298, #4372, #4380)
- Add more histogram metrics (#2764, #4523)
- Full tensor parallelism for LoRA layers (#3524)
- Expanding Marlin kernel to support all GPTQ models (#3922, #4466, #4533)
Dependency Upgrade
- Upgrade to `torch==2.3.0` (#4454)
- Upgrade to `tensorizer==2.9.0` (#4467)
- Expansion of AMD test suite (#4267)
Progress and Dev Experience
- Centralize and document all environment variables (#4548, #4574)
- Progress towards fully typed codebase (#4337, #4427, #4555, #4450)
- Progress towards pipeline parallelism (#4512, #4444, #4566)
- Progress towards multiprocessing based executors (#4348, #4402, #4419)
- Progress towards FP8 support (#4343, #4332, #4527)
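The n-gram speculative decoding listed under Features (#4237) is based on prompt lookup: draft tokens are proposed by finding an earlier occurrence of the most recent n-gram and copying what followed it, and the target model then verifies the drafts. The sketch below shows only the proposal step in plain Python. The function name and its parameters (`ngram_size`, `num_speculative`) are assumptions for illustration, not vLLM's API.

```python
def propose_from_ngram(tokens, ngram_size=2, num_speculative=3):
    """Propose draft tokens by matching the trailing n-gram earlier in `tokens`."""
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan right to left, skipping the trailing n-gram itself,
    # so the most recent earlier match wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return tokens[follow:follow + num_speculative]
    return []  # no match: fall back to normal decoding


history = [5, 6, 7, 8, 9, 5, 6]          # token ids: prompt plus generated so far
draft = propose_from_ngram(history)       # the trailing [5, 6] also appears at the start
no_match = propose_from_ngram([1, 2, 3, 4])
print(draft, no_match)
```

Because no draft model is needed, this approach tends to help most when the output repeats spans of the input, such as in summarization or code editing.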
What's Changed
- [Core][Distributed] use existing torch.cuda.device context manager by @youkaichao in #4318
- [Misc] Update ShareGPT Dataset Sampling in Serving Benchmark by @ywang96 in #4279
- [Bugfix] Fix marlin kernel crash on H100 by @alexm-nm in #4218
- [Doc] Add note for docker user by @youkaichao in #4340
- [Misc] Use public API in benchmark_throughput by @zifeitong in #4300
- [Model] Adds Phi-3 support by @caiom in #4298
- [Core] Move ray_utils.py from `engine` to `executor` package by @njhill in #4347
- [Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 by @Isotr0py in #4324
- [CI/Build] Adding functionality to reset the node's GPUs before processing. by @Alexei-V-Ivanov-AMD in #4213
- [Doc] README Phi-3 name fix. by @caiom in #4372
- [Core]refactor aqlm quant ops by @jikunshang in #4351
- [Mypy] Typing lora folder by @rkooo567 in #4337
- [Misc] Optimize flash attention backend log by @esmeetu in #4368
- [Core] Add `shutdown()` method to `ExecutorBase` by @njhill in #4349
- [Core] Move function tracing setup to util function by @njhill in #4352
- [ROCm][Hardware][AMD][Doc] Documentation update for ROCm by @hongxiayang in #4376
- [Bugfix] Fix parameter name in `get_tokenizer` by @DarkLight1337 in #4107
- [Frontend] Add --log-level option to api server by @normster in #4377
- [CI] Disable non-lazy string operation on logging by @rkooo567 in #4326
- [Core] Refactoring sampler and support prompt logprob for chunked prefill by @rkooo567 in #4309
- [Misc][Refactor] Generalize linear_method to be quant_method by @comaniac in #4373
- [Misc] add RFC issue template by @youkaichao in #4401
- [Core] Introduce `DistributedGPUExecutor` abstract class by @njhill in #4348
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales by @pcmoritz in #4343
- [Frontend][Bugfix] Disallow extra fields in OpenAI API by @DarkLight1337 in #4355
- [Misc] Fix logger format typo by @esmeetu in #4396
- [ROCm][Hardware][AMD] Enable group query attention for triton FA by @hongxiayang in #4406
- [Kernel] Full Tensor Parallelism for LoRA Layers by @FurtherAI in #3524
- [Model] Phi-3 4k sliding window temp. fix by @caiom in #4380
- [Bugfix][Core] Fix get decoding config from ray by @esmeetu in #4335
- [Bugfix] Abort requests when the connection to /v1/completions is interrupted by @chestnut-Q in #4363
- [BugFix] Fix `min_tokens` when `eos_token_id` is None by @njhill in #4389
- ✨ support local cache for models by @prashantgupta24 in #4374
- [BugFix] Fix return type of executor execute_model methods by @njhill in #4402
- [BugFix] Resolved Issues For LinearMethod --> QuantConfig by @robertgshaw2-neuralmagic in #4418
- [Misc] fix typo in llm_engine init logging by @DefTruth in #4428
- Add more Prometheus metrics by @ronensc in #2764
- [CI] clean docker cache for neuron by @simon-mo in #4441
- [mypy][5/N] Support all typing on model executor by @rkooo567 in #4427
- [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin by @robertgshaw2-neuralmagic in #3922
- [CI] hotfix: soft fail neuron test by @simon-mo in #4458
- [Core][Distributed] use cpu group to broadcast metadata in cpu by @youkaichao in #4444
- [Misc] Upgrade to `torch==2.3.0` by @mgoin in #4454
- [Bugfix][Kernel] Fix compute_type for MoE kernel by @WoosukKwon in #4463
- [Core]Refactor gptq_marlin ops by @jikunshang in #4466
- [BugFix] fix num_lookahead_slots missing in async executor by @leiwen83 in #4165
- [Doc] add visualization for multi-stage dockerfile by @prashantgupta24 in #4456
- [Kernel] Support Fp8 Checkpoints (Dynamic + Static) by @robertgshaw2-neuralmagic in #4332
- [Frontend] Support complex message content for chat completions endpoint by @fgreinacher in #3467
- [Frontend] [Core] Tensorizer: support dynamic `num_readers`, update version by @alpayariyak in #4467
- [Bugfix][Minor] Make ignore_eos effective by @bigPYJ1151 in #4468
- fix_tokenizer_snapshot_download_bug by @kingljl in #4493
- Unable to find Punica extension issue during source code installation by @kingljl in #4494
- [Core] Centralize GPU Worker construction by @njhill in #4419
- [Misc][Typo] type annotation fix by @HarryWu99 in #4495
- [Misc] fix typo in block manager by @Juelianqvq in #4453
- Allow user to define whitespace pattern for outlines by @robcaulk in #4305
- [Misc]Add customized information for models by @jeejeelee in #4132
- [Test] Add ignore_eos test by @rkooo567 in #4519
- [Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. by @AnyISalIn in #4173
- [Bugfix] Fix 307 Redirect for `/metrics` by @robertgshaw2-neuralmagic in #4523
- [Doc] update(example model): for OpenAI compatible serving by @fpaupier in #4503
- [Bugfix] Use random seed if seed is -1 by @sasha0552 in #4531
- [CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation by @tjohnson31415 in #4534
- [Speculative decoding] Add ngram prompt lookup decoding by @leiwen83 in #4237
- [Core] Enable prefix caching with block manager v2 enabled by @leiwen83 in #4142
- [Core] Add `multiproc_worker_utils` for multiprocessing-based workers by @njhill in #4357
- [Kernel] Update fused_moe tuning script for FP8 by @pcmoritz in #4457
- [Bugfix] Add validation for seed by @sasha0552 in #4529
- [Bugfix][Core] Fix and refactor logging stats by @esmeetu in #4336
- [Core][Distributed] fix pynccl del error by @youkaichao in #4508
- [Misc] Remove Mixtral device="cuda" declarations by @pcmoritz in #4543
- [Misc] Fix expert_ids shape in MoE by @WoosukKwon in #4517
- [MISC] Rework logger to enable pythonic custom logging configuration to be provided by @tdg5 in #4273
- [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens i...
v0.4.1
Highlights
Features
- Support and enhance CommandR+ (#3829), minicpm (#3893), Meta Llama 3 (#4175, #4182), Mixtral 8x22b (#4073, #4002)
- Support private model registration and update our model support policy (#3871, #3948)
- Support PyTorch 2.2.1 and Triton 2.2.0 (#4061, #4079, #3805, #3904, #4271)
- Add option for using LM Format Enforcer for guided decoding (#3868)
- Add an option to make tokenizer and detokenizer initialization optional (#3748)
- Add an option to load models using `tensorizer` (#3476)
Enhancements
- vLLM is now mostly type checked by `mypy` (#3816, #4006, #4161, #4043)
- Progress towards chunked prefill scheduler (#3550, #3853, #4280, #3884)
- Progress towards speculative decoding (#3250, #3706, #3894)
- Initial FP8 support with dynamic per-tensor scaling (#4118)
Hardware
- Intel CPU inference backend is added (#3993, #3634)
- AMD backend is enhanced with Triton kernel and e4m3fn KV cache (#3643, #3290)
What's Changed
- [Kernel] Layernorm performance optimization by @mawong-amd in #3662
- [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by @youkaichao in #3746
- [CI/Build] Make Marlin Tests Green by @robertgshaw2-neuralmagic in #3753
- [Misc] Minor fixes in requirements.txt by @WoosukKwon in #3769
- [Misc] Some minor simplifications to detokenization logic by @njhill in #3670
- [Misc] Fix Benchmark TTFT Calculation for Chat Completions by @ywang96 in #3768
- [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by @cadedaniel in #3250
- [Misc] Add support for new autogptq checkpoint_format by @Qubitium in #3689
- [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by @cadedaniel in #3783
- [Hardware][Intel] Add CPU inference backend by @bigPYJ1151 in #3634
- [HotFix] [CI/Build] Minor fix for CPU backend CI by @bigPYJ1151 in #3787
- [Frontend][Bugfix] allow using the default middleware with a root path by @A-Mahla in #3788
- [Doc] Fix vLLMEngine Doc Page by @ywang96 in #3791
- [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by @youkaichao in #3801
- Fix crash when try torch.cuda.set_device in worker by @leiwen83 in #3770
- [Bugfix] Add `__init__.py` files for `vllm/core/block/` and `vllm/spec_decode/` by @mgoin in #3798
- [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by @youkaichao in #3803
- [Speculative decoding] Adding configuration object for speculative decoding by @cadedaniel in #3706
- [BugFix] Use different mechanism to get vllm version in `is_cpu()` by @njhill in #3804
- [Doc] Update README.md by @robertgshaw2-neuralmagic in #3806
- [Doc] Update contribution guidelines for better onboarding by @michaelfeil in #3819
- [3/N] Refactor scheduler for chunked prefill scheduling by @rkooo567 in #3550
- Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) by @AdrianAbeyta in #3290
- [Misc] Publish 3rd meetup slides by @WoosukKwon in #3835
- Fixes the argument for local_tokenizer_group by @sighingnow in #3754
- [Core] Enable hf_transfer by default if available by @michaelfeil in #3817
- [Bugfix] Add kv_scale input parameter to CPU backend by @WoosukKwon in #3840
- [Core] [Frontend] Make detokenization optional by @mgerstgrasser in #3749
- [Bugfix] Fix args in benchmark_serving by @CatherineSue in #3836
- [Benchmark] Refactor sample_requests in benchmark_throughput by @gty111 in #3613
- [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 by @youkaichao in #3805
- [Hardware][CPU] Update cpu torch to match default of 2.2.1 by @mgoin in #3854
- [Model] Cohere CommandR+ by @saurabhdash2512 in #3829
- [Core] improve robustness of pynccl by @youkaichao in #3860
- [Doc]Add asynchronous engine arguments to documentation. by @SeanGallen in #3810
- [CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels by @youkaichao in #3859
- [Misc] Add pytest marker to opt-out of global test cleanup by @cadedaniel in #3863
- [Misc] Fix linter issues in examples/fp8/quantizer/quantize.py by @cadedaniel in #3864
- [Bugfix] Fixing requirements.txt by @noamgat in #3865
- [Misc] Define common requirements by @WoosukKwon in #3841
- Add option to completion API to truncate prompt tokens by @tdoublep in #3144
- [Chunked Prefill][4/n] Chunked prefill scheduler. by @rkooo567 in #3853
- [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism by @Isotr0py in #3869
- [CI/Benchmark] add more iteration and use multiple percentiles for robust latency benchmark by @youkaichao in #3889
- [Core] enable out-of-tree model register by @youkaichao in #3871
- [WIP][Core] latency optimization by @youkaichao in #3890
- [Bugfix] Fix Llava inference with Tensor Parallelism. by @Isotr0py in #3883
- [Model] add minicpm by @SUDA-HLT-ywfang in #3893
- [Bugfix] Added Command-R GPTQ support by @egortolmachev in #3849
- [Bugfix] Enable Proper `attention_bias` Usage in Llama Model Configuration by @Ki6an in #3767
- [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations by @mawong-amd in #3782
- [BugFix][Model] Fix commandr RoPE max_position_embeddings by @esmeetu in #3919
- [Core] separate distributed_init from worker by @youkaichao in #3904
- [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" by @cadedaniel in #3837
- [Bugfix] Fix KeyError on loading GPT-NeoX by @jsato8094 in #3925
- [ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm by @jpvillam-amd in #3643
- [Misc] Avoid loading incorrect LoRA config by @jeejeelee in #3777
- [Benchmark] Add cpu options to bench scripts by @PZD-CHINA in #3915
- [Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable by @zhaotyer in #3955
- [Bugfix] Fix logits processor when prompt_logprobs is not None by @huyiwen in #3899
- [Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty by @tjohnson31415 in #3876
- [Bugfix][ROCm] Add numba to Dockerfile.rocm by @WoosukKwon in #3962
- [Model][AMD] ROCm support for 256 head dims for Gemma by @jamestwhedbee in #3972
- [Doc] Add doc to state our model support policy by @youkaichao in #3948
- [Bugfix] Remove key sorting for `guided_json` parameter in OpenAi compatible Server by @dmarasco in #3945
- [Doc] Fix getting stared to use publicly available model by @fpaupier in #3963
- [Bugfix] handle hf_config with architectures == None by @tjohnson31415 in #3982
- [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators by @youkaichao in #3950
- [Core][5/N] Fully working chunked prefill e2e by @rkooo567 in #3884
- [Core][Model] Use torch.compile to accelerate layernorm in commandr by @youkaichao in #3985
- [Test] Add xformer and flash attn tests by @rkooo567 in #3961
- [Misc] refactor ops and cache_ops layer by @jikunshang in #3913
- [Doc][Installation] delete python setup.py develop by @youkaichao in #3989
- [Ke...
v0.4.0.post1, restore sm70/75 support
Highlight
v0.4.0 lacks support for sm70/75. This release is a hotfix that restores it.
What's Changed
- [Kernel] Layernorm performance optimization by @mawong-amd in #3662
- [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by @youkaichao in #3746
- [CI/Build] Make Marlin Tests Green by @robertgshaw2-neuralmagic in #3753
- [Misc] Minor fixes in requirements.txt by @WoosukKwon in #3769
- [Misc] Some minor simplifications to detokenization logic by @njhill in #3670
- [Misc] Fix Benchmark TTFT Calculation for Chat Completions by @ywang96 in #3768
- [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by @cadedaniel in #3250
- [Misc] Add support for new autogptq checkpoint_format by @Qubitium in #3689
- [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by @cadedaniel in #3783
- [Hardware][Intel] Add CPU inference backend by @bigPYJ1151 in #3634
- [HotFix] [CI/Build] Minor fix for CPU backend CI by @bigPYJ1151 in #3787
- [Frontend][Bugfix] allow using the default middleware with a root path by @A-Mahla in #3788
- [Doc] Fix vLLMEngine Doc Page by @ywang96 in #3791
- [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by @youkaichao in #3801
- Fix crash when try torch.cuda.set_device in worker by @leiwen83 in #3770
- [Bugfix] Add `__init__.py` files for `vllm/core/block/` and `vllm/spec_decode/` by @mgoin in #3798
- [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by @youkaichao in #3803
New Contributors
- @mawong-amd made their first contribution in #3662
- @Qubitium made their first contribution in #3689
- @bigPYJ1151 made their first contribution in #3634
- @A-Mahla made their first contribution in #3788
Full Changelog: v0.4.0...v0.4.0.post1
v0.4.0
Major changes
Models
- New models: Command-R (#3433), Qwen2 MoE (#3346), DBRX (#3660), XVerse (#3610), Jais (#3183).
- New vision language model: LLaVA (#3042)
Production features
- Automatic prefix caching (#2762, #3703), which lets long system prompts be cached automatically across requests. Use the flag `--enable-prefix-caching` to turn it on.
- Support `json_object` in OpenAI server for arbitrary JSON, a `--use-delay` flag to improve time to first token across many requests, and `min_tokens` for EOS suppression.
- Progress in chunked prefill scheduler (#3236, #3538) and speculative decoding (#3103).
- Custom all reduce kernel has been re-enabled after more robustness fixes.
- Replaced cupy dependency due to its bugs.
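To make the automatic prefix caching item above more concrete, here is a minimal sketch of one common way to cache shared prompt prefixes: tokens are grouped into fixed-size blocks, and each block is keyed by a hash that also covers every block before it. This is an illustration under stated assumptions, not vLLM's implementation; `BLOCK_SIZE`, `block_keys`, and `PrefixCache` are hypothetical names.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (illustrative value)


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one key per full block; each key covers the whole prefix up to that block."""
    keys = []
    prev = ""
    full = len(token_ids) - len(token_ids) % block_size
    for i in range(0, full, block_size):
        block = token_ids[i:i + block_size]
        payload = prev + "|" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(prev)
    return keys


class PrefixCache:
    def __init__(self):
        self.blocks = {}

    def lookup_and_insert(self, token_ids):
        """Return how many leading tokens were already cached, then cache the rest."""
        keys = block_keys(token_ids)
        hits = 0
        for k in keys:
            if k not in self.blocks:
                break
            hits += 1
        for k in keys[hits:]:
            self.blocks[k] = True  # stand-in for the block's KV tensors
        return hits * BLOCK_SIZE


system_prompt = list(range(8))  # two full blocks shared by both requests
cache = PrefixCache()
first = cache.lookup_and_insert(system_prompt + [100, 101, 102, 103])
second = cache.lookup_and_insert(system_prompt + [200, 201, 202, 203])
print(first, second)
```

Chaining the hash through earlier blocks means a block is reused only when the entire prefix before it matches, which is what makes sharing a long system prompt across requests safe.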
Hardware
- Improved Neuron support for AWS Inferentia.
- CMake based build system for extensibility.
Ecosystem
What's Changed
- allow user chose log level by --log-level instead of fixed 'info'. by @AllenDou in #3109
- Reorder kv dtype check to avoid nvcc not found error on AMD platform by @cloudhan in #3104
- Add Automatic Prefix Caching by @SageMoore in #2762
- Add vLLM version info to logs and openai API server by @jasonacox in #3161
- [FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark by @zhuohan123 in #3158
- Make it easy to profile workers with nsight by @pcmoritz in #3162
- [DOC] add setup document to support neuron backend by @liangfu in #2777
- [Minor Fix] Remove unused code in benchmark_prefix_caching.py by @gty111 in #3171
- Add document for vllm paged attention kernel. by @pian13131 in #2978
- enable --gpu-memory-utilization in benchmark_throughput.py by @AllenDou in #3175
- [Minor fix] The domain dns.google may cause a socket.gaierror exception by @ttbachyinsda in #3176
- Push logprob generation to LLMEngine by @Yard1 in #3065
- Add health check, make async Engine more robust by @Yard1 in #3015
- Fix the openai benchmarking requests to work with latest OpenAI apis by @wangchen615 in #2992
- [ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs by @hongxiayang in #3123
- Store `eos_token_id` in `Sequence` for easy access by @njhill in #3166
- [Fix] Avoid pickling entire LLMEngine for Ray workers by @njhill in #3207
- [Tests] Add block manager and scheduler tests by @rkooo567 in #3108
- [Testing] Fix core tests by @cadedaniel in #3224
- A simple addition of `dynamic_ncols=True` by @chujiezheng in #3242
- Add GPTQ support for Gemma by @TechxGenus in #3200
- Update requirements-dev.txt to include package for benchmarking scripts. by @wangchen615 in #3181
- Separate attention backends by @WoosukKwon in #3005
- Measure model memory usage by @mgoin in #3120
- Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) by @jacobthebanana in #3263
- Fix auto prefix bug by @ElizaWszola in #3239
- Connect engine healthcheck to openai server by @njhill in #3260
- Feature add lora support for Qwen2 by @whyiug in #3177
- [Minor Fix] Fix comments in benchmark_serving by @gty111 in #3252
- [Docs] Fix Unmocked Imports by @ywang96 in #3275
- [FIX] Make `flash_attn` optional by @WoosukKwon in #3269
- Move model filelocks from `/tmp/` to `~/.cache/vllm/locks/` dir by @mgoin in #3241
- [FIX] Fix prefix test error on main by @zhuohan123 in #3286
- [Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling by @cadedaniel in #3103
- Enhance lora tests with more layer and rank variations by @tterrysun in #3243
- [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA by @dllehr-amd in #3262
- [BugFix] Fix get tokenizer when using ray by @esmeetu in #3301
- [Fix] Fix best_of behavior when n=1 by @njhill in #3298
- Re-enable the 80 char line width limit by @zhuohan123 in #3305
- [docs] Add LoRA support information for models by @pcmoritz in #3299
- Add distributed model executor abstraction by @zhuohan123 in #3191
- [ROCm] Fix warp and lane calculation in blockReduceSum by @kliuae in #3321
- Support Mistral Model Inference with transformers-neuronx by @DAIZHENWEI in #3153
- docs: Add BentoML deployment doc by @Sherlock113 in #3336
- Fixes #1556 double free by @br3no in #3347
- Add kernel for GeGLU with approximate GELU by @WoosukKwon in #3337
- [Fix] fix quantization arg when using marlin by @DreamTeamWangbowen in #3319
- add hf_transfer to requirements.txt by @RonanKMcGovern in #3031
- fix bias in if, ambiguous by @hliuca in #3259
- [Minor Fix] Use cupy-cuda11x in CUDA 11.8 build by @chenxu2048 in #3256
- Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. by @orsharir in #3350
- Add batched RoPE kernel by @tterrysun in #3095
- Fix lint by @Yard1 in #3388
- [FIX] Simpler fix for async engine running on ray by @zhuohan123 in #3371
- [Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion by @simon-mo in #3383
- allow user to chose which vllm's merics to display in grafana by @AllenDou in #3393
- [Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 by @youkaichao in #3389
- Install `flash_attn` in Docker image by @tdoublep in #3396
- Add args for mTLS support by @declark1 in #3410
- [issue templates] add some issue templates by @youkaichao in #3412
- Fix assertion failure in Qwen 1.5 with prefix caching enabled by @chenxu2048 in #3373
- fix marlin config repr by @qeternity in #3414
- Feature: dynamic shared mem moe_align_block_size_kernel by @akhoroshev in #3376
- [Misc] add HOST_IP env var by @youkaichao in #3419
- Add chat templates for Falcon by @Dinghow in #3420
- Add chat templates for ChatGLM by @Dinghow in #3418
- Fix `dist.broadcast` stall without group argument by @GindaChen in #3408
- Fix tie_word_embeddings for Qwen2. by @fyabc in #3344
- [Fix] Add args for mTLS support by @declark1 in #3430
- Fixes the misuse/mixuse of time.time()/time.monotonic() by @sighingnow in #3220
- [Misc] add error message in non linux platform by @youkaichao in #3438
- Fix issue templates by @hmellor in #3436
- fix document error for value and v_vec illustration by @laneeeee in #3421
- Asynchronous tokenization by @Yard1 in #2879
- Removed Extraneous Print Message From OAI Server by @robertgshaw2-neuralmagic in #3440
- [Misc] PR templates by @youkaichao in #3413
- Fixes the incorrect argument in the prefix-prefill test cases by @sighingnow in #3246
- Replace `lstrip()` with `removeprefix()` to fix Ruff linter warning by @ronensc in #2958
- Fix Baichuan chat template by @Dinghow in #3340
- ...
v0.3.3
Major changes
- StarCoder2 support
- Performance optimization and LoRA support for Gemma
- 2/3/8-bit GPTQ support
- Integrate Marlin Kernels for Int4 GPTQ inference
- Performance optimization for MoE kernel
- [Experimental] AWS Inferentia2 support
- [Experimental] Structured output (JSON, Regex) in OpenAI Server
What's Changed
- Update a comment in `benchmark_serving.py` by @ronensc in #2934
- Added early stopping to completion APIs by @Maxusmusti in #2939
- Migrate MistralForCausalLM to LlamaForCausalLM by @esmeetu in #2868
- Use Llama RMSNorm for Gemma by @WoosukKwon in #2974
- chore(vllm): codespell for spell checking by @mspronesti in #2820
- Optimize GeGLU layer in Gemma by @WoosukKwon in #2975
- [FIX] Fix issue #2904 by @44670 in #2983
- Remove Flash Attention in test env by @WoosukKwon in #2982
- Include tokens from prompt phase in `counter_generation_tokens` by @ronensc in #2802
- Fix nvcc not found in vllm-openai image by @zhaoyang-star in #2781
- [Fix] Fix assertion on Mistral YaRN model len by @WoosukKwon in #2984
- Port metrics from `aioprometheus` to `prometheus_client` by @hmellor in #2730
- Add LogProbs for Chat Completions in OpenAI by @jlcmoore in #2918
- Optimized fused MoE Kernel, take 2 by @pcmoritz in #2979
- [Minor] Remove gather_cached_kv kernel by @WoosukKwon in #3043
- [Minor] Remove unused config file by @esmeetu in #3039
- Fix using CuPy for eager mode by @esmeetu in #3037
- Fix stablelm by @esmeetu in #3038
- Support Orion model by @dachengai in #2539
- fix `get_ip` error in pure ipv6 environment by @Jingru in #2931
- [Minor] Fix type annotation in fused moe by @WoosukKwon in #3045
- Support logit bias for OpenAI API by @dylanwhawk in #3027
- [Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM by @WoosukKwon in #3046
- Enables GQA support in the prefix prefill kernels by @sighingnow in #3007
- multi-lora documentation fix by @ElefHead in #3064
- Restrict prometheus_client >= 0.18.0 to prevent errors when importing pkgs by @AllenDou in #3070
- Support inference with transformers-neuronx by @liangfu in #2569
- Add LoRA support for Gemma by @WoosukKwon in #3050
- Add Support for 2/3/8-bit GPTQ Quantization Models by @chu-tianxiang in #2330
- Fix: `AttributeError` in OpenAI-compatible server by @jaywonchung in #3018
- add cache_config's info to prometheus metrics. by @AllenDou in #3100
- Support starcoder2 architecture by @sh0416 in #3089
- Fix building from source on WSL by @aliencaocao in #3112
- [Fix] Don't deep-copy LogitsProcessors when copying SamplingParams by @njhill in #3099
- Add guided decoding for OpenAI API server by @felixzhu555 in #2819
- Fix: Output text is always truncated in some models by @HyperdriveHustle in #3016
- Remove exclude_unset in streaming response by @sh0416 in #3143
- docs: Add tutorial on deploying vLLM model with KServe by @terrytangyuan in #2586
- fix relative import path of protocol.py by @Huarong in #3134
- Integrate Marlin Kernels for Int4 GPTQ inference by @robertgshaw2-neuralmagic in #2497
- Bump up to v0.3.3 by @WoosukKwon in #3129
New Contributors
- @Maxusmusti made their first contribution in #2939
- @44670 made their first contribution in #2983
- @jlcmoore made their first contribution in #2918
- @dachengai made their first contribution in #2539
- @dylanwhawk made their first contribution in #3027
- @ElefHead made their first contribution in #3064
- @AllenDou made their first contribution in #3070
- @jaywonchung made their first contribution in #3018
- @sh0416 made their first contribution in #3089
- @aliencaocao made their first contribution in #3112
- @felixzhu555 made their first contribution in #2819
- @HyperdriveHustle made their first contribution in #3016
- @terrytangyuan made their first contribution in #2586
- @Huarong made their first contribution in #3134
Full Changelog: v0.3.2...v0.3.3
v0.3.2
Major Changes
This version adds support for the OLMo and Gemma models, as well as the `seed` parameter.
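The value of a per-request seed is that sampling becomes repeatable for that request without touching global random state. The sketch below shows the idea with Python's standard library only. It is not vLLM's sampler, and `sample_tokens` and its arguments are hypothetical names used for illustration.

```python
import random


def sample_tokens(vocab, weights, num_tokens, seed=None):
    """Sample `num_tokens` items; a per-request seed makes the draw repeatable."""
    # A private generator per request, so requests do not disturb each other.
    rng = random.Random(seed)
    return rng.choices(vocab, weights=weights, k=num_tokens)


vocab = ["a", "b", "c", "d"]
weights = [0.4, 0.3, 0.2, 0.1]

run1 = sample_tokens(vocab, weights, 8, seed=42)
run2 = sample_tokens(vocab, weights, 8, seed=42)  # same seed, same output
unseeded = sample_tokens(vocab, weights, 8)       # no seed: varies between runs
print(run1 == run2)
```

Using a separate generator per request, rather than reseeding a shared one, keeps concurrent requests from affecting each other's results.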
What's Changed
- Defensively copy `sampling_params` by @njhill in #2881
- multi-LoRA as extra models in OpenAI server by @jvmncs in #2775
- Add code-revision config argument for Hugging Face Hub by @mbm-ai in #2892
- [Minor] Small fix to make distributed init logic in worker looks cleaner by @zhuohan123 in #2905
- [Test] Add basic correctness test by @zhuohan123 in #2908
- Support OLMo models. by @Isotr0py in #2832
- Add warning to prevent changes to benchmark api server by @simon-mo in #2858
- Fix `vllm:prompt_tokens_total` metric calculation by @ronensc in #2869
- [ROCm] include gfx908 as supported by @jamestwhedbee in #2792
- [FIX] Fix beam search test by @zhuohan123 in #2930
- Make vLLM logging formatting optional by @Yard1 in #2877
- Add metrics to RequestOutput by @Yard1 in #2876
- Add Gemma model by @xiangxu-google in #2964
- Upgrade transformers to v4.38.0 by @WoosukKwon in #2965
- [FIX] Add Gemma model to the doc by @zhuohan123 in #2966
- [ROCm] Upgrade transformers to v4.38.0 by @WoosukKwon in #2967
- Support per-request seed by @njhill in #2514
- Bump up version to v0.3.2 by @zhuohan123 in #2968
New Contributors
- @jvmncs made their first contribution in #2775
- @mbm-ai made their first contribution in #2892
- @Isotr0py made their first contribution in #2832
- @jamestwhedbee made their first contribution in #2792
Full Changelog: v0.3.1...v0.3.2
v0.3.1
Major Changes
This version fixes the following major bugs:
- Memory leak with distributed execution (solved by using CuPy for collective communication).
- Support for Python 3.8.
This release also includes many smaller bug fixes, listed below.
What's Changed
- Fixes assertion failure in prefix caching: the lora index mapping should respect `prefix_len`. by @sighingnow in #2688
- fix some bugs about parameter description by @zspo in #2689
- [Minor] Fix test_cache.py CI test failure by @pcmoritz in #2684
- Add unit test for Mixtral MoE layer by @pcmoritz in #2677
- Refactor Prometheus and Add Request Level Metrics by @rib-2 in #2316
- Add Internlm2 by @Leymore in #2666
- Fix compile error when using rocm by @zhaoyang-star in #2648
- fix python 3.8 syntax by @simon-mo in #2716
- Update README for meetup slides by @simon-mo in #2718
- Use revision when downloading the quantization config file by @Pernekhan in #2697
- remove hardcoded `device="cuda"` to support more device by @jikunshang in #2503
- fix length_penalty default value to 1.0 by @zspo in #2667
- Add one example to run batch inference distributed on Ray by @c21 in #2696
- docs: update langchain serving instructions by @mspronesti in #2736
- Set&Get llm internal tokenizer instead of the TokenizerGroup by @dancingpipi in #2741
- Remove eos tokens from output by default by @zcnrex in #2611
- add requirement: triton >= 2.1.0 by @whyiug in #2746
- [Minor] Fix benchmark_latency by @WoosukKwon in #2765
- [ROCm] Fix some kernels failed unit tests by @hongxiayang in #2498
- Set local logging level via env variable by @gardberg in #2774
- [ROCm] Fixup arch checks for ROCM by @dllehr-amd in #2627
- Add fused top-K softmax kernel for MoE by @WoosukKwon in #2769
- fix issue when model parameter is not a model id but path of the model. by @liuyhwangyh in #2489
- [Minor] More fix of test_cache.py CI test failure by @LiuXiaoxuanPKU in #2750
- [ROCm] Fix build problem resulted from previous commit related to FP8 kv-cache support by @hongxiayang in #2790
- Add documentation on how to do incremental builds by @pcmoritz in #2796
- [Ray] Integration compiled DAG off by default by @rkooo567 in #2471
- Disable custom all reduce by default by @WoosukKwon in #2808
- [ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention by @hongxiayang in #2768
- Add documentation section about LoRA by @pcmoritz in #2834
- Refactor 2 awq gemm kernels into m16nXk32 by @zcnrex in #2723
- Serving Benchmark Refactoring by @ywang96 in #2433
- [CI] Ensure documentation build is checked in CI by @simon-mo in #2842
- Refactor llama family models by @esmeetu in #2637
- Revert "Refactor llama family models" by @pcmoritz in #2851
- Use CuPy for CUDA graphs by @WoosukKwon in #2811
- Remove Yi model definition, please use `LlamaForCausalLM` instead by @pcmoritz in #2854
- Add LoRA support for Mixtral by @tterrysun in #2831
- Migrate InternLMForCausalLM to LlamaForCausalLM by @pcmoritz in #2860
- Fix internlm after #2860 by @pcmoritz in #2861
- [Fix] Fix memory profiling when GPU is used by multiple processes by @WoosukKwon in #2863
- Fix docker python version by @NikolaBorisov in #2845
- Migrate AquilaForCausalLM to LlamaForCausalLM by @esmeetu in #2867
- Don't use cupy NCCL for AMD backends by @WoosukKwon in #2855
- Align LoRA code between Mistral and Mixtral (fixes #2875) by @pcmoritz in #2880
- [BugFix] Fix GC bug for `LLM` class by @WoosukKwon in #2882
- Fix decilm.py by @pcmoritz in #2883
- [ROCm] Dockerfile fix for flash-attention build by @hongxiayang in #2885
- Prefix Caching- fix t4 triton error by @caoshiyi in #2517
- Bump up to v0.3.1 by @WoosukKwon in #2887
New Contributors
- @sighingnow made their first contribution in #2688
- @rib-2 made their first contribution in #2316
- @Leymore made their first contribution in #2666
- @Pernekhan made their first contribution in #2697
- @jikunshang made their first contribution in #2503
- @c21 made their first contribution in #2696
- @zcnrex made their first contribution in #2611
- @whyiug made their first contribution in #2746
- @gardberg made their first contribution in #2774
- @dllehr-amd made their first contribution in #2627
- @rkooo567 made their first contribution in #2471
- @ywang96 made their first contribution in #2433
- @tterrysun made their first contribution in #2831
Full Changelog: v0.3.0...v0.3.1
v0.3.0
Major Changes
- Experimental multi-LoRA support
- Experimental prefix caching support
- FP8 KV Cache support
- Optimized MoE performance and Deepseek MoE support
- CI tested PRs
- Support batch completion in server
What's Changed
- Miner fix of type hint by @beginlner in #2340
- Build docker image with shared objects from "build" step by @payoto in #2237
- Ensure metrics are logged regardless of requests by @ichernev in #2347
- Changed scheduler to use deques instead of lists by @NadavShmayo in #2290
- Fix eager mode performance by @WoosukKwon in #2377
- [Minor] Remove unused code in attention by @WoosukKwon in #2384
- Add baichuan chat template jinja file by @EvilPsyCHo in #2390
- [Speculative decoding 1/9] Optimized rejection sampler by @cadedaniel in #2336
- Fix ipv4 ipv6 dualstack by @yunfeng-scale in #2408
- [Minor] Rename phi_1_5 to phi by @WoosukKwon in #2385
- [DOC] Add additional comments for LLMEngine and AsyncLLMEngine by @litone01 in #1011
- [Minor] Fix the format in quick start guide related to Model Scope by @zhuohan123 in #2425
- Add gradio chatbot for openai webserver by @arkohut in #2307
- [BUG] RuntimeError: deque mutated during iteration in abort_seq_group by @chenxu2048 in #2371
- Allow setting fastapi root_path argument by @chiragjn in #2341
- Address Phi modeling update 2 by @huiwy in #2428
- Update a more user-friendly error message, offering more considerate advice for beginners, when using V100 GPU #1901 by @chuanzhubin in #2374
- Update quickstart.rst with small clarifying change (fix typo) by @nautsimon in #2369
- Aligning top_p and top_k Sampling by @chenxu2048 in #1885
- [Minor] Fix err msg by @WoosukKwon in #2431
- [Minor] Optimize cuda graph memory usage by @esmeetu in #2437
- [CI] Add Buildkite by @simon-mo in #2355
- Announce the second vLLM meetup by @WoosukKwon in #2444
- Allow buildkite to retry build on agent lost by @simon-mo in #2446
- Fix weight loading for GQA with TP by @zhangch9 in #2379
- CI: make sure benchmark script exit on error by @simon-mo in #2449
- ci: retry on build failure as well by @simon-mo in #2457
- Add StableLM3B model by @ita9naiwa in #2372
- OpenAI refactoring by @FlorianJoncour in #2360
- [Experimental] Prefix Caching Support by @caoshiyi in #1669
- fix stablelm.py tensor-parallel-size bug by @YingchaoX in #2482
- Minor fix in prefill cache example by @JasonZhu1313 in #2494
- fix: fix some args desc by @zspo in #2487
- [Neuron] Add an option to build with neuron by @liangfu in #2065
- Don't download both safetensor and bin files. by @NikolaBorisov in #2480
- [BugFix] Fix abort_seq_group by @beginlner in #2463
- refactor completion api for readability by @simon-mo in #2499
- Support OpenAI API server in benchmark_serving.py by @hmellor in #2172
- Simplify broadcast logic for control messages by @zhuohan123 in #2501
- [Bugfix] fix load local safetensors model by @esmeetu in #2512
- Add benchmark serving to CI by @simon-mo in #2505
- Add group as an argument in broadcast ops by @GindaChen in #2522
- [Fix] Keep scheduler.running as deque by @njhill in #2523
- migrate pydantic from v1 to v2 by @joennlae in #2531
- [Speculative decoding 2/9] Multi-step worker for draft model by @cadedaniel in #2424
- Fix "Port could not be cast to integer value as " by @pcmoritz in #2545
- Add qwen2 by @JustinLin610 in #2495
- Fix progress bar and allow HTTPS in benchmark_serving.py by @hmellor in #2552
- Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py by @JasonZhu1313 in #2553
- [Feature] Simple API token authentication by @taisazero in #1106
- Add multi-LoRA support by @Yard1 in #1804
- lint: format all python file instead of just source code by @simon-mo in #2567
- [Bugfix] fix crash if max_tokens=None by @NikolaBorisov in #2570
- Added include_stop_str_in_output and length_penalty parameters to OpenAI API by @galatolofederico in #2562
- [Doc] Fix the syntax error in the doc of supported_models. by @keli-wen in #2584
- Support Batch Completion in Server by @simon-mo in #2529
- fix names and license by @JustinLin610 in #2589
- [Fix] Use a correct device when creating OptionalCUDAGuard by @sh1ng in #2583
- [ROCm] add support to ROCm 6.0 and MI300 by @hongxiayang in #2274
- Support for Stable LM 2 by @dakotamahan-stability in #2598
- Don't build punica kernels by default by @pcmoritz in #2605
- AWQ: Up to 2.66x higher throughput by @casper-hansen in #2566
- Use head_dim in config if exists by @xiangxu-google in #2622
- Custom all reduce kernels by @hanzhi713 in #2192
- [Minor] Fix warning on Ray dependencies by @WoosukKwon in #2630
- Speed up Punica compilation by @WoosukKwon in #2632
- Small async_llm_engine refactor by @andoorve in #2618
- Update Ray version requirements by @simon-mo in #2636
- Support FP8-E5M2 KV Cache by @zhaoyang-star in #2279
- Fix error when tp > 1 by @zhaoyang-star in #2644
- No repeated IPC open by @hanzhi713 in #2642
- ROCm: Allow setting compilation target by @rlrs in #2581
- DeepseekMoE support with Fused MoE kernel by @zwd003 in #2453
- Fused MOE for Mixtral by @pcmoritz in #2542
- Fix 'Actor methods cannot be called directly' when using --engine-use-ray by @HermitSun in #2664
- Add swap_blocks unit tests by @sh1ng in #2616
- Fix a small typo (tenosr -> tensor) by @pcmoritz in #2672
- [Minor] Fix false warning when TP=1 by @WoosukKwon in #2674
- Add quantized mixtral support by @WoosukKwon in #2673
- Bump up version to v0.3.0 by @zhuohan123 in #2656
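Several entries above concern the scheduler's queues: the move from lists to deques (#2290), the "deque mutated during iteration" error in abort_seq_group (#2371), and keeping scheduler.running as a deque (#2523). The sketch below shows, in generic Python, why removing an item from a deque while iterating over it raises that RuntimeError, and one common way around it. The function names and the use of plain strings as request IDs are assumptions made for illustration; this is not vLLM's scheduler code.

```python
from collections import deque


def abort_unsafe(running: deque, request_id: str) -> None:
    """Remove a request while iterating the same deque (hypothetical helper).

    The iterator notices the mutation on its next step and raises
    RuntimeError: deque mutated during iteration.
    """
    for group in running:
        if group == request_id:
            running.remove(group)


def abort_safe(running: deque, request_id: str) -> bool:
    """Remove a request by iterating over a snapshot of the deque.

    Returns True if the request was found and removed, False otherwise.
    """
    for group in list(running):  # the copy decouples iteration from mutation
        if group == request_id:
            running.remove(group)
            return True
    return False


def demo() -> list:
    """Run the safe variant on a small queue and return what is left."""
    running = deque(["req-a", "req-b", "req-c"])
    abort_safe(running, "req-b")
    return list(running)
```

Iterating over a snapshot costs one extra copy of the queue per abort; returning right after the removal would be another option when at most one entry can match.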
New Contributors
- @payoto made their first contribution in #2237
- @NadavShmayo made their first contribution in #2290
- @EvilPsyCHo made their first contribution in #2390
- @litone01 made their first contribution in #1011
- @arkohut made their first contribution in #2307
- @chiragjn made their first contribution in #2341
- @huiwy made their first contribution in #2428
- @chuanzhubin made their first contribution in #2374
- @nautsimon made their first contribution in #2369
- @zhangch9 made their first contribution in #2379
- @ita9naiwa made their first contribution in #2372
- @caoshiyi made their first contribution in https://gi...
v0.2.7
Major Changes
- Up to 70% throughput improvement for distributed inference by removing serialization/deserialization overheads
- Fix tensor parallelism support for Mixtral + GPTQ/AWQ
What's Changed
- Minor fix for gpu-memory-utilization description by @SuhongMoon in #2162
- [BugFix] Raise error when max_model_len is larger than KV cache size by @WoosukKwon in #2163
- [BugFix] Fix RoPE kernel on long sequences by @WoosukKwon in #2164
- Add SSL arguments to API servers by @hmellor in #2109
- typo fix by @oushu1zhangxiangxuan1 in #2166
- [ROCm] Fixes for GPTQ on ROCm by @kliuae in #2180
- Update Help Text for --gpu-memory-utilization Argument by @SuhongMoon in #2183
- [Minor] Add warning on CUDA graph memory usage by @WoosukKwon in #2182
- Added DeciLM-7b and DeciLM-7b-instruct by @avideci in #2062
- [BugFix] Fix weight loading for Mixtral with TP by @WoosukKwon in #2208
- Make _prepare_sample non blocking and pin memory of CPU input buffers by @hanzhi713 in #2207
- Remove Sampler copy stream by @Yard1 in #2209
- Fix a broken link by @ronensc in #2222
- Disable Ray usage stats collection by @WoosukKwon in #2206
- [BugFix] Fix recovery logic for sequence group by @WoosukKwon in #2186
- Update installation instructions to include CUDA 11.8 xFormers by @skt7 in #2246
- Add "About" Heading to README.md by @blueceiling in #2260
- [BUGFIX] Do not return ignored sentences twice in async llm engine by @zhuohan123 in #2258
- [BUGFIX] Fix API server test by @zhuohan123 in #2270
- [BUGFIX] Fix the path of test prompts by @zhuohan123 in #2273
- [BUGFIX] Fix communication test by @zhuohan123 in #2285
- Add support GPT-NeoX Models without attention biases by @dalgarak in #2301
- [FIX] Fix kernel bug by @jeejeelee in #1959
- fix typo and remove unused code by @esmeetu in #2305
- Enable CUDA graph for GPTQ & SqueezeLLM by @WoosukKwon in #2318
- Fix Gradio example: remove deprecated parameter concurrency_count by @ronensc in #2315
- Use NCCL instead of ray for control-plane communication to remove serialization overhead by @zhuohan123 in #2221
- Remove unused const TIMEOUT_TO_PREVENT_DEADLOCK by @ronensc in #2321
- [Minor] Revert the changes in test_cache by @WoosukKwon in #2335
- Bump up to v0.2.7 by @WoosukKwon in #2337
New Contributors
- @SuhongMoon made their first contribution in #2162
- @hmellor made their first contribution in #2109
- @oushu1zhangxiangxuan1 made their first contribution in #2166
- @kliuae made their first contribution in #2180
- @avideci made their first contribution in #2062
- @hanzhi713 made their first contribution in #2207
- @ronensc made their first contribution in #2222
- @skt7 made their first contribution in #2246
- @blueceiling made their first contribution in #2260
- @dalgarak made their first contribution in #2301
Full Changelog: v0.2.6...v0.2.7
v0.2.6
Major changes
- Fast model execution with CUDA/HIP graph
- W4A16 GPTQ support (thanks to @chu-tianxiang)
- Fix memory profiling with tensor parallelism
- Fix *.bin weight loading for Mixtral models
What's Changed
- Fix typing in generate function for AsyncLLMEngine & add toml to requirements-dev by @mezuzza in #2100
- Fix Dockerfile.rocm by @tjtanaa in #2101
- avoid multiple redefinition by @MitchellX in #1817
- Add a flag to include stop string in output text by @yunfeng-scale in #1976
- Add GPTQ support by @chu-tianxiang in #916
- [Docs] Add quantization support to docs by @WoosukKwon in #2135
- [ROCm] Temporarily remove GPTQ ROCm support by @WoosukKwon in #2138
- simplify loading weights logic by @esmeetu in #2133
- Optimize model execution with CUDA graph by @WoosukKwon in #1926
- [Minor] Delete Llama tokenizer warnings by @WoosukKwon in #2146
- Fix all-reduce memory usage by @WoosukKwon in #2151
- Pin PyTorch & xformers versions by @WoosukKwon in #2155
- Remove dependency on CuPy by @WoosukKwon in #2152
- [Docs] Add CUDA graph support to docs by @WoosukKwon in #2148
- Temporarily enforce eager mode for GPTQ models by @WoosukKwon in #2154
- [Minor] Add more detailed explanation on quantization argument by @WoosukKwon in #2145
- [Minor] Fix xformers version by @WoosukKwon in #2158
- [Minor] Add Phi 2 to supported models by @WoosukKwon in #2159
- Make sampler less blocking by @Yard1 in #1889
- [Minor] Fix a typo in .pt weight support by @WoosukKwon in #2160
- Disable CUDA graph for SqueezeLLM by @WoosukKwon in #2161
- Bump up to v0.2.6 by @WoosukKwon in #2157
New Contributors
- @mezuzza made their first contribution in #2100
- @MitchellX made their first contribution in #1817
Full Changelog: v0.2.5...v0.2.6