Releases · vllm-project/vllm

05 May 04:31

github-actions

v0.4.2

c7f2cf2

v0.4.2 Latest

Highlights

Features

Chunked prefill is ready for testing! It improves inter-token latency under high load by chunking prompt processing and prioritizing decode (#4580)
Speculative decoding functionalities: logprobs (#4378), ngram (#4237)
Support FlashInfer as attention backend (#4353)
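
The chunked prefill idea above can be illustrated with a toy scheduler. This is a minimal sketch, not vLLM's scheduler: it assumes a fixed per-step token budget, gives every decoding request one token first, and spends whatever budget is left on a chunk of a pending prompt. The names `schedule_step`, `run_until_prefilled`, and `token_budget` are hypothetical and chosen only for this illustration.

```python
def schedule_step(decoding, prefilling, token_budget):
    """Plan one engine step under a fixed token budget (illustrative only).

    decoding:   list of request ids that are generating (1 token each per step).
    prefilling: dict mapping request id -> prompt tokens still unprocessed.
    Returns (decode_ids, prefill_chunks); prefill_chunks maps id -> chunk size.
    """
    # Decodes are scheduled first so running requests keep emitting tokens.
    decode_ids = decoding[:token_budget]
    remaining = token_budget - len(decode_ids)

    # Leftover budget is spent on prompt chunks, in arrival order.
    prefill_chunks = {}
    for req_id, pending in prefilling.items():
        if remaining == 0:
            break
        chunk = min(pending, remaining)
        prefill_chunks[req_id] = chunk
        remaining -= chunk
    return decode_ids, prefill_chunks


def run_until_prefilled(decoding, prefilling, token_budget):
    """Repeat schedule_step until every prompt is processed; return the plans."""
    pending = dict(prefilling)  # copy so the caller's dict is untouched
    plans = []
    while pending:
        decode_ids, chunks = schedule_step(decoding, pending, token_budget)
        if not chunks:
            raise ValueError("token_budget leaves no room for prefill chunks")
        for req_id, size in chunks.items():
            pending[req_id] -= size
            if pending[req_id] == 0:
                del pending[req_id]
        plans.append((decode_ids, chunks))
    return plans
```

With two decoding requests, one 10-token prompt, and a budget of 6, the prompt is processed over three steps (4, 4, then 2 tokens) while both decodes advance on every step instead of waiting for the whole prompt.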

Models and Enhancements

Add support for Phi-3-mini (#4298, #4372, #4380)
Add more histogram metrics (#2764, #4523)
Full tensor parallelism for LoRA layers (#3524)
Expanding Marlin kernel to support all GPTQ models (#3922, #4466, #4533)

Dependency Upgrade

Upgrade to torch==2.3.0 (#4454)
Upgrade to tensorizer==2.9.0 (#4467)
Expansion of AMD test suite (#4267)

Progress and Dev Experience

Centralize and document all environment variables (#4548, #4574)
Progress towards fully typed codebase (#4337, #4427, #4555, #4450)
Progress towards pipeline parallelism (#4512, #4444, #4566)
Progress towards multiprocessing based executors (#4348, #4402, #4419)
Progress towards FP8 support (#4343, #4332, #4527)

What's Changed

[Core][Distributed] use existing torch.cuda.device context manager by @youkaichao in #4318
[Misc] Update ShareGPT Dataset Sampling in Serving Benchmark by @ywang96 in #4279
[Bugfix] Fix marlin kernel crash on H100 by @alexm-nm in #4218
[Doc] Add note for docker user by @youkaichao in #4340
[Misc] Use public API in benchmark_throughput by @zifeitong in #4300
[Model] Adds Phi-3 support by @caiom in #4298
[Core] Move ray_utils.py from engine to executor package by @njhill in #4347
[Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 by @Isotr0py in #4324
[CI/Build] Adding functionality to reset the node's GPUs before processing. by @Alexei-V-Ivanov-AMD in #4213
[Doc] README Phi-3 name fix. by @caiom in #4372
[Core]refactor aqlm quant ops by @jikunshang in #4351
[Mypy] Typing lora folder by @rkooo567 in #4337
[Misc] Optimize flash attention backend log by @esmeetu in #4368
[Core] Add shutdown() method to ExecutorBase by @njhill in #4349
[Core] Move function tracing setup to util function by @njhill in #4352
[ROCm][Hardware][AMD][Doc] Documentation update for ROCm by @hongxiayang in #4376
[Bugfix] Fix parameter name in get_tokenizer by @DarkLight1337 in #4107
[Frontend] Add --log-level option to api server by @normster in #4377
[CI] Disable non-lazy string operation on logging by @rkooo567 in #4326
[Core] Refactoring sampler and support prompt logprob for chunked prefill by @rkooo567 in #4309
[Misc][Refactor] Generalize linear_method to be quant_method by @comaniac in #4373
[Misc] add RFC issue template by @youkaichao in #4401
[Core] Introduce DistributedGPUExecutor abstract class by @njhill in #4348
[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales by @pcmoritz in #4343
[Frontend][Bugfix] Disallow extra fields in OpenAI API by @DarkLight1337 in #4355
[Misc] Fix logger format typo by @esmeetu in #4396
[ROCm][Hardware][AMD] Enable group query attention for triton FA by @hongxiayang in #4406
[Kernel] Full Tensor Parallelism for LoRA Layers by @FurtherAI in #3524
[Model] Phi-3 4k sliding window temp. fix by @caiom in #4380
[Bugfix][Core] Fix get decoding config from ray by @esmeetu in #4335
[Bugfix] Abort requests when the connection to /v1/completions is interrupted by @chestnut-Q in #4363
[BugFix] Fix min_tokens when eos_token_id is None by @njhill in #4389
✨ support local cache for models by @prashantgupta24 in #4374
[BugFix] Fix return type of executor execute_model methods by @njhill in #4402
[BugFix] Resolved Issues For LinearMethod --> QuantConfig by @robertgshaw2-neuralmagic in #4418
[Misc] fix typo in llm_engine init logging by @DefTruth in #4428
Add more Prometheus metrics by @ronensc in #2764
[CI] clean docker cache for neuron by @simon-mo in #4441
[mypy][5/N] Support all typing on model executor by @rkooo567 in #4427
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin by @robertgshaw2-neuralmagic in #3922
[CI] hotfix: soft fail neuron test by @simon-mo in #4458
[Core][Distributed] use cpu group to broadcast metadata in cpu by @youkaichao in #4444
[Misc] Upgrade to torch==2.3.0 by @mgoin in #4454
[Bugfix][Kernel] Fix compute_type for MoE kernel by @WoosukKwon in #4463
[Core]Refactor gptq_marlin ops by @jikunshang in #4466
[BugFix] fix num_lookahead_slots missing in async executor by @leiwen83 in #4165
[Doc] add visualization for multi-stage dockerfile by @prashantgupta24 in #4456
[Kernel] Support Fp8 Checkpoints (Dynamic + Static) by @robertgshaw2-neuralmagic in #4332
[Frontend] Support complex message content for chat completions endpoint by @fgreinacher in #3467
[Frontend] [Core] Tensorizer: support dynamic num_readers, update version by @alpayariyak in #4467
[Bugfix][Minor] Make ignore_eos effective by @bigPYJ1151 in #4468
fix_tokenizer_snapshot_download_bug by @kingljl in #4493
Unable to find Punica extension issue during source code installation by @kingljl in #4494
[Core] Centralize GPU Worker construction by @njhill in #4419
[Misc][Typo] type annotation fix by @HarryWu99 in #4495
[Misc] fix typo in block manager by @Juelianqvq in #4453
Allow user to define whitespace pattern for outlines by @robcaulk in #4305
[Misc]Add customized information for models by @jeejeelee in #4132
[Test] Add ignore_eos test by @rkooo567 in #4519
[Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. by @AnyISalIn in #4173
[Bugfix] Fix 307 Redirect for /metrics by @robertgshaw2-neuralmagic in #4523
[Doc] update(example model): for OpenAI compatible serving by @fpaupier in #4503
[Bugfix] Use random seed if seed is -1 by @sasha0552 in #4531
[CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation by @tjohnson31415 in #4534
[Speculative decoding] Add ngram prompt lookup decoding by @leiwen83 in #4237
[Core] Enable prefix caching with block manager v2 enabled by @leiwen83 in #4142
[Core] Add multiproc_worker_utils for multiprocessing-based workers by @njhill in #4357
[Kernel] Update fused_moe tuning script for FP8 by @pcmoritz in #4457
[Bugfix] Add validation for seed by @sasha0552 in #4529
[Bugfix][Core] Fix and refactor logging stats by @esmeetu in #4336
[Core][Distributed] fix pynccl del error by @youkaichao in #4508
[Misc] Remove Mixtral device="cuda" declarations by @pcmoritz in #4543
[Misc] Fix expert_ids shape in MoE by @WoosukKwon in #4517
[MISC] Rework logger to enable pythonic custom logging configuration to be provided by @tdg5 in #4273
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens i...

Contributors

markmc, pcmoritz, and 45 other contributors

24 Apr 02:28

github-actions

v0.4.1

468d761

v0.4.1

Highlights

Features

Support and enhance CommandR+ (#3829), MiniCPM (#3893), Meta Llama 3 (#4175, #4182), Mixtral 8x22B (#4073, #4002)
Support private model registration and update our model support policy (#3871, #3948)
Support PyTorch 2.2.1 and Triton 2.2.0 (#4061, #4079, #3805, #3904, #4271)
Add option for using LM Format Enforcer for guided decoding (#3868)
Add option to make tokenizer and detokenizer initialization optional (#3748)
Add option to load models using tensorizer (#3476)

Enhancements

vLLM is now mostly type checked by mypy (#3816, #4006, #4161, #4043)
Progress towards chunked prefill scheduler (#3550, #3853, #4280, #3884)
Progress towards speculative decoding (#3250, #3706, #3894)
Initial support for FP8 with dynamic per-tensor scaling (#4118)

Hardware

Intel CPU inference backend is added (#3993, #3634)
AMD backend is enhanced with Triton kernel and e4m3fn KV cache (#3643, #3290)

What's Changed

[Kernel] Layernorm performance optimization by @mawong-amd in #3662
[Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by @youkaichao in #3746
[CI/Build] Make Marlin Tests Green by @robertgshaw2-neuralmagic in #3753
[Misc] Minor fixes in requirements.txt by @WoosukKwon in #3769
[Misc] Some minor simplifications to detokenization logic by @njhill in #3670
[Misc] Fix Benchmark TTFT Calculation for Chat Completions by @ywang96 in #3768
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding by @cadedaniel in #3250
[Misc] Add support for new autogptq checkpoint_format by @Qubitium in #3689
[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by @cadedaniel in #3783
[Hardware][Intel] Add CPU inference backend by @bigPYJ1151 in #3634
[HotFix] [CI/Build] Minor fix for CPU backend CI by @bigPYJ1151 in #3787
[Frontend][Bugfix] allow using the default middleware with a root path by @A-Mahla in #3788
[Doc] Fix vLLMEngine Doc Page by @ywang96 in #3791
[CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by @youkaichao in #3801
Fix crash when try torch.cuda.set_device in worker by @leiwen83 in #3770
[Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ by @mgoin in #3798
[CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by @youkaichao in #3803
[Speculative decoding] Adding configuration object for speculative decoding by @cadedaniel in #3706
[BugFix] Use different mechanism to get vllm version in is_cpu() by @njhill in #3804
[Doc] Update README.md by @robertgshaw2-neuralmagic in #3806
[Doc] Update contribution guidelines for better onboarding by @michaelfeil in #3819
[3/N] Refactor scheduler for chunked prefill scheduling by @rkooo567 in #3550
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) by @AdrianAbeyta in #3290
[Misc] Publish 3rd meetup slides by @WoosukKwon in #3835
Fixes the argument for local_tokenizer_group by @sighingnow in #3754
[Core] Enable hf_transfer by default if available by @michaelfeil in #3817
[Bugfix] Add kv_scale input parameter to CPU backend by @WoosukKwon in #3840
[Core] [Frontend] Make detokenization optional by @mgerstgrasser in #3749
[Bugfix] Fix args in benchmark_serving by @CatherineSue in #3836
[Benchmark] Refactor sample_requests in benchmark_throughput by @gty111 in #3613
[Core] manage nccl via a pypi package & upgrade to pt 2.2.1 by @youkaichao in #3805
[Hardware][CPU] Update cpu torch to match default of 2.2.1 by @mgoin in #3854
[Model] Cohere CommandR+ by @saurabhdash2512 in #3829
[Core] improve robustness of pynccl by @youkaichao in #3860
[Doc]Add asynchronous engine arguments to documentation. by @SeanGallen in #3810
[CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels by @youkaichao in #3859
[Misc] Add pytest marker to opt-out of global test cleanup by @cadedaniel in #3863
[Misc] Fix linter issues in examples/fp8/quantizer/quantize.py by @cadedaniel in #3864
[Bugfix] Fixing requirements.txt by @noamgat in #3865
[Misc] Define common requirements by @WoosukKwon in #3841
Add option to completion API to truncate prompt tokens by @tdoublep in #3144
[Chunked Prefill][4/n] Chunked prefill scheduler. by @rkooo567 in #3853
[Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism by @Isotr0py in #3869
[CI/Benchmark] add more iteration and use multiple percentiles for robust latency benchmark by @youkaichao in #3889
[Core] enable out-of-tree model register by @youkaichao in #3871
[WIP][Core] latency optimization by @youkaichao in #3890
[Bugfix] Fix Llava inference with Tensor Parallelism. by @Isotr0py in #3883
[Model] add minicpm by @SUDA-HLT-ywfang in #3893
[Bugfix] Added Command-R GPTQ support by @egortolmachev in #3849
[Bugfix] Enable Proper attention_bias Usage in Llama Model Configuration by @Ki6an in #3767
[Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations by @mawong-amd in #3782
[BugFix][Model] Fix commandr RoPE max_position_embeddings by @esmeetu in #3919
[Core] separate distributed_init from worker by @youkaichao in #3904
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" by @cadedaniel in #3837
[Bugfix] Fix KeyError on loading GPT-NeoX by @jsato8094 in #3925
[ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm by @jpvillam-amd in #3643
[Misc] Avoid loading incorrect LoRA config by @jeejeelee in #3777
[Benchmark] Add cpu options to bench scripts by @PZD-CHINA in #3915
[Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable by @zhaotyer in #3955
[Bugfix] Fix logits processor when prompt_logprobs is not None by @huyiwen in #3899
[Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty by @tjohnson31415 in #3876
[Bugfix][ROCm] Add numba to Dockerfile.rocm by @WoosukKwon in #3962
[Model][AMD] ROCm support for 256 head dims for Gemma by @jamestwhedbee in #3972
[Doc] Add doc to state our model support policy by @youkaichao in #3948
[Bugfix] Remove key sorting for guided_json parameter in OpenAi compatible Server by @dmarasco in #3945
[Doc] Fix getting stared to use publicly available model by @fpaupier in #3963
[Bugfix] handle hf_config with architectures == None by @tjohnson31415 in #3982
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators by @youkaichao in #3950
[Core][5/N] Fully working chunked prefill e2e by @rkooo567 in #3884
[Core][Model] Use torch.compile to accelerate layernorm in commandr by @youkaichao in #3985
[Test] Add xformer and flash attn tests by @rkooo567 in #3961
[Misc] refactor ops and cache_ops layer by @jikunshang in #3913
[Doc][Installation] delete python setup.py develop by @youkaichao in #3989
[Ke...

Contributors

pcmoritz, Qubitium, and 68 other contributors

02 Apr 20:01

github-actions

v0.4.0.post1

a3c226e

v0.4.0.post1, restore sm70/75 support

Highlight

v0.4.0 lacks support for sm70/75. This release is a hotfix that restores it.

What's Changed

[Kernel] Layernorm performance optimization by @mawong-amd in #3662
[Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by @youkaichao in #3746
[CI/Build] Make Marlin Tests Green by @robertgshaw2-neuralmagic in #3753
[Misc] Minor fixes in requirements.txt by @WoosukKwon in #3769
[Misc] Some minor simplifications to detokenization logic by @njhill in #3670
[Misc] Fix Benchmark TTFT Calculation for Chat Completions by @ywang96 in #3768
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding by @cadedaniel in #3250
[Misc] Add support for new autogptq checkpoint_format by @Qubitium in #3689
[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by @cadedaniel in #3783
[Hardware][Intel] Add CPU inference backend by @bigPYJ1151 in #3634
[HotFix] [CI/Build] Minor fix for CPU backend CI by @bigPYJ1151 in #3787
[Frontend][Bugfix] allow using the default middleware with a root path by @A-Mahla in #3788
[Doc] Fix vLLMEngine Doc Page by @ywang96 in #3791
[CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by @youkaichao in #3801
Fix crash when try torch.cuda.set_device in worker by @leiwen83 in #3770
[Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ by @mgoin in #3798
[CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by @youkaichao in #3803

New Contributors

@mawong-amd made their first contribution in #3662
@Qubitium made their first contribution in #3689
@bigPYJ1151 made their first contribution in #3634
@A-Mahla made their first contribution in #3788

Full Changelog: v0.4.0...v0.4.0.post1

Contributors

Qubitium, cadedaniel, and 10 other contributors

30 Mar 01:54

github-actions

v0.4.0

51c31bc

v0.4.0

Major changes

Models

New models: Command-R (#3433), Qwen2 MoE (#3346), DBRX (#3660), XVerse (#3610), Jais (#3183).
New vision language model: LLaVA (#3042)

Production features

Automatic prefix caching (#2762, #3703), so that long system prompts are automatically cached across requests. Use the flag --enable-prefix-caching to turn it on.
Support json_object in the OpenAI server for arbitrary JSON, a --use-delay flag to improve time to first token across many requests, and min_tokens to suppress EOS until a minimum number of tokens has been generated.
Progress in chunked prefill scheduler (#3236, #3538), and speculative decoding (#3103).
Custom all reduce kernel has been re-enabled after more robustness fixes.
Replaced cupy dependency due to its bugs.
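
As a rough illustration of the automatic prefix caching item above, the following is a minimal sketch and not vLLM's implementation. It assumes prompts are split into fixed-size token blocks and that each block is keyed by a hash of every token up to and including that block, so two prompts sharing a system prompt map to the same leading keys. `BLOCK_SIZE`, `block_keys`, and `PrefixCache` are hypothetical names used only here.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (assumed value for illustration)


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one key per full block; each key covers the whole prefix so far."""
    keys = []
    digest = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % block_size  # ignore a partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest.update(",".join(map(str, block)).encode() + b";")
        keys.append(digest.copy().hexdigest())
    return keys


class PrefixCache:
    """Toy cache that only remembers which prefix blocks it has seen."""

    def __init__(self):
        self._blocks = set()

    def lookup_and_insert(self, token_ids):
        """Return how many leading tokens were already cached, then cache the prompt."""
        keys = block_keys(token_ids)
        hits = 0
        for key in keys:
            if key not in self._blocks:
                break
            hits += 1
        self._blocks.update(keys)
        return hits * BLOCK_SIZE
```

In this sketch a second request that starts with the same 8-token system prompt reports 8 cached tokens, while a prompt that differs in its first token reports none, because every later key depends on the earlier blocks.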

Hardware

Improved Neuron support for AWS Inferentia.
CMake-based build system for extensibility.

Ecosystem

Extensive serving benchmark refactoring (#3277)
Usage statistics collection (#2852)

What's Changed

allow user chose log level by --log-level instead of fixed 'info'. by @AllenDou in #3109
Reorder kv dtype check to avoid nvcc not found error on AMD platform by @cloudhan in #3104
Add Automatic Prefix Caching by @SageMoore in #2762
Add vLLM version info to logs and openai API server by @jasonacox in #3161
[FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark by @zhuohan123 in #3158
Make it easy to profile workers with nsight by @pcmoritz in #3162
[DOC] add setup document to support neuron backend by @liangfu in #2777
[Minor Fix] Remove unused code in benchmark_prefix_caching.py by @gty111 in #3171
Add document for vllm paged attention kernel. by @pian13131 in #2978
enable --gpu-memory-utilization in benchmark_throughput.py by @AllenDou in #3175
[Minor fix] The domain dns.google may cause a socket.gaierror exception by @ttbachyinsda in #3176
Push logprob generation to LLMEngine by @Yard1 in #3065
Add health check, make async Engine more robust by @Yard1 in #3015
Fix the openai benchmarking requests to work with latest OpenAI apis by @wangchen615 in #2992
[ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs by @hongxiayang in #3123
Store eos_token_id in Sequence for easy access by @njhill in #3166
[Fix] Avoid pickling entire LLMEngine for Ray workers by @njhill in #3207
[Tests] Add block manager and scheduler tests by @rkooo567 in #3108
[Testing] Fix core tests by @cadedaniel in #3224
A simple addition of dynamic_ncols=True by @chujiezheng in #3242
Add GPTQ support for Gemma by @TechxGenus in #3200
Update requirements-dev.txt to include package for benchmarking scripts. by @wangchen615 in #3181
Separate attention backends by @WoosukKwon in #3005
Measure model memory usage by @mgoin in #3120
Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) by @jacobthebanana in #3263
Fix auto prefix bug by @ElizaWszola in #3239
Connect engine healthcheck to openai server by @njhill in #3260
Feature add lora support for Qwen2 by @whyiug in #3177
[Minor Fix] Fix comments in benchmark_serving by @gty111 in #3252
[Docs] Fix Unmocked Imports by @ywang96 in #3275
[FIX] Make flash_attn optional by @WoosukKwon in #3269
Move model filelocks from /tmp/ to ~/.cache/vllm/locks/ dir by @mgoin in #3241
[FIX] Fix prefix test error on main by @zhuohan123 in #3286
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling by @cadedaniel in #3103
Enhance lora tests with more layer and rank variations by @tterrysun in #3243
[ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA by @dllehr-amd in #3262
[BugFix] Fix get tokenizer when using ray by @esmeetu in #3301
[Fix] Fix best_of behavior when n=1 by @njhill in #3298
Re-enable the 80 char line width limit by @zhuohan123 in #3305
[docs] Add LoRA support information for models by @pcmoritz in #3299
Add distributed model executor abstraction by @zhuohan123 in #3191
[ROCm] Fix warp and lane calculation in blockReduceSum by @kliuae in #3321
Support Mistral Model Inference with transformers-neuronx by @DAIZHENWEI in #3153
docs: Add BentoML deployment doc by @Sherlock113 in #3336
Fixes #1556 double free by @br3no in #3347
Add kernel for GeGLU with approximate GELU by @WoosukKwon in #3337
[Fix] fix quantization arg when using marlin by @DreamTeamWangbowen in #3319
add hf_transfer to requirements.txt by @RonanKMcGovern in #3031
fix bias in if, ambiguous by @hliuca in #3259
[Minor Fix] Use cupy-cuda11x in CUDA 11.8 build by @chenxu2048 in #3256
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. by @orsharir in #3350
Add batched RoPE kernel by @tterrysun in #3095
Fix lint by @Yard1 in #3388
[FIX] Simpler fix for async engine running on ray by @zhuohan123 in #3371
[Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion by @simon-mo in #3383
allow user to chose which vllm's merics to display in grafana by @AllenDou in #3393
[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 by @youkaichao in #3389
Install flash_attn in Docker image by @tdoublep in #3396
Add args for mTLS support by @declark1 in #3410
[issue templates] add some issue templates by @youkaichao in #3412
Fix assertion failure in Qwen 1.5 with prefix caching enabled by @chenxu2048 in #3373
fix marlin config repr by @qeternity in #3414
Feature: dynamic shared mem moe_align_block_size_kernel by @akhoroshev in #3376
[Misc] add HOST_IP env var by @youkaichao in #3419
Add chat templates for Falcon by @Dinghow in #3420
Add chat templates for ChatGLM by @Dinghow in #3418
Fix dist.broadcast stall without group argument by @GindaChen in #3408
Fix tie_word_embeddings for Qwen2. by @fyabc in #3344
[Fix] Add args for mTLS support by @declark1 in #3430
Fixes the misuse/mixuse of time.time()/time.monotonic() by @sighingnow in #3220
[Misc] add error message in non linux platform by @youkaichao in #3438
Fix issue templates by @hmellor in #3436
fix document error for value and v_vec illustration by @laneeeee in #3421
Asynchronous tokenization by @Yard1 in #2879
Removed Extraneous Print Message From OAI Server by @robertgshaw2-neuralmagic in #3440
[Misc] PR templates by @youkaichao in #3413
Fixes the incorrect argument in the prefix-prefill test cases by @sighingnow in #3246
Replace lstrip() with removeprefix() to fix Ruff linter warning by @ronensc in #2958
Fix Baichuan chat template by @Dinghow in #3340
...

Contributors

orsharir, pcmoritz, and 72 other contributors

01 Mar 20:58

github-actions

v0.3.3

82091b8

v0.3.3

Major changes

StarCoder2 support
Performance optimization and LoRA support for Gemma
2/3/8-bit GPTQ support
Integrate Marlin Kernels for Int4 GPTQ inference
Performance optimization for MoE kernel
[Experimental] AWS Inferentia2 support
[Experimental] Structured output (JSON, Regex) in OpenAI Server

What's Changed

Update a comment in benchmark_serving.py by @ronensc in #2934
Added early stopping to completion APIs by @Maxusmusti in #2939
Migrate MistralForCausalLM to LlamaForCausalLM by @esmeetu in #2868
Use Llama RMSNorm for Gemma by @WoosukKwon in #2974
chore(vllm): codespell for spell checking by @mspronesti in #2820
Optimize GeGLU layer in Gemma by @WoosukKwon in #2975
[FIX] Fix issue #2904 by @44670 in #2983
Remove Flash Attention in test env by @WoosukKwon in #2982
Include tokens from prompt phase in counter_generation_tokens by @ronensc in #2802
Fix nvcc not found in vllm-openai image by @zhaoyang-star in #2781
[Fix] Fix assertion on Mistral YaRN model len by @WoosukKwon in #2984
Port metrics from aioprometheus to prometheus_client by @hmellor in #2730
Add LogProbs for Chat Completions in OpenAI by @jlcmoore in #2918
Optimized fused MoE Kernel, take 2 by @pcmoritz in #2979
[Minor] Remove gather_cached_kv kernel by @WoosukKwon in #3043
[Minor] Remove unused config file by @esmeetu in #3039
Fix using CuPy for eager mode by @esmeetu in #3037
Fix stablelm by @esmeetu in #3038
Support Orion model by @dachengai in #2539
fix get_ip error in pure ipv6 environment by @Jingru in #2931
[Minor] Fix type annotation in fused moe by @WoosukKwon in #3045
Support logit bias for OpenAI API by @dylanwhawk in #3027
[Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM by @WoosukKwon in #3046
Enables GQA support in the prefix prefill kernels by @sighingnow in #3007
multi-lora documentation fix by @ElefHead in #3064
Restrict prometheus_client >= 0.18.0 to prevent errors when importing pkgs by @AllenDou in #3070
Support inference with transformers-neuronx by @liangfu in #2569
Add LoRA support for Gemma by @WoosukKwon in #3050
Add Support for 2/3/8-bit GPTQ Quantization Models by @chu-tianxiang in #2330
Fix: AttributeError in OpenAI-compatible server by @jaywonchung in #3018
add cache_config's info to prometheus metrics. by @AllenDou in #3100
Support starcoder2 architecture by @sh0416 in #3089
Fix building from source on WSL by @aliencaocao in #3112
[Fix] Don't deep-copy LogitsProcessors when copying SamplingParams by @njhill in #3099
Add guided decoding for OpenAI API server by @felixzhu555 in #2819
Fix: Output text is always truncated in some models by @HyperdriveHustle in #3016
Remove exclude_unset in streaming response by @sh0416 in #3143
docs: Add tutorial on deploying vLLM model with KServe by @terrytangyuan in #2586
fix relative import path of protocol.py by @Huarong in #3134
Integrate Marlin Kernels for Int4 GPTQ inference by @robertgshaw2-neuralmagic in #2497
Bump up to v0.3.3 by @WoosukKwon in #3129

New Contributors

@Maxusmusti made their first contribution in #2939
@44670 made their first contribution in #2983
@jlcmoore made their first contribution in #2918
@dachengai made their first contribution in #2539
@dylanwhawk made their first contribution in #3027
@ElefHead made their first contribution in #3064
@AllenDou made their first contribution in #3070
@jaywonchung made their first contribution in #3018
@sh0416 made their first contribution in #3089
@aliencaocao made their first contribution in #3112
@felixzhu555 made their first contribution in #2819
@HyperdriveHustle made their first contribution in #3016
@terrytangyuan made their first contribution in #2586
@Huarong made their first contribution in #3134

Full Changelog: v0.3.2...v0.3.3

Contributors

pcmoritz, liangfu, and 25 other contributors

21 Feb 19:50

github-actions

v0.3.2

8fbd84b

v0.3.2

Major Changes

This version adds support for the OLMo and Gemma models, as well as a per-request seed parameter.
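
What a per-request seed buys can be shown with a small sketch. This is not vLLM's sampler: it assumes each request owns a private random generator built from its seed, so its samples do not depend on other requests. `sample_tokens` and the weight table are hypothetical and used only for illustration.

```python
import random


def sample_tokens(weights, num_tokens, seed=None):
    """Sample token ids from a fixed weight table with a request-local generator."""
    rng = random.Random(seed)  # seed=None falls back to system entropy
    vocab = list(range(len(weights)))
    return [rng.choices(vocab, weights=weights, k=1)[0] for _ in range(num_tokens)]
```

Two calls with the same seed return the same token sequence, and calls without a seed stay independent of each other.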

What's Changed

Defensively copy sampling_params by @njhill in #2881
multi-LoRA as extra models in OpenAI server by @jvmncs in #2775
Add code-revision config argument for Hugging Face Hub by @mbm-ai in #2892
[Minor] Small fix to make distributed init logic in worker looks cleaner by @zhuohan123 in #2905
[Test] Add basic correctness test by @zhuohan123 in #2908
Support OLMo models. by @Isotr0py in #2832
Add warning to prevent changes to benchmark api server by @simon-mo in #2858
Fix vllm:prompt_tokens_total metric calculation by @ronensc in #2869
[ROCm] include gfx908 as supported by @jamestwhedbee in #2792
[FIX] Fix beam search test by @zhuohan123 in #2930
Make vLLM logging formatting optional by @Yard1 in #2877
Add metrics to RequestOutput by @Yard1 in #2876
Add Gemma model by @xiangxu-google in #2964
Upgrade transformers to v4.38.0 by @WoosukKwon in #2965
[FIX] Add Gemma model to the doc by @zhuohan123 in #2966
[ROCm] Upgrade transformers to v4.38.0 by @WoosukKwon in #2967
Support per-request seed by @njhill in #2514
Bump up version to v0.3.2 by @zhuohan123 in #2968

New Contributors

@jvmncs made their first contribution in #2775
@mbm-ai made their first contribution in #2892
@Isotr0py made their first contribution in #2832
@jamestwhedbee made their first contribution in #2792

Full Changelog: v0.3.1...v0.3.2

Contributors

jvmncs, Yard1, and 9 other contributors

16 Feb 23:06

github-actions

v0.3.1

5f08050

v0.3.1

Major Changes

This version fixes the following major bugs:

Memory leak with distributed execution (solved by using CuPy for collective communication).
Support for Python 3.8.

It also includes many smaller bug fixes, listed below.

What's Changed

Fixes assertion failure in prefix caching: the lora index mapping should respect prefix_len. by @sighingnow in #2688
fix some bugs about parameter description by @zspo in #2689
[Minor] Fix test_cache.py CI test failure by @pcmoritz in #2684
Add unit test for Mixtral MoE layer by @pcmoritz in #2677
Refactor Prometheus and Add Request Level Metrics by @rib-2 in #2316
Add Internlm2 by @Leymore in #2666
Fix compile error when using rocm by @zhaoyang-star in #2648
fix python 3.8 syntax by @simon-mo in #2716
Update README for meetup slides by @simon-mo in #2718
Use revision when downloading the quantization config file by @Pernekhan in #2697
remove hardcoded device="cuda" to support more device by @jikunshang in #2503
fix length_penalty default value to 1.0 by @zspo in #2667
Add one example to run batch inference distributed on Ray by @c21 in #2696
docs: update langchain serving instructions by @mspronesti in #2736
Set&Get llm internal tokenizer instead of the TokenizerGroup by @dancingpipi in #2741
Remove eos tokens from output by default by @zcnrex in #2611
add requirement: triton >= 2.1.0 by @whyiug in #2746
[Minor] Fix benchmark_latency by @WoosukKwon in #2765
[ROCm] Fix some kernels failed unit tests by @hongxiayang in #2498
Set local logging level via env variable by @gardberg in #2774
[ROCm] Fixup arch checks for ROCM by @dllehr-amd in #2627
Add fused top-K softmax kernel for MoE by @WoosukKwon in #2769
fix issue when model parameter is not a model id but path of the model. by @liuyhwangyh in #2489
[Minor] More fix of test_cache.py CI test failure by @LiuXiaoxuanPKU in #2750
[ROCm] Fix build problem resulted from previous commit related to FP8 kv-cache support by @hongxiayang in #2790
Add documentation on how to do incremental builds by @pcmoritz in #2796
[Ray] Integration compiled DAG off by default by @rkooo567 in #2471
Disable custom all reduce by default by @WoosukKwon in #2808
[ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention by @hongxiayang in #2768
Add documentation section about LoRA by @pcmoritz in #2834
Refactor 2 awq gemm kernels into m16nXk32 by @zcnrex in #2723
Serving Benchmark Refactoring by @ywang96 in #2433
[CI] Ensure documentation build is checked in CI by @simon-mo in #2842
Refactor llama family models by @esmeetu in #2637
Revert "Refactor llama family models" by @pcmoritz in #2851
Use CuPy for CUDA graphs by @WoosukKwon in #2811
Remove Yi model definition, please use LlamaForCausalLM instead by @pcmoritz in #2854
Add LoRA support for Mixtral by @tterrysun in #2831
Migrate InternLMForCausalLM to LlamaForCausalLM by @pcmoritz in #2860
Fix internlm after #2860 by @pcmoritz in #2861
[Fix] Fix memory profiling when GPU is used by multiple processes by @WoosukKwon in #2863
Fix docker python version by @NikolaBorisov in #2845
Migrate AquilaForCausalLM to LlamaForCausalLM by @esmeetu in #2867
Don't use cupy NCCL for AMD backends by @WoosukKwon in #2855
Align LoRA code between Mistral and Mixtral (fixes #2875) by @pcmoritz in #2880
[BugFix] Fix GC bug for LLM class by @WoosukKwon in #2882
Fix decilm.py by @pcmoritz in #2883
[ROCm] Dockerfile fix for flash-attention build by @hongxiayang in #2885
Prefix Caching- fix t4 triton error by @caoshiyi in #2517
Bump up to v0.3.1 by @WoosukKwon in #2887

New Contributors

@sighingnow made their first contribution in #2688
@rib-2 made their first contribution in #2316
@Leymore made their first contribution in #2666
@Pernekhan made their first contribution in #2697
@jikunshang made their first contribution in #2503
@c21 made their first contribution in #2696
@zcnrex made their first contribution in #2611
@whyiug made their first contribution in #2746
@gardberg made their first contribution in #2774
@dllehr-amd made their first contribution in #2627
@rkooo567 made their first contribution in #2471
@ywang96 made their first contribution in #2433
@tterrysun made their first contribution in #2831

Full Changelog: v0.3.0...v0.3.1

Contributors

pcmoritz, NikolaBorisov, and 24 other contributors

31 Jan 08:07

github-actions

v0.3.0

1af090b

v0.3.0

Major Changes

Experimental multi-LoRA support
Experimental prefix caching support
FP8 KV Cache support
Optimized MoE performance and Deepseek MoE support
CI tested PRs
Support batch completion in server
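
To make the "FP8 KV Cache support" item concrete: the E5M2 format named later in this release (#2279) uses 1 sign bit, 5 exponent bits, and 2 mantissa bits, the same sign and exponent layout as float16. The sketch below is an illustration only and is not vLLM's kernel code. It assumes truncation (round toward zero), whereas a real conversion kernel may round or saturate differently. The helper names `to_e5m2` and `from_e5m2` are hypothetical.

```python
import struct


def to_e5m2(x: float) -> int:
    """Encode x as an 8-bit E5M2 code by keeping the high byte of its float16 form.

    Assumption: truncation of the low 8 mantissa bits (round toward zero).
    Values outside the float16 range make struct.pack raise OverflowError.
    """
    half = struct.pack("<e", x)  # float16, little-endian: [low byte, high byte]
    return half[1]               # high byte = sign(1) + exponent(5) + mantissa(2)


def from_e5m2(code: int) -> float:
    """Decode an E5M2 code by padding the dropped mantissa bits with zeros."""
    return struct.unpack("<e", bytes([0, code]))[0]


if __name__ == "__main__":
    for value in (1.0, 1.3, -2.0, 0.1):
        code = to_e5m2(value)
        print(f"{value:>5} -> 0x{code:02X} -> {from_e5m2(code)}")
```

Each stored element takes one byte instead of the two needed for float16, so a cache of the same element count uses half the memory, at the cost of precision (for example, 1.3 decodes back to 1.25 under this truncating sketch).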

What's Changed

Minor fix of type hint by @beginlner in #2340
Build docker image with shared objects from "build" step by @payoto in #2237
Ensure metrics are logged regardless of requests by @ichernev in #2347
Changed scheduler to use deques instead of lists by @NadavShmayo in #2290
Fix eager mode performance by @WoosukKwon in #2377
[Minor] Remove unused code in attention by @WoosukKwon in #2384
Add baichuan chat template jinja file by @EvilPsyCHo in #2390
[Speculative decoding 1/9] Optimized rejection sampler by @cadedaniel in #2336
Fix ipv4 ipv6 dualstack by @yunfeng-scale in #2408
[Minor] Rename phi_1_5 to phi by @WoosukKwon in #2385
[DOC] Add additional comments for LLMEngine and AsyncLLMEngine by @litone01 in #1011
[Minor] Fix the format in quick start guide related to Model Scope by @zhuohan123 in #2425
Add gradio chatbot for openai webserver by @arkohut in #2307
[BUG] RuntimeError: deque mutated during iteration in abort_seq_group by @chenxu2048 in #2371
Allow setting fastapi root_path argument by @chiragjn in #2341
Address Phi modeling update 2 by @huiwy in #2428
Update a more user-friendly error message, offering more considerate advice for beginners, when using V100 GPU #1901 by @chuanzhubin in #2374
Update quickstart.rst with small clarifying change (fix typo) by @nautsimon in #2369
Aligning top_p and top_k Sampling by @chenxu2048 in #1885
[Minor] Fix err msg by @WoosukKwon in #2431
[Minor] Optimize cuda graph memory usage by @esmeetu in #2437
[CI] Add Buildkite by @simon-mo in #2355
Announce the second vLLM meetup by @WoosukKwon in #2444
Allow buildkite to retry build on agent lost by @simon-mo in #2446
Fix weight loading for GQA with TP by @zhangch9 in #2379
CI: make sure benchmark script exit on error by @simon-mo in #2449
ci: retry on build failure as well by @simon-mo in #2457
Add StableLM3B model by @ita9naiwa in #2372
OpenAI refactoring by @FlorianJoncour in #2360
[Experimental] Prefix Caching Support by @caoshiyi in #1669
fix stablelm.py tensor-parallel-size bug by @YingchaoX in #2482
Minor fix in prefill cache example by @JasonZhu1313 in #2494
fix: fix some args desc by @zspo in #2487
[Neuron] Add an option to build with neuron by @liangfu in #2065
Don't download both safetensor and bin files. by @NikolaBorisov in #2480
[BugFix] Fix abort_seq_group by @beginlner in #2463
refactor completion api for readability by @simon-mo in #2499
Support OpenAI API server in benchmark_serving.py by @hmellor in #2172
Simplify broadcast logic for control messages by @zhuohan123 in #2501
[Bugfix] fix load local safetensors model by @esmeetu in #2512
Add benchmark serving to CI by @simon-mo in #2505
Add group as an argument in broadcast ops by @GindaChen in #2522
[Fix] Keep scheduler.running as deque by @njhill in #2523
migrate pydantic from v1 to v2 by @joennlae in #2531
[Speculative decoding 2/9] Multi-step worker for draft model by @cadedaniel in #2424
Fix "Port could not be cast to integer value as " by @pcmoritz in #2545
Add qwen2 by @JustinLin610 in #2495
Fix progress bar and allow HTTPS in benchmark_serving.py by @hmellor in #2552
Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py by @JasonZhu1313 in #2553
[Feature] Simple API token authentication by @taisazero in #1106
Add multi-LoRA support by @Yard1 in #1804
lint: format all python file instead of just source code by @simon-mo in #2567
[Bugfix] fix crash if max_tokens=None by @NikolaBorisov in #2570
Added include_stop_str_in_output and length_penalty parameters to OpenAI API by @galatolofederico in #2562
[Doc] Fix the syntax error in the doc of supported_models. by @keli-wen in #2584
Support Batch Completion in Server by @simon-mo in #2529
fix names and license by @JustinLin610 in #2589
[Fix] Use a correct device when creating OptionalCUDAGuard by @sh1ng in #2583
[ROCm] add support to ROCm 6.0 and MI300 by @hongxiayang in #2274
Support for Stable LM 2 by @dakotamahan-stability in #2598
Don't build punica kernels by default by @pcmoritz in #2605
AWQ: Up to 2.66x higher throughput by @casper-hansen in #2566
Use head_dim in config if exists by @xiangxu-google in #2622
Custom all reduce kernels by @hanzhi713 in #2192
[Minor] Fix warning on Ray dependencies by @WoosukKwon in #2630
Speed up Punica compilation by @WoosukKwon in #2632
Small async_llm_engine refactor by @andoorve in #2618
Update Ray version requirements by @simon-mo in #2636
Support FP8-E5M2 KV Cache by @zhaoyang-star in #2279
Fix error when tp > 1 by @zhaoyang-star in #2644
No repeated IPC open by @hanzhi713 in #2642
ROCm: Allow setting compilation target by @rlrs in #2581
DeepseekMoE support with Fused MoE kernel by @zwd003 in #2453
Fused MOE for Mixtral by @pcmoritz in #2542
Fix 'Actor methods cannot be called directly' when using --engine-use-ray by @HermitSun in #2664
Add swap_blocks unit tests by @sh1ng in #2616
Fix a small typo (tenosr -> tensor) by @pcmoritz in #2672
[Minor] Fix false warning when TP=1 by @WoosukKwon in #2674
Add quantized mixtral support by @WoosukKwon in #2673
Bump up version to v0.3.0 by @zhuohan123 in #2656
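
Several entries in the list above concern the scheduler's queues moving from lists to deques (#2290), a "deque mutated during iteration" error in abort_seq_group (#2371, #2463), and keeping scheduler.running a deque (#2523). The following is a minimal standard-library sketch of the underlying behaviour, not vLLM's scheduler code. The function names and the use of plain strings as request ids are assumptions made for illustration.

```python
from collections import deque


def drain_fifo(waiting: deque) -> list:
    """Serve requests in arrival order.

    popleft() on a deque is O(1); list.pop(0) shifts every remaining
    element and is O(n), which is one motivation for using deques.
    """
    served = []
    while waiting:
        served.append(waiting.popleft())
    return served


def abort_requests(running: deque, aborted_ids: set) -> deque:
    """Return a new deque without the aborted ids.

    Building a filtered copy avoids mutating the deque while iterating it.
    """
    return deque(req for req in running if req not in aborted_ids)


def mutation_raises() -> bool:
    """Show that removing from a deque during iteration raises RuntimeError."""
    running = deque(["a", "b", "c"])
    try:
        for req in running:
            if req == "a":
                running.remove(req)  # mutates the deque mid-iteration
    except RuntimeError:
        return True
    return False


if __name__ == "__main__":
    print(drain_fifo(deque(["r1", "r2", "r3"])))
    print(list(abort_requests(deque(["a", "b", "c"]), {"b"})))
    print(mutation_raises())
```

The filtered copy keeps the result a deque, so later code that relies on popleft() or appendleft() continues to work.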

New Contributors

@payoto made their first contribution in #2237
@NadavShmayo made their first contribution in #2290
@EvilPsyCHo made their first contribution in #2390
@litone01 made their first contribution in #1011
@arkohut made their first contribution in #2307
@chiragjn made their first contribution in #2341
@huiwy made their first contribution in #2428
@chuanzhubin made their first contribution in #2374
@nautsimon made their first contribution in #2369
@zhangch9 made their first contribution in #2379
@ita9naiwa made their first contribution in #2372
@caoshiyi made their first contribution in https://gi...

Contributors

pcmoritz, NikolaBorisov, and 46 other contributors

04 Jan 01:36

github-actions

v0.2.7

2e0b6e7

v0.2.7

Major Changes

Up to 70% throughput improvement for distributed inference by removing serialization/deserialization overheads
Fix tensor parallelism support for Mixtral + GPTQ/AWQ

What's Changed

Minor fix for gpu-memory-utilization description by @SuhongMoon in #2162
[BugFix] Raise error when max_model_len is larger than KV cache size by @WoosukKwon in #2163
[BugFix] Fix RoPE kernel on long sequences by @WoosukKwon in #2164
Add SSL arguments to API servers by @hmellor in #2109
typo fix by @oushu1zhangxiangxuan1 in #2166
[ROCm] Fixes for GPTQ on ROCm by @kliuae in #2180
Update Help Text for --gpu-memory-utilization Argument by @SuhongMoon in #2183
[Minor] Add warning on CUDA graph memory usage by @WoosukKwon in #2182
Added DeciLM-7b and DeciLM-7b-instruct by @avideci in #2062
[BugFix] Fix weight loading for Mixtral with TP by @WoosukKwon in #2208
Make _prepare_sample non blocking and pin memory of CPU input buffers by @hanzhi713 in #2207
Remove Sampler copy stream by @Yard1 in #2209
Fix a broken link by @ronensc in #2222
Disable Ray usage stats collection by @WoosukKwon in #2206
[BugFix] Fix recovery logic for sequence group by @WoosukKwon in #2186
Update installation instructions to include CUDA 11.8 xFormers by @skt7 in #2246
Add "About" Heading to README.md by @blueceiling in #2260
[BUGFIX] Do not return ignored sentences twice in async llm engine by @zhuohan123 in #2258
[BUGFIX] Fix API server test by @zhuohan123 in #2270
[BUGFIX] Fix the path of test prompts by @zhuohan123 in #2273
[BUGFIX] Fix communication test by @zhuohan123 in #2285
Add support for GPT-NeoX models without attention biases by @dalgarak in #2301
[FIX] Fix kernel bug by @jeejeelee in #1959
fix typo and remove unused code by @esmeetu in #2305
Enable CUDA graph for GPTQ & SqueezeLLM by @WoosukKwon in #2318
Fix Gradio example: remove deprecated parameter concurrency_count by @ronensc in #2315
Use NCCL instead of ray for control-plane communication to remove serialization overhead by @zhuohan123 in #2221
Remove unused const TIMEOUT_TO_PREVENT_DEADLOCK by @ronensc in #2321
[Minor] Revert the changes in test_cache by @WoosukKwon in #2335
Bump up to v0.2.7 by @WoosukKwon in #2337

New Contributors

@SuhongMoon made their first contribution in #2162
@hmellor made their first contribution in #2109
@oushu1zhangxiangxuan1 made their first contribution in #2166
@kliuae made their first contribution in #2180
@avideci made their first contribution in #2062
@hanzhi713 made their first contribution in #2207
@ronensc made their first contribution in #2222
@skt7 made their first contribution in #2246
@blueceiling made their first contribution in #2260
@dalgarak made their first contribution in #2301

Full Changelog: v0.2.6...v0.2.7

Contributors

esmeetu, Yard1, and 13 other contributors

17 Dec 18:35

github-actions

v0.2.6

671af2b

v0.2.6

Major changes

Fast model execution with CUDA/HIP graph
W4A16 GPTQ support (thanks to @chu-tianxiang)
Fix memory profiling with tensor parallelism
Fix *.bin weight loading for Mixtral models

What's Changed

Fix typing in generate function for AsyncLLMEngine & add toml to requirements-dev by @mezuzza in #2100
Fix Dockerfile.rocm by @tjtanaa in #2101
avoid multiple redefinition by @MitchellX in #1817
Add a flag to include stop string in output text by @yunfeng-scale in #1976
Add GPTQ support by @chu-tianxiang in #916
[Docs] Add quantization support to docs by @WoosukKwon in #2135
[ROCm] Temporarily remove GPTQ ROCm support by @WoosukKwon in #2138
simplify loading weights logic by @esmeetu in #2133
Optimize model execution with CUDA graph by @WoosukKwon in #1926
[Minor] Delete Llama tokenizer warnings by @WoosukKwon in #2146
Fix all-reduce memory usage by @WoosukKwon in #2151
Pin PyTorch & xformers versions by @WoosukKwon in #2155
Remove dependency on CuPy by @WoosukKwon in #2152
[Docs] Add CUDA graph support to docs by @WoosukKwon in #2148
Temporarily enforce eager mode for GPTQ models by @WoosukKwon in #2154
[Minor] Add more detailed explanation on quantization argument by @WoosukKwon in #2145
[Minor] Fix xformers version by @WoosukKwon in #2158
[Minor] Add Phi 2 to supported models by @WoosukKwon in #2159
Make sampler less blocking by @Yard1 in #1889
[Minor] Fix a typo in .pt weight support by @WoosukKwon in #2160
Disable CUDA graph for SqueezeLLM by @WoosukKwon in #2161
Bump up to v0.2.6 by @WoosukKwon in #2157

New Contributors

@mezuzza made their first contribution in #2100
@MitchellX made their first contribution in #1817

Full Changelog: v0.2.5...v0.2.6

Contributors

mezuzza, esmeetu, and 6 other contributors
