[WIP] Upstream encoder/decoder support based on multiple blocktables #161

Draft: wants to merge 233 commits into base: main

Changes from all commits (233 commits)
d7f3964
Update comment (#2934)
ronensc Feb 22, 2024
5574081
Added early stopping to completion APIs (#2939)
Maxusmusti Feb 22, 2024
344020c
Migrate MistralForCausalLM to LlamaForCausalLM (#2868)
esmeetu Feb 22, 2024
95529e3
Use Llama RMSNorm custom op for Gemma (#2974)
WoosukKwon Feb 22, 2024
93dc5a2
chore(vllm): codespell for spell checking (#2820)
mspronesti Feb 22, 2024
fd5dcc5
Optimize GeGLU layer in Gemma (#2975)
WoosukKwon Feb 22, 2024
c530e2c
[FIX] Fix a bug in initializing Yarn RoPE (#2983)
44670 Feb 22, 2024
6f32cdd
Remove Flash Attention in test env (#2982)
WoosukKwon Feb 22, 2024
4caf704
Include tokens from prompt phase in `counter_generation_tokens` (#2802)
ronensc Feb 22, 2024
57f0449
Fix nvcc not found in vlm-openai image (#2781)
zhaoyang-star Feb 22, 2024
f7c1234
[FIX] Fix assertion on YaRN model len (#2984)
WoosukKwon Feb 23, 2024
ef978fe
Port metrics from `aioprometheus` to `prometheus_client` (#2730)
hmellor Feb 25, 2024
70f3e8e
Add LogProbs for Chat Completions in OpenAI (#2918)
jlcmoore Feb 26, 2024
cfc15a1
Optimize Triton MoE Kernel (#2979)
pcmoritz Feb 26, 2024
d6e4a13
[Minor] Remove gather_cached_kv kernel (#3043)
WoosukKwon Feb 26, 2024
d9f726c
[Minor] Remove unused config files (#3039)
esmeetu Feb 27, 2024
c1c0d00
Don't use cupy when `enforce_eager=True` (#3037)
esmeetu Feb 27, 2024
4dd6416
Fix stablelm (#3038)
esmeetu Feb 27, 2024
48a8f4a
Support Orion model (#2539)
dachengai Feb 27, 2024
2410e32
fix `get_ip` error in pure ipv6 environment (#2931)
Jingru Feb 27, 2024
4bd18ec
[Minor] Fix type annotation in fused moe (#3045)
WoosukKwon Feb 27, 2024
e0ade06
Support logit bias for OpenAI API (#3027)
dylanwhawk Feb 27, 2024
8b430d7
[Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM (#3046)
WoosukKwon Feb 27, 2024
71bcaf9
Enable GQA support in the prefix prefill kernels (#3007)
sighingnow Feb 27, 2024
a868310
multi-lora documentation fix (#3064)
ElefHead Feb 28, 2024
e46fa5d
Restrict prometheus_client >= 0.18.0 to prevent errors when importing…
AllenDou Feb 28, 2024
3b7178c
[Neuron] Support inference with transformers-neuronx (#2569)
liangfu Feb 28, 2024
929b4f2
Add LoRA support for Gemma (#3050)
WoosukKwon Feb 28, 2024
dd82ba3
t5-small
Feb 18, 2024
f2fd579
fix
js8544 Feb 29, 2024
01a5d18
Add Support for 2/3/8-bit GPTQ Quantization Models (#2330)
chu-tianxiang Feb 29, 2024
a6d471c
Fix: `AttributeError` in OpenAI-compatible server (#3018)
jaywonchung Feb 29, 2024
9289e57
add cache_config's info to prometheus metrics. (#3100)
AllenDou Feb 29, 2024
bfdcfa6
Support starcoder2 architecture (#3089)
sh0416 Feb 29, 2024
2fb6905
lint
js8544 Feb 29, 2024
2c08ff2
Fix building from source on WSL (#3112)
aliencaocao Feb 29, 2024
29a8d6a
[Fix] Don't deep-copy LogitsProcessors when copying SamplingParams (#…
njhill Feb 29, 2024
703e42e
Add guided decoding for OpenAI API server (#2819)
felixzhu555 Feb 29, 2024
54d3544
Fix: Output text is always truncated in some models (#3016)
HyperdriveHustle Mar 1, 2024
27ca23d
Remove exclude_unset in streaming response (#3143)
sh0416 Mar 1, 2024
49d849b
docs: Add tutorial on deploying vLLM model with KServe (#2586)
terrytangyuan Mar 1, 2024
90fbf12
fix relative import path of protocol.py (#3134)
Huarong Mar 1, 2024
be58c3b
T5 enc/dec example file; linting/formatting
afeldman-nm Mar 1, 2024
c0c2335
Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
robertgshaw2-neuralmagic Mar 1, 2024
82091b8
Bump up to v0.3.3 (#3129)
WoosukKwon Mar 1, 2024
70837fd
native/vllm t5 comparison test
afeldman-nm Mar 1, 2024
42a6e2b
merged upstream-main into enc_dec_t5
afeldman-nm Mar 1, 2024
29e70e3
allow user chose log level by --log-level instead of fixed 'info'. (#…
AllenDou Mar 1, 2024
e3fd30d
Merge branch 'upstream-main' into enc_dec_t5
afeldman-nm Mar 2, 2024
db726e6
Merge pull request #1 from afeldman-nm/enc_dec_t5
js8544 Mar 2, 2024
43e920e
remove debug print statements
afeldman-nm Mar 2, 2024
431f014
silence warning; legacy=False for tokenizer; lint/format
afeldman-nm Mar 2, 2024
37fcf99
Merge branch 'js8544_enc_dec_t5' into enc_dec_t5
afeldman-nm Mar 2, 2024
baee28c
Reorder kv dtype check to avoid nvcc not found error on AMD platform …
cloudhan Mar 2, 2024
4bf056b
Merge pull request #2 from afeldman-nm/enc_dec_t5
js8544 Mar 2, 2024
ce4f5a2
Add Automatic Prefix Caching (#2762)
SageMoore Mar 2, 2024
d65fac2
Add vLLM version info to logs and openai API server (#3161)
jasonacox Mar 3, 2024
996d095
[FIX] Fix styles in automatic prefix caching & add a automatic prefix…
zhuohan123 Mar 3, 2024
17c3103
Make it easy to profile workers with nsight (#3162)
pcmoritz Mar 4, 2024
d0fae88
[DOC] add setup document to support neuron backend (#2777)
liangfu Mar 4, 2024
901cf4c
[Minor Fix] Remove unused code in benchmark_prefix_caching.py (#3171)
gty111 Mar 4, 2024
27a7b07
Add document for vllm paged attention kernel. (#2978)
pian13131 Mar 4, 2024
9cbc7e5
enable --gpu-memory-utilization in benchmark_throughput.py (#3175)
AllenDou Mar 4, 2024
76e8a70
[Minor fix] The domain dns.google may cause a socket.gaierror excepti…
ttbachyinsda Mar 4, 2024
22de452
Push logprob generation to LLMEngine (#3065)
Yard1 Mar 4, 2024
ff578ca
Add health check, make async Engine more robust (#3015)
Yard1 Mar 4, 2024
9a4548b
Fix the openai benchmarking requests to work with latest OpenAI apis …
wangchen615 Mar 4, 2024
05af6da
[ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs (#…
hongxiayang Mar 5, 2024
8a5060f
fix _make_tensor_with_pad args change which broke decoder scenario
afeldman-nm Mar 5, 2024
29d6f44
fixed bug caused by non-handling of self.model_config is None in mode…
afeldman-nm Mar 5, 2024
a4950ba
remove commented-out print statements
afeldman-nm Mar 5, 2024
9c03760
small cleanup
afeldman-nm Mar 5, 2024
8999ec3
Store `eos_token_id` in `Sequence` for easy access (#3166)
njhill Mar 5, 2024
2efce05
[Fix] Avoid pickling entire LLMEngine for Ray workers (#3207)
njhill Mar 6, 2024
9f20ccf
Merge pull request #3 from afeldman-nm/enc_dec_t5
js8544 Mar 6, 2024
24aecf4
[Tests] Add block manager and scheduler tests (#3108)
rkooo567 Mar 6, 2024
a33ce60
[Testing] Fix core tests (#3224)
cadedaniel Mar 6, 2024
4cb3b92
Add tqdm `dynamic_ncols=True` (#3242)
chujiezheng Mar 6, 2024
d3c04b6
Add GPTQ support for Gemma (#3200)
TechxGenus Mar 7, 2024
cbf4c05
Update requirements-dev.txt to include package for benchmarking scrip…
wangchen615 Mar 7, 2024
2daf23a
Separate attention backends (#3005)
WoosukKwon Mar 7, 2024
385da2d
Measure model memory usage (#3120)
mgoin Mar 7, 2024
6d6dccd
arg naming fix
afeldman-nm Mar 7, 2024
8cbba46
Possible fix for conflict between Automated Prefix Caching (#2762) an…
jacobthebanana Mar 7, 2024
b35cc93
Fix auto prefix bug (#3239)
ElizaWszola Mar 8, 2024
d2339d6
Connect engine healthcheck to openai server (#3260)
njhill Mar 8, 2024
c59e120
Feature add lora support for Qwen2 (#3177)
whyiug Mar 8, 2024
1ece1ae
[Minor Fix] Fix comments in benchmark_serving (#3252)
gty111 Mar 8, 2024
99c3cfb
[Docs] Fix Unmocked Imports (#3275)
ywang96 Mar 8, 2024
1cb0cc2
[FIX] Make `flash_attn` optional (#3269)
WoosukKwon Mar 8, 2024
c2c5e09
Move model filelocks from `/tmp/` to `~/.cache/vllm/locks/` dir (#3241)
mgoin Mar 8, 2024
f48c679
[FIX] Fix prefix test error on main (#3286)
zhuohan123 Mar 9, 2024
8437bae
[Speculative decoding 3/9] Worker which speculates, scores, and appli…
cadedaniel Mar 9, 2024
0bba88d
Enhance lora tests with more layer and rank variations (#3243)
tterrysun Mar 10, 2024
e4a28e5
[ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUD…
dllehr-amd Mar 10, 2024
9e8744a
[BugFix] Fix get tokenizer when using ray (#3301)
esmeetu Mar 11, 2024
4b59f00
[Fix] Fix best_of behavior when n=1 (#3298)
njhill Mar 11, 2024
2f8844b
Re-enable the 80 char line width limit (#3305)
zhuohan123 Mar 11, 2024
657061f
[docs] Add LoRA support information for models (#3299)
pcmoritz Mar 11, 2024
4c92270
Add distributed model executor abstraction (#3191)
zhuohan123 Mar 11, 2024
c9415c1
[ROCm] Fix warp and lane calculation in blockReduceSum (#3321)
kliuae Mar 11, 2024
654865e
Support Mistral Model Inference with transformers-neuronx (#3153)
DAIZHENWEI Mar 11, 2024
7035178
Merge branch 'js8544_enc_dec_t5' into enc_dec_t5
afeldman-nm Mar 12, 2024
dbec357
fixed attention_kernels.cu merge conflict; questions about ROCM
afeldman-nm Mar 12, 2024
b0925b3
docs: Add BentoML deployment doc (#3336)
Sherlock113 Mar 12, 2024
4b2a121
llm_engine.py conflict resolution; removed prefix caching code; Seque…
afeldman-nm Mar 12, 2024
a93c17d
actually updated Sequence constructor to take i_encoder_decoder, eos_…
afeldman-nm Mar 12, 2024
a62c3af
xformers.py accept incoming changes; replace paged_attention function…
afeldman-nm Mar 12, 2024
c31921f
saved changed to xformers woops
afeldman-nm Mar 12, 2024
0c78be9
attempt at fixing model_runner conflicts related to encoder/decoder &…
afeldman-nm Mar 12, 2024
e25e6b8
encoder/decoder + prefix caching not supported; moved check from llm.…
afeldman-nm Mar 12, 2024
7f70d76
refactoring, including: moved enc_dec_attention.py into vllm/model_ex…
afeldman-nm Mar 12, 2024
36c8291
existing regressions pass (yay) but encoder/decoder example fails
afeldman-nm Mar 12, 2024
08f268a
fixed encoder/decoder reshape and cache bug, but paged attention call…
afeldman-nm Mar 12, 2024
b9b0600
augmented paged attention with context_lens, max_context_len, block_t…
afeldman-nm Mar 12, 2024
63e9dca
linting/formatting fixes
afeldman-nm Mar 12, 2024
4d7e5a8
Merge branch 'upstream-main' into enc_dec_t5_merge_upstream2
afeldman-nm Mar 12, 2024
49a3c86
Fixes #1556 double free (#3347)
br3no Mar 13, 2024
602358f
Add kernel for GeGLU with approximate GELU (#3337)
WoosukKwon Mar 13, 2024
b167109
[Fix] Fix quantization="gptq" when using Marlin (#3319)
DreamTeamWangbowen Mar 13, 2024
e221910
add hf_transfer to requirements.txt (#3031)
RonanKMcGovern Mar 13, 2024
ba8dc95
[Minor] Fix bias in if to remove ambiguity (#3259)
hliuca Mar 13, 2024
739c350
[Minor Fix] Use cupy-cuda11x in CUDA 11.8 build (#3256)
chenxu2048 Mar 13, 2024
ae0ccb4
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism…
orsharir Mar 13, 2024
7e9bd08
Add batched RoPE kernel (#3095)
tterrysun Mar 13, 2024
c33afd8
Fix lint (#3388)
Yard1 Mar 13, 2024
eeab52a
[FIX] Simpler fix for async engine running on ray (#3371)
zhuohan123 Mar 13, 2024
81653d9
[Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion …
simon-mo Mar 14, 2024
a37415c
allow user to chose which vllm's merics to display in grafana (#3393)
AllenDou Mar 14, 2024
8fe8386
[Kernel] change benchmark script so that result can be directly used;…
youkaichao Mar 14, 2024
06ec486
Install `flash_attn` in Docker image (#3396)
tdoublep Mar 14, 2024
c17ca8e
Add args for mTLS support (#3410)
declark1 Mar 14, 2024
dfc7740
[issue templates] add some issue templates (#3412)
youkaichao Mar 14, 2024
54be8a0
Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373)
chenxu2048 Mar 14, 2024
b983ba3
fix marlin config repr (#3414)
qeternity Mar 14, 2024
78b6c48
Dynamically configure shared memory size for moe_align_block_size_ker…
akhoroshev Mar 15, 2024
b522c44
[Misc] add HOST_IP env var (#3419)
youkaichao Mar 15, 2024
21539e6
Add chat templates for Falcon (#3420)
Dinghow Mar 15, 2024
253a980
Add chat templates for ChatGLM (#3418)
Dinghow Mar 15, 2024
429284d
Fix `dist.broadcast` stall without group argument (#3408)
GindaChen Mar 15, 2024
a7c8716
Fix tie_word_embeddings for Qwen2. (#3344)
fyabc Mar 15, 2024
03d37f2
[Fix] Add args for mTLS support (#3430)
declark1 Mar 15, 2024
14b8ae0
Fixes the misuse/mixuse of time.time()/time.monotonic() (#3220)
sighingnow Mar 15, 2024
604f235
[Misc] add error message in non linux platform (#3438)
youkaichao Mar 15, 2024
a7af453
Fix issue templates (#3436)
hmellor Mar 15, 2024
8fa7357
fix document error for value and v_vec illustration (#3421)
laneeeee Mar 15, 2024
fb96c1e
Asynchronous tokenization (#2879)
Yard1 Mar 15, 2024
10585e0
Removed Extraneous Print Message From OAI Server (#3440)
robertgshaw2-neuralmagic Mar 16, 2024
bb7a219
Merge branch 'upstream-main' into enc_dec_t5_merge_upstream2
afeldman-nm Mar 16, 2024
413366e
[Misc] PR templates (#3413)
youkaichao Mar 16, 2024
0b60121
fixed bug introduced during formatting
afeldman-nm Mar 16, 2024
d44257e
fixed example
afeldman-nm Mar 16, 2024
19c5c4b
Merge branch 'enc_dec_t5' into enc_dec_t5_merge_upstream2
afeldman-nm Mar 16, 2024
3123f15
Fixes the incorrect argument in the prefix-prefill test cases (#3246)
sighingnow Mar 16, 2024
14e3f9a
Replace `lstrip()` with `removeprefix()` to fix Ruff linter warning (…
ronensc Mar 16, 2024
cf6ff18
Fix Baichuan chat template (#3340)
Dinghow Mar 16, 2024
ad50bf4
fix lint
simon-mo Mar 16, 2024
8e67598
[Misc] fix line length for entire codebase (#3444)
simon-mo Mar 16, 2024
120157f
Support arbitrary json_object in OpenAI and Context Free Grammar (#3211)
simon-mo Mar 16, 2024
6b78837
Fix setup.py neuron-ls issue (#2671)
simon-mo Mar 16, 2024
abfc4f3
[Misc] Use dataclass for InputMetadata (#3452)
WoosukKwon Mar 17, 2024
93348d9
[CI] Shard tests for LoRA and Kernels to speed up (#3445)
simon-mo Mar 17, 2024
9101d83
[Bugfix] Make moe_align_block_size AMD-compatible (#3470)
WoosukKwon Mar 18, 2024
8c654c0
CI: Add ROCm Docker Build (#2886)
simon-mo Mar 18, 2024
482b0ad
[Testing] Add test_config.py to CI (#3437)
cadedaniel Mar 18, 2024
097aa0e
[CI/Build] Fix Bad Import In Test (#3473)
robertgshaw2-neuralmagic Mar 18, 2024
c0c17d4
[Misc] Fix PR Template (#3478)
zhuohan123 Mar 18, 2024
9fdf3de
Cmake based build system (#2830)
bnellnm Mar 18, 2024
49eedea
[Core] Zero-copy asdict for InputMetadata (#3475)
Yard1 Mar 18, 2024
b30880a
[Misc] Update README for the Third vLLM Meetup (#3479)
zhuohan123 Mar 18, 2024
b37cdce
[Core] Cache some utils (#3474)
Yard1 Mar 19, 2024
6a9c583
[Core] print error before deadlock (#3459)
youkaichao Mar 19, 2024
ef65dcf
[Doc] Add docs about OpenAI compatible server (#3288)
simon-mo Mar 19, 2024
7341c77
[BugFix] Avoid initializing CUDA too early (#3487)
njhill Mar 19, 2024
c614cfe
Update dockerfile with ModelScope support (#3429)
ifsheldon Mar 19, 2024
c2f97b6
merged upstream
afeldman-nm Mar 19, 2024
2a60c9b
[Doc] minor fix to neuron-installation.rst (#3505)
jimburtoft Mar 19, 2024
cc63d03
Revert "[Core] Cache some utils" (#3507)
simon-mo Mar 19, 2024
63e8b28
[Doc] minor fix of spelling in amd-installation.rst (#3506)
jimburtoft Mar 19, 2024
0536ff5
rolled back some encoder/decoder changes
afeldman-nm Mar 19, 2024
20478c4
Use lru_cache for some environment detection utils (#3508)
simon-mo Mar 19, 2024
9474e89
[PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator perfor…
ElizaWszola Mar 20, 2024
4ad521d
[Core] Add generic typing to `LRUCache` (#3511)
njhill Mar 20, 2024
5ee1449
[Misc] Remove cache stream and cache events (#3461)
WoosukKwon Mar 20, 2024
84eaa68
Abort when nvcc command is not found in the PATH (#3527)
AllenDou Mar 20, 2024
ba8ae1d
Check for _is_cuda() in compute_num_jobs (#3481)
bnellnm Mar 20, 2024
80e2548
[Bugfix] Fix ROCm support in CMakeLists.txt (#3534)
jamestwhedbee Mar 20, 2024
426ec4e
[1/n] Triton sampling kernel (#3186)
Yard1 Mar 20, 2024
6e435de
[1/n][Chunked Prefill] Refactor input query shapes (#3236)
rkooo567 Mar 20, 2024
f1c0fc3
Migrate `logits` computation and gather to `model_runner` (#3233)
esmeetu Mar 20, 2024
523e30e
[BugFix] Hot fix in setup.py for neuron build (#3537)
zhuohan123 Mar 21, 2024
6ebd02b
[PREFIX CACHING FOLLOW UP] OrderedDict-based evictor (#3431)
ElizaWszola Mar 21, 2024
3bbff9e
Fix 1D query issue from `_prune_hidden_states` (#3539)
rkooo567 Mar 21, 2024
4c07dd2
[🚀 Ready to be merged] Added support for Jais models (#3183)
grandiose-pizza Mar 21, 2024
8657323
[Misc][Log] Add log for tokenizer length not equal to vocabulary size…
esmeetu Mar 21, 2024
c188ecb
[Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config (#3551)
WoosukKwon Mar 21, 2024
b7050ca
[BugFix] gemma loading after quantization or LoRA. (#3553)
taeminlee Mar 21, 2024
ea5f14e
[Bugfix][Model] Fix Qwen2 (#3554)
esmeetu Mar 22, 2024
e90fc21
[Hardware][Neuron] Refactor neuron support (#3471)
zhuohan123 Mar 22, 2024
f721096
[BugFix] Some fixes for custom allreduce kernels (#2760)
hanzhi713 Mar 22, 2024
7d4972c
merged in upstream-main
afeldman-nm Mar 22, 2024
23a5da5
added cross_block_tables to SequenceGroupMetadata
afeldman-nm Mar 22, 2024
e32fb9c
SequenceGroupMetadata: added cross_seq_data; optional along with cros…
afeldman-nm Mar 22, 2024
ae1c368
added block manager allocation of cross sequence block_tables
afeldman-nm Mar 22, 2024
691c2c1
scheduler schedule() support cross block-tables and cross sequences, …
afeldman-nm Mar 22, 2024
e240eb4
LLMEngine can build a sequencegroup with cross sequences
afeldman-nm Mar 22, 2024
cbfba8e
t5 Sampler does not pass vocab size to constructor; input_metadata.pr…
afeldman-nm Mar 22, 2024
501551c
add_request now correctly swaps decoder_prompt, prompt in encoder/de…
afeldman-nm Mar 22, 2024
08435e4
Added cross_input_metadata field to InputMetadata
afeldman-nm Mar 22, 2024
6e459a2
wip multi blocktable
afeldman-nm Mar 25, 2024
8e1ca33
wip
afeldman-nm Mar 25, 2024
e097732
plumbing dummy input metadata structures into model
afeldman-nm Mar 25, 2024
2a44585
plumbed encoder/decoder input metadata all the way into t5
afeldman-nm Mar 25, 2024
91a4608
first pass at T5 encoder support
afeldman-nm Mar 26, 2024
d0c5e36
inefficient but effective & Attention-wrapper-compatible implementati…
afeldman-nm Mar 27, 2024
3737d5b
wip cross-attention
afeldman-nm Mar 28, 2024
38946ed
first pass at enc/dec support that runs e2e but doesn't produce corre…
afeldman-nm Apr 1, 2024
3c39f55
to pass regression tests: removed debug prints
afeldman-nm Apr 1, 2024
4ec2fde
wip vllm, examples => fp32
afeldman-nm Apr 1, 2024
38f55ed
works on bsz = 1
afeldman-nm Apr 1, 2024
1aedc80
intermediate activations for prompt_run look right! Decoded token loo…
afeldman-nm Apr 1, 2024
c1258b4
wip
afeldman-nm Apr 2, 2024
0af1022
passing with t5-small
afeldman-nm Apr 3, 2024
9e8d234
vLLM T5 matches native! fixes: decode-phase cross-input-met…
afeldman-nm Apr 4, 2024
f5242a0
refactoring out print statements
afeldman-nm Apr 4, 2024
de0fd31
fix to pass regression tests
afeldman-nm Apr 4, 2024
5a67647
WIP google/flan-t5-xxxx
afeldman-nm Apr 4, 2024
ed05d47
removed print statement
afeldman-nm Apr 4, 2024
d5a8b92
batched enc/dec example
afeldman-nm Apr 10, 2024
f555f5d
wip, trying prompt padding
afeldman-nm Apr 12, 2024
2c12b44
bs >1 prefill works
afeldman-nm Apr 17, 2024
dba02b2
small change to examples
afeldman-nm Apr 17, 2024
db201b6
fix to support case where num prompts != 2
afeldman-nm Apr 17, 2024
38 changes: 38 additions & 0 deletions .buildkite/run-amd-test.sh
@@ -0,0 +1,38 @@
# This script builds the ROCm docker image and runs the API server inside the container.
# It serves as a sanity check for compilation and basic model usage.
set -ex

# Print ROCm version
rocminfo

# Try building the docker image
docker build -t rocm -f Dockerfile.rocm .

# Setup cleanup
remove_docker_container() { docker rm -f rocm || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image
docker run --device /dev/kfd --device /dev/dri --network host --name rocm rocm python3 -m vllm.entrypoints.api_server &

# Wait for the server to start
wait_for_server_to_start() {
    timeout=300
    counter=0

    while [ "$(curl -s -o /dev/null -w ''%{http_code}'' localhost:8000/health)" != "200" ]; do
        sleep 1
        counter=$((counter + 1))
        if [ $counter -ge $timeout ]; then
            echo "Timeout after $timeout seconds"
            break
        fi
    done
}
wait_for_server_to_start

# Test a simple prompt
curl -X POST -H "Content-Type: application/json" \
    localhost:8000/generate \
    -d '{"prompt": "San Francisco is a"}'
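
For local debugging, the same health check and smoke test can be driven from Python instead of curl. This is a minimal sketch, assuming the server started by the script above is reachable at localhost:8000 and exposes the `/health` and `/generate` routes used there; the `requests`-based client is illustrative and not part of this change.

```python
import time

import requests  # third-party HTTP client, used here only for illustration

BASE_URL = "http://localhost:8000"  # assumed address, matching the script above


def wait_for_server(timeout_s: int = 300) -> bool:
    """Poll /health until the server answers HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{BASE_URL}/health", timeout=1).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not up yet
        time.sleep(1)
    return False


if wait_for_server():
    # Same smoke-test prompt as the curl call in run-amd-test.sh.
    resp = requests.post(f"{BASE_URL}/generate", json={"prompt": "San Francisco is a"})
    print(resp.status_code, resp.text)
else:
    print("Timed out waiting for the API server")
```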
27 changes: 20 additions & 7 deletions .buildkite/test-pipeline.yaml
@@ -14,6 +14,9 @@ steps:
 - label: Basic Correctness Test
   command: pytest -v -s --forked basic_correctness

+- label: Core Test
+  command: pytest -v -s core
+
 - label: Distributed Comm Ops Test
   command: pytest -v -s --forked test_comm_ops.py
   working_dir: "/vllm-workspace/tests/distributed"
@@ -25,14 +28,14 @@ steps:
   num_gpus: 2 # only support 1 or 2 for now.

 - label: Engine Test
-  command: pytest -v -s engine
+  command: pytest -v -s engine tokenization test_sequence.py test_config.py

 - label: Entrypoints Test
   command: pytest -v -s entrypoints

-- label: Kernels Test
-  command: pytest -v -s kernels
-  soft_fail: true
+- label: Kernels Test %N
+  command: pytest -v -s kernels --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
+  parallelism: 4

 - label: Models Test
   commands:
@@ -44,13 +47,23 @@ steps:
     - pytest -v -s prefix_caching

 - label: Samplers Test
-  command: pytest -v -s samplers --forked
+  command: pytest -v -s samplers
+
+- label: LogitsProcessor Test
+  command: pytest -v -s test_logits_processor.py

 - label: Worker Test
   command: pytest -v -s worker

-- label: LoRA Test
-  command: pytest -v -s lora
+- label: Speculative decoding tests
+  command: pytest -v -s spec_decode
+
+- label: LoRA Test %N
+  command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
+  parallelism: 4
+
+- label: Metrics Test
+  command: pytest -v -s metrics

 - label: Benchmarks
   working_dir: "/vllm-workspace/.buildkite"
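
As context for the sharded Kernels/LoRA entries above, here is a rough sketch of how a shard-aware launcher could derive the per-agent pytest command from the Buildkite parallelism variables. The `BUILDKITE_PARALLEL_JOB*` names come from the pipeline above; the `--shard-id`/`--num-shards` flags assume the pytest-shard plugin, and the helper itself is illustrative, not part of this PR.

```python
import os
import shlex


def sharded_pytest_command(test_dir: str) -> str:
    """Build the pytest invocation for this Buildkite parallel agent.

    BUILDKITE_PARALLEL_JOB / BUILDKITE_PARALLEL_JOB_COUNT are set by Buildkite
    when a step declares `parallelism`; each agent runs one shard of the suite.
    """
    shard_id = int(os.environ.get("BUILDKITE_PARALLEL_JOB", "0"))
    num_shards = int(os.environ.get("BUILDKITE_PARALLEL_JOB_COUNT", "1"))
    return shlex.join([
        "pytest", "-v", "-s", test_dir,
        f"--shard-id={shard_id}",
        f"--num-shards={num_shards}",
    ])


print(sharded_pytest_command("kernels"))
# e.g. on agent 2 of 4: pytest -v -s kernels --shard-id=2 --num-shards=4
```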
8 changes: 8 additions & 0 deletions .buildkite/test-template.j2
@@ -3,6 +3,11 @@
 {% set default_working_dir = "/vllm-workspace/tests" %}

 steps:
+  - label: "AMD Test"
+    agents:
+      queue: amd
+    command: bash .buildkite/run-amd-test.sh
+
   - label: ":docker: build image"
     commands:
       - "docker build --build-arg max_jobs=16 --tag {{ docker_image }} --target test --progress plain ."
@@ -20,6 +25,9 @@ steps:
     agents:
       queue: kubernetes
    soft_fail: {{ step.soft_fail or false }}
+    {% if step.parallelism %}
+    parallelism: {{ step.parallelism }}
+    {% endif %}
     retry:
       automatic:
         - exit_status: -1  # Agent was lost
4 changes: 4 additions & 0 deletions .clang-format
@@ -0,0 +1,4 @@
# Use the Google style in this project.
BasedOnStyle: Google

ColumnLimit: 120
22 changes: 22 additions & 0 deletions .github/ISSUE_TEMPLATE/100-documentation.yml
@@ -0,0 +1,22 @@
name: 📚 Documentation
description: Report an issue related to https://docs.vllm.ai/
title: "[Doc]: "
labels: ["documentation"]

body:
- type: textarea
  attributes:
    label: 📚 The doc issue
    description: >
      A clear and concise description of what content in https://docs.vllm.ai/ is an issue.
  validations:
    required: true
- type: textarea
  attributes:
    label: Suggest a potential alternative/fix
    description: >
      Tell us how we could improve the documentation in this regard.
- type: markdown
  attributes:
    value: >
      Thanks for contributing 🎉!
39 changes: 39 additions & 0 deletions .github/ISSUE_TEMPLATE/200-installation.yml
@@ -0,0 +1,39 @@
name: 🛠️ Installation
description: Report an issue here when you hit errors during installation.
title: "[Installation]: "
labels: ["installation"]

body:
- type: markdown
  attributes:
    value: >
      #### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+).
- type: textarea
  attributes:
    label: Your current environment
    description: |
      Please run the following and paste the output below.
      ```sh
      wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
      # For security purposes, please feel free to check the contents of collect_env.py before running it.
      python collect_env.py
      ```
    value: |
      ```text
      The output of `python collect_env.py`
      ```
  validations:
    required: true
- type: textarea
  attributes:
    label: How you are installing vllm
    description: |
      Paste the full command you are trying to execute.
    value: |
      ```sh
      pip install -vvv vllm
      ```
- type: markdown
  attributes:
    value: >
      Thanks for contributing 🎉!
37 changes: 37 additions & 0 deletions .github/ISSUE_TEMPLATE/300-usage.yml
@@ -0,0 +1,37 @@
name: 💻 Usage
description: Raise an issue here if you don't know how to use vllm.
title: "[Usage]: "
labels: ["usage"]

body:
- type: markdown
  attributes:
    value: >
      #### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+).
- type: textarea
  attributes:
    label: Your current environment
    description: |
      Please run the following and paste the output below.
      ```sh
      wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
      # For security purposes, please feel free to check the contents of collect_env.py before running it.
      python collect_env.py
      ```
    value: |
      ```text
      The output of `python collect_env.py`
      ```
  validations:
    required: true
- type: textarea
  attributes:
    label: How would you like to use vllm
    description: |
      A detailed description of how you want to use vllm.
    value: |
      I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.
- type: markdown
  attributes:
    value: >
      Thanks for contributing 🎉!
81 changes: 81 additions & 0 deletions .github/ISSUE_TEMPLATE/400-bug report.yml
@@ -0,0 +1,81 @@
name: 🐛 Bug report
description: Raise an issue here if you find a bug.
title: "[Bug]: "
labels: ["bug"]

body:
- type: markdown
  attributes:
    value: >
      #### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+).
- type: textarea
  attributes:
    label: Your current environment
    description: |
      Please run the following and paste the output below.
      ```sh
      wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
      # For security purposes, please feel free to check the contents of collect_env.py before running it.
      python collect_env.py
      ```
    value: |
      ```text
      The output of `python collect_env.py`
      ```
  validations:
    required: true
- type: textarea
  attributes:
    label: 🐛 Describe the bug
    description: |
      Please provide a clear and concise description of what the bug is.

      If relevant, add a minimal example so that we can reproduce the error by running the code. It is very important for the snippet to be as succinct (minimal) as possible, so please take time to trim down any irrelevant code to help us debug efficiently. We are going to copy-paste your code and we expect to get the same result as you did: avoid any external data, and include the relevant imports, etc. For example:

      ```python
      from vllm import LLM, SamplingParams

      prompts = [
          "Hello, my name is",
          "The president of the United States is",
          "The capital of France is",
          "The future of AI is",
      ]
      sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

      llm = LLM(model="facebook/opt-125m")

      outputs = llm.generate(prompts, sampling_params)

      # Print the outputs.
      for output in outputs:
          prompt = output.prompt
          generated_text = output.outputs[0].text
          print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
      ```

      If the code is too long (hopefully, it isn't), feel free to put it in a public gist and link it in the issue: https://gist.github.com.

      Please also paste or describe the results you observe instead of the expected results. If you observe an error, please paste the error message including the **full** traceback of the exception. It may be relevant to wrap error messages in ```` ```triple quotes blocks``` ````.
    placeholder: |
      A clear and concise description of what the bug is.

      ```python
      # Sample code to reproduce the problem
      ```

      ```
      The error message you got, with the full traceback.
      ```
  validations:
    required: true
- type: markdown
  attributes:
    value: >
      ⚠️ Please separate bugs of `transformers` implementation or usage from bugs of `vllm`. If you think anything is wrong with the models' output:

      - Try the counterpart of `transformers` first. If the error appears, please go to [their issues](https://github.com/huggingface/transformers/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc).

      - If the error only appears in vllm, please provide the detailed script of how you run `transformers` and `vllm`, also highlight the difference and what you expect.

      Thanks for contributing 🎉!
31 changes: 31 additions & 0 deletions .github/ISSUE_TEMPLATE/500-feature request.yml
@@ -0,0 +1,31 @@
name: 🚀 Feature request
description: Submit a proposal/request for a new vllm feature
title: "[Feature]: "
labels: ["feature request"]

body:
- type: markdown
  attributes:
    value: >
      #### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+).
- type: textarea
  attributes:
    label: 🚀 The feature, motivation and pitch
    description: >
      A clear and concise description of the feature proposal. Please outline the motivation for the proposal. Is your feature request related to a specific problem? e.g., *"I'm working on X and would like Y to be possible"*. If this is related to another GitHub issue, please link here too.
  validations:
    required: true
- type: textarea
  attributes:
    label: Alternatives
    description: >
      A description of any alternative solutions or features you've considered, if any.
- type: textarea
  attributes:
    label: Additional context
    description: >
      Add any other context or screenshots about the feature request.
- type: markdown
  attributes:
    value: >
      Thanks for contributing 🎉!
33 changes: 33 additions & 0 deletions .github/ISSUE_TEMPLATE/600-new model.yml
@@ -0,0 +1,33 @@
name: 🤗 Support request for a new model from huggingface
description: Submit a proposal/request for a new model from huggingface
title: "[New Model]: "
labels: ["new model"]

body:
- type: markdown
  attributes:
    value: >
      #### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+).

      #### We also highly recommend you read https://docs.vllm.ai/en/latest/models/adding_model.html first to understand how to add a new model.
- type: textarea
  attributes:
    label: The model to consider.
    description: >
      A huggingface url, pointing to the model, e.g. https://huggingface.co/openai-community/gpt2 .
  validations:
    required: true
- type: textarea
  attributes:
    label: The closest model vllm already supports.
    description: >
      Here is the list of models already supported by vllm: https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models . Which model is the most similar to the model you want to add support for?
- type: textarea
  attributes:
    label: What's your difficulty of supporting the model you want?
    description: >
      For example, any new operators or new architecture?
- type: markdown
  attributes:
    value: >
      Thanks for contributing 🎉!