
v0.4.2 Release Tracker #4505

Closed
simon-mo opened this issue Apr 30, 2024 · 12 comments
Labels: release (Related to new version release)

Comments

@simon-mo
Collaborator

simon-mo commented Apr 30, 2024

ETA May 3rd, Friday.

@simon-mo simon-mo added the misc and release (Related to new version release) labels and removed the misc label Apr 30, 2024
@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Apr 30, 2024

@vrdn-23

vrdn-23 commented Apr 30, 2024

Would it be possible to get #4419, #4357, and #3763 also included in this release? The dependency on Ray for multiple GPUs in a single node is a real pain to deal with!

@nivibilla

Following on from @vrdn-23, #3466 would be great too. I already use Ray for scaling across multiple nodes, and this is the only solution that works for models that don't fit on a single GPU.

@vrdn-23

vrdn-23 commented Apr 30, 2024

From what I understand, #3466 got split into a bunch of smaller PRs (of which #4419 and #4357 are still yet to be merged, I think), so we're asking for the same thing. :)
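For readers landing here: the ask above is about running single-node multi-GPU (tensor-parallel) inference without pulling in Ray. As a rough, hedged sketch of the workflow those PRs touch (the model name is a placeholder, not from this thread), tensor parallelism is driven like this from the offline API:

```python
# Hedged sketch: single-node multi-GPU inference with vLLM's offline API.
# tensor_parallel_size shards the model across the GPUs on one node; in
# v0.4.x this path is what still pulls in Ray, which #4419/#4357 aim to
# make optional via a multiprocessing backend.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-hf",  # placeholder model
    tensor_parallel_size=2,             # number of GPUs on this node
)
sampling = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What does tensor parallelism do?"], sampling)
print(outputs[0].outputs[0].text)
```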

@jeejeelee
Contributor

Could we consider #4132? It has proven to be incredibly useful in my development process.

@simon-mo simon-mo pinned this issue May 1, 2024
@rkooo567
Collaborator

rkooo567 commented May 1, 2024

#4451 -> this has to be included in the next release (otherwise chunked prefill will crash when preemption is used)
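For context, a hedged sketch of how chunked prefill is switched on in the offline API (the crash above reportedly hits when a request is preempted while this is enabled; the model name is a placeholder):

```python
# Hedged sketch: enabling chunked prefill in vLLM's offline API.
# enable_chunked_prefill lets the scheduler split long prompt (prefill) work
# across steps and interleave it with decode; #4451 fixes a crash that can
# occur when such a request is preempted.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",    # placeholder small model
    enable_chunked_prefill=True,  # the feature affected by the bug
)
out = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```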

@cadedaniel
Collaborator

@robertgshaw2-neuralmagic for block manager V2 we still need to do profiling before we swap over. I made an issue to track this: #4537.

@robertgshaw2-neuralmagic
Collaborator

@cadedaniel do you need something from the NM side on this?

@cadedaniel
Collaborator

Sure, if there's interest :) I mention it because APC in BlockManagerV2 (https://github.com/vllm-project/vllm/pull/4142) is not strictly necessary for the release (block manager v2 is not ready yet).
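For anyone following along, both features under discussion are opt-in engine arguments around this release. A hedged sketch (the flag names are my assumption of the ~0.4.x API, and the model is a placeholder):

```python
# Hedged sketch: automatic prefix caching (APC) and the v2 block manager.
# Around vLLM 0.4.x both are opt-in engine arguments; PR #4142 is about
# making APC work on top of BlockManagerV2, which is not the default yet.
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",     # placeholder small model
    enable_prefix_caching=True,    # APC on the default (v1) block manager
    # use_v2_block_manager=True,   # v2 path; per this thread, not ready to be the default
)
```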

@aliozts

aliozts commented May 2, 2024

Is it possible to include #4305?

@simon-mo
Collaborator Author

simon-mo commented May 5, 2024

Released https://github.com/vllm-project/vllm/releases/tag/v0.4.2

Notably:

  • Even though our tests are starting to run against CUDA 12.4, this release is still built with CUDA 12.1 for both the wheel and the Docker image.
  • The debug info has been stripped due to a wheel size issue; the Docker release is not affected.
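After upgrading (for example with pip install -U vllm==0.4.2), a quick way to confirm which build you ended up with:

```python
# Quick sanity check of the installed vLLM version and the CUDA toolkit
# that the bundled PyTorch was built against (per the notes above, the
# v0.4.2 wheel and Docker image are CUDA 12.1 builds).
import torch
import vllm

print("vllm:", vllm.__version__)           # expect "0.4.2"
print("torch CUDA:", torch.version.cuda)   # expect "12.1" for this wheel
```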

@simon-mo simon-mo closed this as completed May 5, 2024
@RyanWMHI

I used SqueezeLLM 4-bit to quantize my model, but it seems there is a bug:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ryan/vllm/benchmarks/benchmark_latency.py", line 195, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/ryan/vllm/benchmarks/benchmark_latency.py", line 20, in main
[rank0]:     llm = LLM(model=args.model,
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ryan/vllm/vllm/entrypoints/llm.py", line 123, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ryan/vllm/vllm/engine/llm_engine.py", line 292, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/home/ryan/vllm/vllm/engine/llm_engine.py", line 160, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:                           ^^^^^^^^^^^^^^^
[rank0]:   File "/home/ryan/vllm/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/home/ryan/vllm/vllm/executor/gpu_executor.py", line 23, in _init_executor
[rank0]:     self._init_non_spec_worker()
[rank0]:   File "/home/ryan/vllm/vllm/executor/gpu_executor.py", line 69, in _init_non_spec_worker
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/home/ryan/vllm/vllm/worker/worker.py", line 118, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/home/ryan/vllm/vllm/worker/model_runner.py", line 164, in load_model
[rank0]:     self.model = get_model(
[rank0]:                  ^^^^^^^^^^
[rank0]:   File "/home/ryan/vllm/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ryan/vllm/vllm/model_executor/model_loader/loader.py", line 224, in load_model
[rank0]:     model.load_weights(
[rank0]:   File "/home/ryan/vllm/vllm/model_executor/models/llama.py", line 407, in load_weights
[rank0]:     param = params_dict[name]
[rank0]:             ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.layers.0.self_attn.qkv_proj.rows'

After I changed the code in llama.py to:

if name != "model.layers.0.self_attn.qkv_proj.rows":
    param = params_dict[name]
else:
    param = params_dict["model.layers.0.self_attn.qkv_proj.qweight"]

I still get a bug:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ryan/vllm/benchmarks/benchmark_latency.py", line 195, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/ryan/vllm/benchmarks/benchmark_latency.py", line 20, in main
[rank0]:     llm = LLM(model=args.model,
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ryan/vllm/vllm/entrypoints/llm.py", line 123, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ryan/vllm/vllm/engine/llm_engine.py", line 292, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/home/ryan/vllm/vllm/engine/llm_engine.py", line 160, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:                           ^^^^^^^^^^^^^^^
[rank0]:   File "/home/ryan/vllm/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/home/ryan/vllm/vllm/executor/gpu_executor.py", line 23, in _init_executor
[rank0]:     self._init_non_spec_worker()
[rank0]:   File "/home/ryan/vllm/vllm/executor/gpu_executor.py", line 69, in _init_non_spec_worker
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/home/ryan/vllm/vllm/worker/worker.py", line 118, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/home/ryan/vllm/vllm/worker/model_runner.py", line 164, in load_model
[rank0]:     self.model = get_model(
[rank0]:                  ^^^^^^^^^^
[rank0]:   File "/home/ryan/vllm/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ryan/vllm/vllm/model_executor/model_loader/loader.py", line 224, in load_model
[rank0]:     model.load_weights(
[rank0]:   File "/home/ryan/vllm/vllm/model_executor/models/llama.py", line 412, in load_weights
[rank0]:     weight_loader(param, loaded_weight, shard_id)
[rank0]:   File "/home/ryan/vllm/vllm/model_executor/layers/linear.py", line 561, in weight_loader
[rank0]:     loaded_weight = loaded_weight.narrow(output_dim, start_idx,
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
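Not a fix, but a hedged debugging sketch for weight-name mismatches like this: dump the checkpoint tensor names and see how the SqueezeLLM-specific keys (such as *.qkv_proj.rows) line up with what load_weights expects. The checkpoint path is a placeholder:

```python
# Hedged sketch: inspect a safetensors checkpoint to see which SqueezeLLM
# tensors (e.g. "*.qkv_proj.rows") exist and what shapes they have, which
# helps explain KeyError / narrow() failures inside load_weights.
from safetensors import safe_open

CKPT = "model.safetensors"  # placeholder path; point this at the quantized model

with safe_open(CKPT, framework="pt", device="cpu") as f:
    keys = list(f.keys())
    print(f"{len(keys)} tensors in checkpoint")
    for k in keys:
        if "qkv_proj" in k or k.endswith(".rows"):
            t = f.get_tensor(k)
            print(f"  {k}: shape={tuple(t.shape)}, dtype={t.dtype}")
```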

@simon-mo simon-mo unpinned this issue May 14, 2024