[Bug]: Cannot load 70b exl2 5bpw model across 4 GPUs. #471

Open
Ph0rk0z opened this issue May 14, 2024 · 14 comments

Ph0rk0z commented May 14, 2024

Your current environment

conda nccl v2.21.5.1

🐛 Describe the bug

I have 4 GPUs: 3x 3090 and 1x 2080 Ti 22 GB.

I'm trying to load Cat-Llama 70B 5.0bpw exl2 with Aphrodite. If I don't disable custom all-reduce, it grinds away at the peer-access cache for a while at 100% GPU usage on the non-NVLinked GPUs, then fails.

If I disable custom all-reduce, it OOMs by a few hundred MB. In nvtop I see it load something onto all 4 GPUs, free it, then fill GPU 0 to 98% and OOM. I have tried setting the GPU memory utilization to various values without luck, and tried with and without flash attention and with a context limit set.

It never gets past model load.
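
For reference, the launch is along these lines (the model path and utilization value are placeholders, not my exact command):

python3 -m aphrodite.endpoints.openai.api_server --model /models/cat-llama-70b-5.0bpw-exl2 --tensor-parallel-size 4 --disable-custom-all-reduce --gpu-memory-utilization 0.95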

Ph0rk0z added the bug label May 14, 2024
Ph0rk0z (Author) commented May 14, 2024

I got further when I loaded a GPTQ model. It turns out you have to specify the quantization, or else you get an OOM; this isn't very intuitive. Unfortunately, I'm still finding that context consumes a LOT of memory. I'm only using a batch size of 1, so I don't understand why I can't load a GPTQ model with more than 4096 context.
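
For reference, the kind of command that finally got past weight loading (path shortened and flags approximate, not my exact invocation):

python3 -m aphrodite.endpoints.openai.api_server --model /models/Midnight-Miqu-70B-v1.0_GPTQ32G --quantization gptq --tensor-parallel-size 4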

On all 4 GPUs now: determine_num_available_blocks causes a deadlock and the GPUs get stuck.

Removing flash_attn got rid of the deadlock, but now I get 2.5 t/s and the GPUs sit at 77% memory utilization each. I thought this was supposed to be faster than pipeline parallelism?

sgsdxzy (Collaborator) commented May 14, 2024

Tensor parallelism by nature doesn't work well with an asymmetric setup. The 2080 Ti is dragging the 3090s down in terms of both VRAM and speed; for optimal performance you really need 4x 3090.
You can add --enable-chunked-prefill to the launch options to save VRAM.
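
Roughly like this (the model path and the rest of the flags are placeholders for whatever you already use):

python3 -m aphrodite.endpoints.openai.api_server --model /path/to/model --tensor-parallel-size 4 --enable-chunked-prefill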

Ph0rk0z (Author) commented May 15, 2024

It's only 2 GB less than a 3090. Compute-wise, yes, it's a bit slower, but when used with pure exllama or other engines the hit isn't that bad.

When I try chunked prefill I get:

ERROR:      File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/triton/language/semantic.py", line 1207, in dot
ERROR:        assert_dtypes_valid(lhs.dtype, rhs.dtype, builder.options)
ERROR:      File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/triton/language/semantic.py", line 1183, in assert_dtypes_valid
ERROR:        assert lhs_dtype == rhs_dtype, f"First input ({lhs_dtype}) and second input ({rhs_dtype}) must have the same dtype!"
ERROR:               ^^^^^^^^^^^^^^^^^^^^^^
ERROR:    AssertionError: First input (fp16) and second input (uint8) must have the same dtype!

I have tried --max-num-batched-tokens, --max-model-len, --kv-cache-dtype fp8, and setting the max number of requests to 1, but no dice.

On 2 GPUs I can only fit 8192.


INFO:     Initializing the Aphrodite Engine (v0.5.3) with the following config:
INFO:     Model = '/mnt/7d815d93-e74c-4d1e-b1da-6d7e1d187a17/models/Midnight-Miqu-70B-v1.0_GPTQ32G'
INFO:     Speculative Config = None
INFO:     DataType = torch.float16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 2
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = gptq
INFO:     Context Length = 8192
INFO:     Enforce Eager Mode = True
INFO:     KV Cache Data Type = fp8
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
INFO:     Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
INFO:     Using FlashAttention backend.
(RayWorkerAphrodite pid=935938) INFO:     Using FlashAttention backend.
INFO:     Aphrodite is using nccl==2.21.5
(RayWorkerAphrodite pid=935938) INFO:     Aphrodite is using nccl==2.21.5
INFO:     reading GPU P2P access cache from /home/supermicro/.config/aphrodite/gpu_p2p_access_cache_for_0,2.json
(RayWorkerAphrodite pid=935938) INFO:     reading GPU P2P access cache from /home/supermicro/.config/aphrodite/gpu_p2p_access_cache_for_0,2.json
INFO:     Model weights loaded. Memory usage: 19.82 GiB x 2 = 39.63 GiB
(RayWorkerAphrodite pid=935938) INFO:     Model weights loaded. Memory usage: 19.82 GiB x 2 = 39.63 GiB
INFO:     # GPU blocks: 537, # CPU blocks: 3276
INFO:     Minimum concurrency: 1.05x
INFO:     Maximum sequence length allowed in the cache: 8592
(RayWorkerAphrodite pid=935938) INFO:     Maximum sequence length allowed in the cache: 8592

That 4-bit cache is really something. I can normally fit 32K of context with a GPTQ model like this, and 16K with 5-bit EXL2, in only 48 GB.

Throughput is not massively better either:

INFO:     Avg prompt throughput: 157.3 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.9%, CPU KV cache usage: 0.0%
INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 17.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 12.8%, CPU KV cache usage: 0.0%

In this case the cards have PCIe 3.0 x16 and NVLink. Perhaps I'm missing a setting to make it better suited to a single batch, or it can only go fast when the number of batches is >1.

AlpinDale (Member) commented:

At the moment, FP8 can't work with chunked prefill/context shifting. There's some work being done in this branch to address this issue.

ortegaalfredo commented May 15, 2024

I hit a similar bug:

Environment:
4x 3090, CUDA 12.4, Aphrodite 0.5.3, 96 GB of VRAM total, tensor parallel = 4.

When I try to load elinas_Meta-Llama-3-120B-Instruct-4.0bpw-exl2 (61 GB), it runs out of VRAM instantly; it doesn't even attempt to actually load the model from disk.

But it can load Meta-Llama-3-70B-Instruct-8.0bpw-h8-exl2 just fine, even though that model is bigger at 68 GB.

Both models load fine with exllamav2.

sgsdxzy (Collaborator) commented May 16, 2024

You need to specify -q exl2 for exl2 models if they were quantized with older versions of exllamav2 and don't have a quantization config in config.json.
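
For example (the model path is a placeholder):

python3 -m aphrodite.endpoints.openai.api_server --model /path/to/Meta-Llama-3-120B-Instruct-4.0bpw-exl2 -q exl2 --tensor-parallel-size 4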

Ph0rk0z (Author) commented May 16, 2024

So to compile, do I still need to do it with CUDA 11.8? I'm using a CUDA 12.x conda environment and had trouble: it wasn't able to find ninja despite it being installed and available from the command line.

josephrocca commented Jun 1, 2024

As another point of reference for others hitting this thread while debugging: wolfram/miquliz-120b-v2.0-3.0bpw-h6-exl2 works on two 4090s with 3000 context length using --enable-chunked-prefill, on the latest official Aphrodite Docker image as of writing [1].

Once we can use --kv-cache-dtype fp8 alongside chunked prefill, that should go up to ~6K context length, which would be usable. Keenly looking forward to that!

I am a little confused about wolfram's numbers in the model's readme though, which say that I should be getting 2x the above context numbers for that 3.0bpw model: https://huggingface.co/wolfram/miquliz-120b-v2.0-3.0bpw-h6-exl2


[1] alpindale/aphrodite-engine@sha256:b1e72201654a172e044a13d9346264a8b4e562dba8f3572bd92f013cf5420eb1

Ph0rk0z (Author) commented Jun 1, 2024

Classic exllamav2 lets you fit that context; for some reason Aphrodite uses more.

sgsdxzy (Collaborator) commented Jun 1, 2024

You can set --max_num_batched_tokens to a lower value (the default is 768) alongside --enable-chunked-prefill to further reduce VRAM usage and squeeze in a bit more context. It won't be significant, though.
Quantized KV cache is the way to go; sadly, Alpin is too busy these days to support it.
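
For example (the value here is just an illustration; tune it for your setup):

python3 -m aphrodite.endpoints.openai.api_server --model /path/to/model --tensor-parallel-size 2 --enable-chunked-prefill --max-num-batched-tokens 256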

josephrocca commented Jun 1, 2024

Thanks! I'll give that a shot.

Quantized KV cache is the way to go; sadly, Alpin is too busy these days to support it.

I just want to be clear here (not to say that you implied otherwise!): Alpin has done way too much for the OSS ML community already. It's kind of embarrassing how much of the open-source ML community rests on the shoulders of a handful of giants like Alpin. Whatever Alpin is up to, I'd be willing to bet that they're investing their time well!

josephrocca commented Jun 2, 2024

Classic exllamav2 lets you fit that context; for some reason Aphrodite uses more.

A potential clue: with LoneStriker/llama-3-70B-Instruct-abliterated-4.65bpw-h6-exl2 on the latest official Docker image:

python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --download-dir /tmp/hub --model LoneStriker/llama-3-70B-Instruct-abliterated-4.65bpw-h6-exl2 --revision 8da3c6266899aae3a6041f4314ad4f84ba5c1d76 --kv-cache-dtype fp8 --tensor-parallel-size 2 --gpu-memory-utilization 0.99 --quantization exl2 --max-model-len 2000 --max-log-len 1000

(Note the --max-model-len 2000)

It works fine, and I get this:

2024-06-02T10:29:23.866491109Z INFO:     Model weights loaded. Memory usage: 20.01 GiB x 2 = 40.01 GiB
2024-06-02T10:29:50.229697819Z (RayWorkerAphrodite pid=1980) INFO:     Model weights loaded. Memory usage: 19.95 GiB x 2 = 39.90 GiB
2024-06-02T10:29:50.229737795Z INFO:     # GPU blocks: 612, # CPU blocks: 3276
2024-06-02T10:29:50.230003675Z INFO:     Minimum concurrency: 4.90x
2024-06-02T10:29:50.230777711Z INFO:     Maximum sequence length allowed in the cache: 9792

With how smooth my brain is in the areas relevant to understanding this, I can only assume that 9792 means I should be able to get a context length of 9792. But if I switch to --max-model-len 4096, then I get this:

2024-06-02T10:45:20.603829375Z [rank0]:   File "/app/aphrodite-engine/aphrodite/executor/ray_gpu_executor.py", line 309, in _run_workers
2024-06-02T10:45:20.603834254Z [rank0]:     driver_worker_output = getattr(self.driver_worker,
2024-06-02T10:45:20.603839323Z [rank0]:   File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 176, in initialize_cache
2024-06-02T10:45:20.603844283Z [rank0]:     raise_if_cache_size_invalid(num_gpu_blocks,
2024-06-02T10:45:20.603849532Z [rank0]:   File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 345, in raise_if_cache_size_invalid
2024-06-02T10:45:20.603861014Z [rank0]: ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (2368). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

So, TL;DR: it looks like simply raising --max-model-len from 2000 to 4096 somehow decreased the maximum number of tokens that can be stored in the KV cache from 9792 to 2368.

Again, I could be misunderstanding, but is it perhaps possible that there's a miscalculation in raise_if_cache_size_invalid?

sgsdxzy (Collaborator) commented Jun 2, 2024

If you don't use chunked prefill, activations also cost VRAM, so the higher you set max-model-len, the less KV cache you can fit. That's why chunked prefill is important: it makes activations use a constant amount of VRAM.
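
For reference, the "maximum sequence length allowed in the cache" figure is just the GPU block count multiplied by the KV-cache block size, and your logs are consistent with the default block size of 16 tokens: 612 blocks × 16 = 9792 (and 537 × 16 = 8592 in the earlier log). The 2368 at --max-model-len 4096 corresponds to 2368 / 16 = 148 blocks, because the larger activation reservation during profiling leaves fewer blocks for the KV cache.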

josephrocca commented:

Ah, thank you! Removing --kv-cache-dtype fp8 and adding --enable-chunked-prefill was the only way to get this model to fit.
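
In case it helps anyone else landing here, the working invocation is essentially my earlier command with --kv-cache-dtype fp8 dropped and --enable-chunked-prefill added (the --max-model-len value below is a placeholder; set it to whatever fits):

python3 -m aphrodite.endpoints.openai.api_server --host 0.0.0.0 --download-dir /tmp/hub --model LoneStriker/llama-3-70B-Instruct-abliterated-4.65bpw-h6-exl2 --revision 8da3c6266899aae3a6041f4314ad4f84ba5c1d76 --tensor-parallel-size 2 --gpu-memory-utilization 0.99 --quantization exl2 --enable-chunked-prefill --max-model-len 4096 --max-log-len 1000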
