[Bug]: Cannot load 70b exl2 5bpw model across 4 GPUs. #471
Comments
I got further when I loaded a GPTQ model. It turns out you have to specify the quantization explicitly or else you will get an OOM, which isn't very intuitive. Unfortunately, I'm still finding that context consumes a LOT of memory. I am only using a batch size of 1, so I don't get how I can't load a GPTQ model with more than 4096 context. On all 4 GPUs now: determine_num_available_blocks causes a deadlock and the GPUs get stuck. Removing flash_attn got rid of the deadlock, but now I am getting 2.5 t/s and the GPUs are at 77% memory utilization each. I thought this was supposed to be faster than pipeline parallel?
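As a point of reference, here is a minimal sketch of what "specifying the quantization" can look like, assuming Aphrodite mirrors vLLM's offline `LLM` API (the argument names and the model path below are assumptions/placeholders, not taken from this thread):

```python
# Hypothetical sketch: tell the engine the weights are GPTQ instead of relying
# on auto-detection, so the memory profiler sizes the KV cache against the
# quantized footprint rather than an FP16-sized estimate.
from aphrodite import LLM, SamplingParams  # assumed vLLM-style API

llm = LLM(
    model="TheBloke/Llama-2-70B-GPTQ",  # placeholder GPTQ repo
    quantization="gptq",                # the explicit hint discussed above
    tensor_parallel_size=4,
    max_model_len=4096,
    gpu_memory_utilization=0.90,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```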
Tensor parallelism by nature doesn't work well with an asymmetric setup. The 2080 Ti is dragging the 3090s down in terms of both VRAM and speed. For optimal performance you really need 4x 3090.
It's only 2 GB less than a 3090. Compute-wise, yeah, it's a bit slower. When used with pure exllama or other engines the hit isn't that bad. When I try chunked prefill I get:
Have tried all of these. On 2 GPUs I can only fit 8192.
That 4-bit cache is really something. I can normally fit 32K with a GPTQ model like this and 16K with 5-bit EXL2, in only 48 GB. Throughput is not massively better either:
In this case the cards have PCIe 3.0 x16 and NVLink. Perhaps I am missing a setting to make it better adapted to a single batch, or it can only go fast when the number of batches is > 1.
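If single-batch behavior is the goal, one knob worth a look is capping the number of concurrent sequences the scheduler plans for. A minimal sketch, again assuming Aphrodite keeps vLLM-style engine arguments such as `max_num_seqs` (argument names and the model path are assumptions/placeholders):

```python
from aphrodite import LLM  # assumed vLLM-style API

# Hypothetical single-stream configuration: don't reserve scheduler capacity
# for batches that will never arrive, and keep context modest so the KV cache
# fits alongside the quantized weights.
llm = LLM(
    model="path/to/5.0bpw-exl2-model",  # placeholder
    tensor_parallel_size=2,
    max_model_len=8192,
    max_num_seqs=1,                     # plan for a single concurrent request
    gpu_memory_utilization=0.95,
)
```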
At the moment, FP8 can't work with chunked prefill/context shifting. There's some work being done in this branch to address this issue.
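For reference, a hedged sketch of the two options in question, assuming Aphrodite exposes vLLM-style `kv_cache_dtype` and `enable_chunked_prefill` arguments (names assumed; the model path is a placeholder). Per the comment above, only the FP8 cache is enabled here and chunked prefill is left off:

```python
from aphrodite import LLM  # assumed vLLM-style API

# FP8 KV cache to stretch the usable context, with chunked prefill disabled,
# since the thread reports the two can't be combined yet.
llm = LLM(
    model="path/to/exl2-model",     # placeholder
    kv_cache_dtype="fp8",
    enable_chunked_prefill=False,
    tensor_parallel_size=2,
)
```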
I hit a similar bug. Environment: When I try to load elinas_Meta-Llama-3-120B-Instruct-4.0bpw-exl2 (61 GB), it runs out of VRAM instantly; it doesn't even attempt to actually load the model from disk. But it can load Meta-Llama-3-70B-Instruct-8.0bpw-h8-exl2 just fine, even though that model is bigger at 68 GB. Both models load fine with exllamav2.
You need to specify |
So to compile, I need to do it with 11.8 still? I am using 12.x conda and I had trouble: it wasn't able to find ninja despite it being installed and available from the command line.
As another point of reference for others hitting this thread while debugging: wolfram/miquliz-120b-v2.0-3.0bpw-h6-exl2 works on two 4090s with 3000 context length using …
Once we can use …
I am a little confused about wolfram's numbers in the model's readme though, which say that I should be getting 2x the above context numbers for that 3.0bpw model: https://huggingface.co/wolfram/miquliz-120b-v2.0-3.0bpw-h6-exl2
Classic exllamav2 lets you fit that context; for some reason Aphrodite uses more.
You can set |
Thanks! I'll give that a shot.
I just want to be clear here (not to say that you implied otherwise!): Alpin has done way too much for the OSS ML community already. It's kind of embarrassing how much of the open-source ML community rests on the shoulders of a handful of giants like Alpin. Whatever Alpin is up to, I'd be willing to bet that they're investing their time well!
A potential clue: With
(Note the ….) It works fine, and I get this:
With how smooth my brain is in areas relevant to understanding this, I can only assume that …
So TL;DR: It looks like simply raising … Again, I could be misunderstanding, but is it perhaps possible that there's a miscalculation in …?
If you don't use chunked prefill, activations also cost VRAM, so the higher you set max-model-len, the less KV cache you can fit. That's why chunked prefill is important: it makes activation memory roughly constant.
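A minimal sketch of turning chunked prefill on, assuming Aphrodite keeps vLLM's `enable_chunked_prefill` and `max_num_batched_tokens` arguments (names are assumptions; the model path is a placeholder):

```python
from aphrodite import LLM  # assumed vLLM-style API

# With chunked prefill, a long prompt is processed in fixed-size chunks, so
# activation memory is bounded by the chunk size rather than by max_model_len,
# leaving more VRAM for KV cache at long context lengths.
llm = LLM(
    model="path/to/long-context-model",  # placeholder
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,         # prefill chunk size (illustrative value)
    max_model_len=32768,
    tensor_parallel_size=2,
)
```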
Ah, thank you! Removing |
Your current environment
conda nccl v2.21.5.1
🐛 Describe the bug
I have 4 GPUs: 3x 3090 and 1x 2080 Ti 22 GB.
I try to load Cat-Llama 70B 5.0bpw exl2 with Aphrodite. If I don't disable custom all-reduce, it tries to grind away at the peer cache for a while at 100% GPU usage on the non-NVLinked GPUs. Then it fails.
If I disable custom all-reduce, it OOMs by a few hundred MB. In nvtop I see it load something onto all 4 GPUs, kill it, and then load onto GPU 0 up to 98% and OOM. I have tried setting the GPU memory fraction to various values without luck, and tried with and without flash attention and with a context limit set.
Never gets past model_load.
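For context, a hedged sketch of the kind of launch configuration being described, assuming Aphrodite keeps vLLM-style engine arguments for tensor parallelism, custom all-reduce, and per-GPU memory fraction (argument names and the model path are assumptions/placeholders):

```python
from aphrodite import LLM  # assumed vLLM-style API

# 4-way tensor parallelism across mismatched cards; custom all-reduce is
# disabled because not every pair of GPUs is NVLinked, and the memory
# fraction is lowered to leave headroom on the 22 GB 2080 Ti.
llm = LLM(
    model="path/to/cat-llama-70b-5.0bpw-exl2",  # placeholder local path
    quantization="exl2",                        # assumed value; the thread says it must be explicit
    tensor_parallel_size=4,
    disable_custom_all_reduce=True,
    gpu_memory_utilization=0.85,
    max_model_len=4096,
)
```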