[Bug]: Issue when trying to load an AWQ model with --load-in-4bit for Mixtral flavors #342
Comments
Please remove the
Will try.
@AlpinDale Nope, same stack trace. I also checked with a non-Mixtral (non-MoE) AWQ model and it works like a charm, including with --quantization awq plus --load-in-4bit. I noticed a significant increase in token generation speed compared to loading the AWQ model without the --load-in-4bit flag.
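For reference, a minimal sketch of that reportedly working combination on a non-MoE AWQ model; the model repo here is illustrative, not taken from this thread:

```bash
# Sketch of the setup the commenter reports working: a non-MoE AWQ model
# served with both --quantization awq and --load-in-4bit.
# The model repo is illustrative, not from this thread.
python3 -m aphrodite.endpoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --load-in-4bit
```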
How large is the increase in speed when adding --load-in-4bit to AWQ models, and did you notice it on all models? Also, does it affect generation quality at all?
@SalomonKisters On a 4090 I would say the increase in speed is noticeable to the naked eye (I have not benchmarked it yet).
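Since no numbers were shared, one rough way to quantify the difference would be to time a fixed-length completion against the OpenAI-compatible endpoint once per server configuration. A sketch, assuming the server is listening locally on port 3000 and serving the model as dolf, as in the reporter's command below:

```bash
# Rough throughput check: time a fixed-length completion against the
# OpenAI-compatible /v1/completions endpoint. Run once with and once
# without --load-in-4bit on the server side, then compare wall time.
# Assumes localhost:3000 and --served-model-name dolf.
time curl -s http://localhost:3000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "dolf", "prompt": "Once upon a time", "max_tokens": 256}' \
  > /dev/null
```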
Okay, sounds nice. So you just use AWQ-quantized models with --quantization awq --load-in-4bit?
Yes, but it is not working for MoEs; for MoEs, I think GPTQ is the best option right now.
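A sketch of that alternative, serving a Mixtral-style MoE through GPTQ instead of AWQ with --load-in-4bit; the model repo is illustrative, and the flags mirror ones already used in this thread:

```bash
# Sketch: serving a Mixtral-style MoE with GPTQ quantization, per the
# suggestion above. The model repo is illustrative, not from this thread.
python3 -m aphrodite.endpoints.openai.api_server \
  --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
  --quantization gptq \
  --tensor-parallel-size 2
```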
Ah, makes sense. But for GPTQ models it doesn't work combined with --load-in-4bit, right?
Right.
Any update? I have the same error message.
Your current environment
That's the output of my host (I'm running the engine with the official Docker image).
🐛 Describe the bug
When I try to load an AWQ-quantized model with --load-in-4bit and the model is a Mixtral-style MoE, it throws the following stack trace:
Entry point command executed inside the Docker container:
python3 -m aphrodite.endpoints.openai.api_server \
  --host 0.0.0.0 --port 3000 \
  --download-dir /data/hub \
  --model macadeliccc/laser-dolphin-mixtral-4x7b-dpo-AWQ \
  --dtype float16 --kv-cache-dtype fp8_e5m2 \
  --max-model-len 12000 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization .98 \
  --enforce-eager \
  --block-size 8 --max-paddings 512 \
  --swap-space 10 \
  --chat-template /home/workspace/chat_templates/chat_ml.jinja \
  --served-model-name dolf \
  --max-context-len-to-capture 512 \
  --max-num-batched-tokens 32000 --max-num-seqs 62 \
  --quantization awq --load-in-4bit
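If the failure really is specific to the AWQ plus --load-in-4bit combination on MoE models, as the thread suggests, a possible interim workaround is the same command with that one flag dropped; this is an inference from the discussion above, not a confirmed fix:

```bash
# Possible workaround implied by the thread: serve the AWQ MoE model with
# --quantization awq alone, dropping --load-in-4bit, which appears to be
# the flag that triggers the failure on Mixtral-style models.
python3 -m aphrodite.endpoints.openai.api_server \
  --host 0.0.0.0 --port 3000 \
  --download-dir /data/hub \
  --model macadeliccc/laser-dolphin-mixtral-4x7b-dpo-AWQ \
  --dtype float16 --kv-cache-dtype fp8_e5m2 \
  --max-model-len 12000 --tensor-parallel-size 2 \
  --gpu-memory-utilization .98 --enforce-eager \
  --quantization awq
```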