
AWQ quantization is very slow for ONNX LLMs #1609

Open
PatriceVignola opened this issue Feb 10, 2024 · 1 comment
PatriceVignola commented Feb 10, 2024

I'm not sure if I'm missing an option somewhere, but AWQ quantization for large ONNX models is very slow. When quantizing a 7B LLaMA model, the np.matmul calls below take forever to execute, and at the current pace I estimate it would take days to quantize the model:

```python
org_out = np.matmul(inp, org_weight)  # n_token, oc
cur_out = np.matmul(inp, weight.T)
```

Would it make sense to allow the user to pass either a torch module or an ONNX model/session to compute the loss (or at the very least to do the matmul computation)? Even replacing the np.matmul calls with simple torch.matmul calls on a CUDA device makes them dramatically faster.
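
For example, something along these lines (just a rough sketch to illustrate the idea; `inp`, `org_weight`, and `weight` are the same arrays as in the snippet above, and the device handling is only illustrative):

```python
import torch

# Hypothetical drop-in for the numpy matmuls above: move the operands to the
# GPU once, do both matmuls there, then copy the results back to host numpy.
device = "cuda" if torch.cuda.is_available() else "cpu"

inp_t = torch.from_numpy(inp).to(device)
org_weight_t = torch.from_numpy(org_weight).to(device)
weight_t = torch.from_numpy(weight).to(device)

org_out = torch.matmul(inp_t, org_weight_t).cpu().numpy()  # n_token, oc
cur_out = torch.matmul(inp_t, weight_t.T).cpu().numpy()
```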

Otherwise, is there a workaround or option I'm unaware of that would make it faster? I feel like I might be missing something.

yuwenzho (Collaborator) commented

It takes about 1 hour to run AWQ quantization on the Llama-2-7b model on our test device using the scripts in our llama weight-only quantization example. You can refer to the AWQ options in main.py#L325-L336.

We currently have no plans to support torch tensor computation in our ONNX weight-only quantization. However, you could consider an alternative such as CuPy, a GPU-accelerated drop-in replacement for numpy, and implement that substitution yourself.
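
For instance, a minimal sketch of such a replacement (untested, assuming the operands are numpy float arrays and CuPy is installed with a matching CUDA toolkit):

```python
import numpy as np

try:
    import cupy as cp  # optional GPU backend
except ImportError:
    cp = None

def gpu_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Run a @ b on the GPU via CuPy when available, otherwise fall back to numpy."""
    if cp is None:
        return np.matmul(a, b)
    # Copy operands to device memory, multiply there, and bring the result back to host.
    return cp.asnumpy(cp.matmul(cp.asarray(a), cp.asarray(b)))

# Usage inside the AWQ loss computation (illustrative):
#   org_out = gpu_matmul(inp, org_weight)
#   cur_out = gpu_matmul(inp, weight.T)
```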
