Support LLM large models #2678
Comments
Tagging @quic-sendilk @quic-hsukumar here.
@shifeiwen, can you explain what you mean by group quantization? Are you looking for a Jupyter notebook example that shows quantization simulation for an LLM model?
@quic-mangal |
@shifeiwen, we don't support block quantization ATM. Only per-tensor and per-channel quantization are supported. |
Motivation: LLMs are currently revolutionizing people's lives, and Qualcomm's mobile devices play an important role in that. Qualcomm advertises that it can run a 7B LLM on-device at a decoding speed of 20 tokens/s, and I assume AIMET is involved in achieving this. However, current mobile LLM quantization techniques are based on W4A16 group quantization. I would like to know when AIMET can provide an example of applying W4A16 group quantization to an open-source LLM, so that I can try to reach a higher level of performance with QNN inference.
Request: an example of W4A16 group quantization for LLM models.
Current attempt: I tried adding the dequantization from MLC-LLM as an operator in a third-party OpPackage. I can verify the CPU version, but HTP has caused me great difficulty, both in the completeness of the documentation and in the errors reported during compilation, and I am quite confused. I'm still trying, but I hope AIMET can provide a similar example. Thanks. (See the sketch below for what I mean by group quantization.)
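For reference, here is a minimal NumPy sketch of what W4A16 group ("block") quantization means: weights are quantized to 4 bits with one scale per group of values, while activations stay in fp16. The group size of 128, the symmetric scaling scheme, and the function names `quantize_w4_grouped` / `dequantize_w4_grouped` are assumptions for illustration only; this is not AIMET or QNN API usage.

```python
# Illustrative sketch of W4A16 group quantization (assumed scheme: symmetric
# int4 weights, one fp16 scale per group of 128 values, fp16 activations).
import numpy as np

def quantize_w4_grouped(weights, group_size=128):
    """Quantize a 1-D weight vector to int4 with one scale per group."""
    assert weights.size % group_size == 0, "pad weights to a multiple of group_size"
    groups = weights.reshape(-1, group_size)
    # One symmetric scale per group: map the group's max magnitude to the int4 range [-8, 7].
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero for all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_w4_grouped(q, scales):
    """Reconstruct fp16 weights from int4 values and per-group scales."""
    return (q.astype(np.float16) * scales).reshape(-1)

# Usage example on random weights.
w = np.random.randn(1024).astype(np.float32)
q, s = quantize_w4_grouped(w)
w_hat = dequantize_w4_grouped(q, s)
print("max reconstruction error:", np.abs(w - w_hat).max())
```

The difference from per-channel quantization is only the granularity of the scales: per-channel uses one scale per output channel, while group quantization uses one scale per fixed-size block within a channel, which is what schemes like MLC-LLM's W4A16 rely on.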