
Support LLM large models #2678

Open
shifeiwen opened this issue Jan 28, 2024 · 4 comments
Labels
question Further information is requested

Comments

@shifeiwen

Motivation: LLMs are currently transforming people's lives, and Qualcomm's mobile devices play an important role in them. Qualcomm advertises that it can run a 7B LLM at a decoding speed of 20 tokens/s, and I assume AIMET is involved in achieving this. However, current mobile LLM quantization techniques are largely based on W4A16 group quantization. I would like to know when AIMET can provide an example of W4A16 group quantization for an open-source LLM model, so that I can try to reach a higher level of performance with QNN inference.
Request: W4A16 group quantization of LLM models.
Current attempt: I tried adding the dequantization used in MLC-LLM as an operator in a third-party OpPackage; I can verify the CPU version, but HTP has caused me great difficulty, both in the completeness of the documentation and in the errors reported during compilation. I am still trying, but I hope AIMET can provide a similar example. Thanks.

@quic-hitameht
Contributor

Tagging @quic-sendilk @quic-hsukumar here.

@quic-hitameht added the question (Further information is requested) label on Jan 29, 2024
@quic-mangal
Contributor

@shifeiwen, can you explain what you mean by group quantization?

Are you looking for a Jupyter NB example which shows quantization simulation for an LLM model?

@shifeiwen
Author

@quic-mangal
In CNNs we usually quantize the convolution kernels at per-channel or per-layer granularity. In LLMs, however, the dominant operation is matrix multiplication, where vector-wise quantization (over rows or columns of a tensor) gives more accurate results. For a matrix multiplication A*B=C, instead of conventional per-tensor quantization, each row of A and each column of B gets its own quantization parameters; the integer (INT) computation is then performed, and the result is converted back to floating point. This is per-vector quantization.

As LLM parameter counts grow, the accuracy demands on quantization keep rising, and row-wise quantization of X together with column-wise quantization of W is no longer accurate enough. It is therefore now common to split the FP16 elements of each row (or each column) into consecutive groups of k elements (k is usually an integer power of 2; common values are 128 and 256), each group with its own quantization parameters. This is called per-group quantization.

Well-known quantization schemes along these lines include GPTQ and AWQ. I hope my explanation is clear. As far as I understand, quantization in QNN currently only supports the per-channel approach. All of the above refers to PTQ.
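
To make the layout concrete, here is a minimal sketch of per-group (blockwise) asymmetric W4A16 weight quantization in PyTorch. The group size, the unsigned 4-bit range, and the function names are illustrative assumptions, not AIMET or QNN APIs.

```python
# Minimal sketch of per-group (blockwise) W4A16 weight quantization.
# Illustrative only: group_size, the [0, 15] INT4 range, and the helper
# names are assumptions for this example, not AIMET/QNN functionality.
import torch

def quantize_per_group(weight: torch.Tensor, group_size: int = 128):
    """Quantize a [out_features, in_features] FP16 weight to INT4 with one
    scale/zero-point per contiguous group of `group_size` elements in a row."""
    out_f, in_f = weight.shape
    assert in_f % group_size == 0, "in_features must be divisible by group_size"
    w = weight.float().reshape(out_f, in_f // group_size, group_size)

    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0      # 4 bits -> 16 levels
    zero_point = torch.round(-w_min / scale)

    q = torch.clamp(torch.round(w / scale) + zero_point, 0, 15).to(torch.uint8)
    return q.reshape(out_f, in_f), scale.squeeze(-1), zero_point.squeeze(-1)

def dequantize_per_group(q, scale, zero_point, group_size: int = 128):
    """Reconstruct FP16 weights from INT4 codes and per-group parameters."""
    out_f, in_f = q.shape
    qg = q.reshape(out_f, in_f // group_size, group_size).float()
    w = (qg - zero_point.unsqueeze(-1)) * scale.unsqueeze(-1)
    return w.reshape(out_f, in_f).half()

# Usage: round-trip a random weight and inspect the reconstruction error.
w = torch.randn(512, 512, dtype=torch.float16)
q, s, zp = quantize_per_group(w, group_size=128)
w_hat = dequantize_per_group(q, s, zp, group_size=128)
print((w.float() - w_hat.float()).abs().mean())
```

Each row of a 512x512 weight carries 4 scale/zero-point pairs here (512 / 128), which is what distinguishes per-group from per-channel granularity, where a row carries exactly one.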

@quic-mangal
Contributor

@shifeiwen, we don't support block quantization ATM. Only per-tensor and per-channel quantization are supported.
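
For contrast, here is a minimal sketch of per-channel symmetric quantization (one scale per output channel), i.e. the finest weight granularity referred to as supported above. It is illustrative only and not the AIMET implementation.

```python
# Minimal sketch of per-channel symmetric weight quantization.
# Illustrative only; bit-width and helper name are assumptions for this example.
import torch

def quantize_per_channel(weight: torch.Tensor, bits: int = 8):
    """Symmetric per-channel quantization of a [out_features, in_features] weight:
    one scale per output channel (row)."""
    qmax = 2 ** (bits - 1) - 1
    scale = weight.float().abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight.float() / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale.squeeze(-1)

w = torch.randn(512, 512, dtype=torch.float16)
q, s = quantize_per_channel(w)      # one scale per row (output channel)
print(q.shape, s.shape)             # torch.Size([512, 512]) torch.Size([512])
```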
