
Support LLM large models #2678

Open
shifeiwen opened this issue Jan 28, 2024 · 4 comments
Labels
question Further information is requested

Comments

@shifeiwen

Motivation: LLMs are currently transforming people's lives, and Qualcomm's mobile devices play an important role in them. Qualcomm advertises that it can run a 7B LLM at a decoding speed of 20 tokens/s, and I assume AIMET is involved in achieving this. However, current mobile LLM quantization techniques are largely based on W4A16 group quantization. I would like to know when AIMET can provide an example of W4A16 group quantization for an open-source LLM model, so that I can try to reach a higher level of performance with QNN inference.
Request: W4A16 group quantization of LLM models.
Current attempt: I tried adding the dequantization used in MLC-LLM as an operator in a third-party OpPackage; I can verify the CPU version, but HTP has caused me great difficulty, both in the completeness of the documentation and in the errors reported during compilation. I am still trying, but I hope AIMET can provide a similar example. Thanks.

@quic-hitameht
Contributor

Tagging @quic-sendilk @quic-hsukumar here.

@quic-hitameht added the question (Further information is requested) label on Jan 29, 2024
@quic-mangal
Contributor

@shifeiwen, can you explain what you mean by group quantization?

Are you looking for a Jupyter NB example which shows quantization simulation for an LLM model?

@shifeiwen
Author

@quic-mangal
In CNNs we usually quantize the convolution kernels at per-channel or per-layer granularity. In LLMs, however, the dominant operation is matrix multiplication, where vector-wise quantization (over rows or columns of a tensor) gives more accurate results. For a matrix multiplication A*B=C, instead of conventional per-tensor quantization, each row of A and each column of B gets its own quantization parameters; the integer (INT) computation is then performed, and the result is converted back to floating point. This is per-vector quantization.

As LLM parameter counts grow, the accuracy demands on quantization keep rising, and row-wise quantization of X together with column-wise quantization of W is no longer accurate enough. It is therefore now common to split the FP16 elements of each row (or each column) into consecutive groups of k elements (k is usually an integer power of 2; common values are 128 and 256), each group with its own quantization parameters. This is called per-group quantization.

Well-known quantization schemes along these lines include GPTQ and AWQ. I hope my explanation is clear. As far as I understand, quantization in QNN currently only supports the per-channel approach. All of the above refers to PTQ.
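
To make the layout concrete, here is a minimal sketch of per-group (blockwise) asymmetric W4A16 weight quantization in PyTorch. The group size, the unsigned 4-bit range, and the function names are illustrative assumptions, not AIMET or QNN APIs.

```python
# Minimal sketch of per-group (blockwise) W4A16 weight quantization.
# Illustrative only: group_size, the [0, 15] INT4 range, and the helper
# names are assumptions for this example, not AIMET/QNN functionality.
import torch

def quantize_per_group(weight: torch.Tensor, group_size: int = 128):
    """Quantize a [out_features, in_features] FP16 weight to INT4 with one
    scale/zero-point per contiguous group of `group_size` elements in a row."""
    out_f, in_f = weight.shape
    assert in_f % group_size == 0, "in_features must be divisible by group_size"
    w = weight.float().reshape(out_f, in_f // group_size, group_size)

    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0      # 4 bits -> 16 levels
    zero_point = torch.round(-w_min / scale)

    q = torch.clamp(torch.round(w / scale) + zero_point, 0, 15).to(torch.uint8)
    return q.reshape(out_f, in_f), scale.squeeze(-1), zero_point.squeeze(-1)

def dequantize_per_group(q, scale, zero_point, group_size: int = 128):
    """Reconstruct FP16 weights from INT4 codes and per-group parameters."""
    out_f, in_f = q.shape
    qg = q.reshape(out_f, in_f // group_size, group_size).float()
    w = (qg - zero_point.unsqueeze(-1)) * scale.unsqueeze(-1)
    return w.reshape(out_f, in_f).half()

# Usage: round-trip a random weight and inspect the reconstruction error.
w = torch.randn(512, 512, dtype=torch.float16)
q, s, zp = quantize_per_group(w, group_size=128)
w_hat = dequantize_per_group(q, s, zp, group_size=128)
print((w.float() - w_hat.float()).abs().mean())
```

Each row of a 512x512 weight carries 4 scale/zero-point pairs here (512 / 128), which is what distinguishes per-group from per-channel granularity, where a row carries exactly one.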

@quic-mangal
Contributor

@shifeiwen, we don't support block quantization ATM. Only per-tensor and per-channel quantization are supported.
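
For contrast, here is a minimal sketch of per-channel symmetric quantization (one scale per output channel), i.e. the finest weight granularity referred to as supported above. It is illustrative only and not the AIMET implementation.

```python
# Minimal sketch of per-channel symmetric weight quantization.
# Illustrative only; bit-width and helper name are assumptions for this example.
import torch

def quantize_per_channel(weight: torch.Tensor, bits: int = 8):
    """Symmetric per-channel quantization of a [out_features, in_features] weight:
    one scale per output channel (row)."""
    qmax = 2 ** (bits - 1) - 1
    scale = weight.float().abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight.float() / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale.squeeze(-1)

w = torch.randn(512, 512, dtype=torch.float16)
q, s = quantize_per_channel(w)      # one scale per row (output channel)
print(q.shape, s.shape)             # torch.Size([512, 512]) torch.Size([512])
```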
