
Add support for INT4/UINT4 #1712

Open
WilliamTambellini opened this issue Aug 29, 2023 · 7 comments

@WilliamTambellini (Contributor)

Summary

Add support for INT4 and/or UINT4
Refs:
https://intellabs.github.io/distiller/quantization.html
https://developer.nvidia.com/blog/int4-for-ai-inference/
https://arxiv.org/abs/2301.12017
https://arxiv.org/pdf/2306.11987.pdf
https://www.xilinx.com/support/documents/white_papers/wp521-4bit-optimization.pdf

Problem statement

Fast, low-precision 4-bit quantized matmul.

Preferred solution

A new oneDNN data type and, at minimum, a quantized matmul primitive (no need for full arithmetic/math coverage).

@WilliamTambellini added the enhancement (A feature or an optimization request) label on Aug 29, 2023
@vpirogov (Member) commented on Sep 1, 2023

This is on our radar, in particular in context of large language models. As the references indicate this area is an active field of research, in particular focusing on techniques to minimize accuracy loss. Are there any specific quantization approaches and usage models you have in mind? Anything validated in production setting?

@vpirogov self-assigned this on Sep 1, 2023
@WilliamTambellini (Contributor, Author) commented on Sep 7, 2023

Are there any specific quantization approaches and usage models you have in mind?

I cannot legally say much, but there are already some open-source LLM quantizers, e.g.:
https://github.com/PanQiWei/AutoGPTQ
though they may require calibration samples to quantize on.
Be aware of the distinction between static and dynamic quantization.
Most Transformer decoders would do the job for testing, e.g.:
https://huggingface.co/Qwen/Qwen-7B-Chat-Int4#%E9%87%8F%E5%8C%96-quantization
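For illustration, here is a minimal sketch of what static, weight-only int4 quantization produces (4-bit integers plus one fp32 scale per group). This is plain round-to-nearest and deliberately not AutoGPTQ's actual algorithm, which additionally uses calibration samples and error compensation:

```cpp
// Minimal sketch: per-group symmetric round-to-nearest int4 weight quantization.
// Shows only the data layout a GPTQ-style quantizer emits (s4 values + scales),
// not the error-compensating algorithm itself.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

void quantize_int4(const std::vector<float> &w, size_t group_size,
                   std::vector<int8_t> &q, std::vector<float> &scales) {
    q.resize(w.size());
    scales.assign((w.size() + group_size - 1) / group_size, 1.f);
    for (size_t g = 0; g * group_size < w.size(); ++g) {
        const size_t begin = g * group_size;
        const size_t end = std::min(begin + group_size, w.size());
        float amax = 0.f;
        for (size_t i = begin; i < end; ++i)
            amax = std::max(amax, std::fabs(w[i]));
        const float scale = amax > 0.f ? amax / 7.f : 1.f; // map |max| to +7
        scales[g] = scale;
        for (size_t i = begin; i < end; ++i) {
            const long v = std::lround(w[i] / scale);
            q[i] = static_cast<int8_t>(std::clamp<long>(v, -8, 7)); // s4 range
        }
    }
}
```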

Anything validated in production setting?

I cannot reply publicly, but as long as the perplexity after quantization stays "close" to the bf16 baseline, you should be good.
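For reference, perplexity here is just the exponential of the average per-token negative log-likelihood; a minimal sketch of the comparison metric (token log-probabilities assumed to come from whatever evaluation harness is used):

```cpp
// Perplexity = exp(-(1/N) * sum_i log p(token_i)). Compare the value obtained
// with int4-quantized weights against the bf16 baseline on the same text.
#include <cmath>
#include <vector>

double perplexity(const std::vector<double> &token_log_probs) {
    double nll = 0.0;
    for (double lp : token_log_probs) nll -= lp; // accumulate negative log-likelihood
    return std::exp(nll / static_cast<double>(token_log_probs.size()));
}
```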

Could you confirm that Sapphire Rapids CPUs (4th gen Xeon) do not appear to have any hardware support (AMX, ...) for s4 or u4 math?

@vpirogov (Member)

@WilliamTambellini, you don't necessarily need s4/u4 math to take advantage of low precision. I believe the most viable use cases focus on using s4/u4 as a storage format for weights, with the math done in int8 or fp16. So for oneDNN the question effectively boils down to which quantization scheme for these data types would be viable.
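To make the storage-format idea concrete, a minimal sketch of packing two signed 4-bit values per byte and unpacking them back to int8 before the int8/fp16 math. This is just the general technique, not oneDNN's internal weight layout:

```cpp
// Two signed 4-bit values share one byte; they are widened to int8 (or fp16)
// just before the matmul, so no native 4-bit arithmetic is required.
#include <cstdint>

inline uint8_t pack_s4(int8_t lo, int8_t hi) { // both in [-8, 7]
    return static_cast<uint8_t>((lo & 0x0F) | ((hi & 0x0F) << 4));
}

inline void unpack_s4(uint8_t packed, int8_t &lo, int8_t &hi) {
    // Shift each nibble into the top of a byte, then arithmetic-shift back
    // to sign-extend it from 4 to 8 bits.
    lo = static_cast<int8_t>((packed & 0x0F) << 4) >> 4;
    hi = static_cast<int8_t>(packed & 0xF0) >> 4;
}
```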

@WilliamTambellini (Contributor, Author)

FYI: onnx/onnx#5811

@igorsafo (Contributor)

Hi @WilliamTambellini,
Yes, we are aware of int4 support in OpenVINO. The following RFCs target GPT-Q support in oneDNN:

@vpirogov assigned igorsafo and unassigned vpirogov on Dec 19, 2023
@vpirogov (Member)

API, validation, and GPU optimizations for int4 have landed in the main branch, targeting oneDNN v3.5.
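For anyone landing here, a hedged usage sketch of what the int4 path can look like with the oneDNN C++ API: f16 activations times s4 weights with per-output-channel scales, weights decompressed to f16 for the math. The exact names assumed here (data_type::s4, set_scales_mask, the apply_to_int flag of set_fpmath_mode) should be checked against the v3.5 matmul and quantization dev guides:

```cpp
// Hedged sketch, not verified against a specific release: f16 x s4 matmul with
// weight decompression. Grouped/per-block weight scales have their own
// attribute form; see the oneDNN quantization documentation.
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

matmul::primitive_desc make_int4_weight_matmul(const engine &eng,
        memory::dim M, memory::dim K, memory::dim N) {
    memory::desc src_md({M, K}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s4, memory::format_tag::any);
    memory::desc dst_md({M, N}, memory::data_type::f16, memory::format_tag::ab);

    primitive_attr attr;
    // Per-output-channel (N) dequantization scales for the s4 weights.
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, 1 << 1);
    // Ask the library to up-convert the integer weights and run the math in f16.
    attr.set_fpmath_mode(fpmath_mode::f16, /*apply_to_int=*/true);

    return matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
}
```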

@WilliamTambellini (Contributor, Author)

Thanks, @vpirogov.
