Quantization support #163

Open
generalsvr opened this issue Oct 16, 2023 · 8 comments

@generalsvr

How do I use 8-bit quantized models? Can I run GGML/GGUF models?

@hiworldwzj
Collaborator

8-bit weight-only quantization is only supported for LLaMA right now.

@generalsvr
Author

Any examples?

@hiworldwzj
Collaborator

parser.add_argument("--mode", type=str, default=[], nargs='+',
                    help="Model mode: [int8kv] [int8weight | int4weight]")

@XHPlus
Contributor

XHPlus commented Oct 19, 2023

As for the model file format, we have not tested GGML/GGUF so far. What is your motivation for using these formats?

@JustinLin610

Will GPTQ be supported?

@suhjohn

suhjohn commented Nov 14, 2023

@XHPlus There are a lot of open-source quantized models on Hugging Face, largely driven by https://huggingface.co/TheBloke. Many people in the open-source community run those quantized models on TGI / vLLM.

@adi

adi commented Feb 8, 2024

parser.add_argument("--mode", type=str, default=[], nargs='+',
                    help="Model mode: [int8kv] [int8weight | int4weight]")

Using this option with Llama2-13B gives this error:

_get_exception_class.<locals>.Derived: 'LlamaTransformerLayerWeightQuantized' object has no attribute 'quantize_weight'

I tried --mode int8kv int4weight.

Any suggestions on how to fix this?

@VfBfoerst

@XHPlus In some cases quantization is the only way to run bigger models on smaller GPUs, e.g. Mixtral. With vLLM, I can run Mixtral quantized within 48 GB of VRAM; the unquantized model would need up to 100 GB of VRAM, I guess.
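
For a rough sanity check of those numbers, here is a back-of-the-envelope sketch (it assumes Mixtral 8x7B's roughly 46.7B total parameters and counts weights only, so KV cache and activation memory come on top):

params = 46.7e9  # approximate total parameter count of Mixtral 8x7B

def weight_gib(bits_per_param):
    # Approximate weight memory in GiB at the given precision (weights only).
    return params * bits_per_param / 8 / 1024**3

print(f"fp16: {weight_gib(16):.0f} GiB")  # ~87 GiB
print(f"int8: {weight_gib(8):.0f} GiB")   # ~43 GiB
print(f"int4: {weight_gib(4):.0f} GiB")   # ~22 GiB

That is consistent with 4-bit weights (plus runtime overhead) fitting in 48 GB while fp16 weights do not.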
