Possible to quantize into 4-bit and 8-bit and still use the models #24

Open
regstuff opened this issue Apr 5, 2023 · 1 comment

regstuff commented Apr 5, 2023

Hi, I was wondering whether it's possible to apply something like GPTQ quantization to 8 or 4 bits and still use the embeddings from the models.
GPTQ 4-bit models perform quite well compared to fp16 and fp32 in text generation, but I wasn't sure whether that would carry over to embeddings.
Any suggestions?

@Muennighoff
Owner

I haven't looked into that. It would likely reduce the expressivity of the embeddings, so I would expect worse results, but they may still be good enough to make the saved compute worth it.
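
One rough way to get a feel for the size of the loss is the toy sketch below. It uses plain symmetric round-to-nearest quantization, which is a simpler stand-in and not GPTQ itself, and a random matrix named `weights` as a hypothetical projection layer rather than a real model. It only shows how the output direction drifts as the bit width drops.

```python
import numpy as np

def quantize_dequantize(w, bits):
    """Symmetric per-row round-to-nearest quantization, mapped back to float.

    This is a simplified stand-in for illustration, not the GPTQ algorithm.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 512))  # hypothetical projection layer
x = rng.normal(size=512)               # hypothetical hidden state
reference = weights @ x                # full-precision "embedding"

fidelity = {}
for bits in (8, 4):
    approx = quantize_dequantize(weights, bits) @ x
    fidelity[bits] = cosine(reference, approx)
    print(f"{bits}-bit: cosine to full-precision output = {fidelity[bits]:.6f}")
```

Whether a drift of that size matters would have to be measured on an actual retrieval or similarity benchmark with the real model.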

In standard language modeling, the final output vectors are reduced to discrete tokens, so being off by e.g. 0.0001 due to reduced precision may not change the generated token; hence the performance impact is small.
In embeddings, however, the continuous output vectors are used directly for comparison with other vectors, e.g. via cosine similarity. Being off by 0.0001 will almost always change the resulting similarity score, even if only slightly.
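
The contrast can be sketched in a few lines of NumPy. Everything here is an assumption made for illustration: `logits`, `query` and `doc` are random stand-ins rather than outputs of any real model, and the perturbation is uniform noise of at most 0.0001 per component.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Hypothetical stand-ins: logits over a 1000-token vocabulary
# and two 768-dimensional embedding vectors.
logits = rng.normal(size=1000)
query = rng.normal(size=768)
doc = rng.normal(size=768)

eps = 1e-4  # per-component error, matching the 0.0001 figure above
noisy_logits = logits + rng.uniform(-eps, eps, size=logits.shape)
noisy_query = query + rng.uniform(-eps, eps, size=query.shape)

# Generation: only the argmax is kept, so small errors are usually absorbed.
same_token = int(np.argmax(logits)) == int(np.argmax(noisy_logits))

# Embeddings: the continuous score itself is the output, so the error shows up.
score_shift = abs(cosine(query, doc) - cosine(noisy_query, doc))

print("greedy token unchanged:", same_token)
print("cosine similarity shift:", score_shift)
```

The shift is tiny for a perturbation this small; what matters in practice is whether the larger errors from real quantization are enough to reorder the nearest neighbours.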
