Possible to quantize into 4-bit and 8-bit and still use the models #24

regstuff · 2023-04-05T06:56:54Z

Hi, I was wondering whether it's possible to apply something like GPTQ quantization to 8-bit or 4-bit and still use the embeddings from the models.
GPTQ 4-bit models perform quite well compared to fp16 and fp32 in text generation. I wasn't sure whether the same would hold for embeddings.
Any suggestions?

Muennighoff · 2023-04-05T10:28:52Z

I haven't looked into that. It would likely reduce the expressivity of the embeddings, so I would expect worse results, but it may still be good enough to make the saved compute worth it.

In standard language modelling, the final output vectors are reduced to discrete tokens, so being off by e.g. 0.0001 due to reduced precision may not change the generated token; hence the performance impact is small.
With embeddings, however, the continuous output vectors are compared directly with other vectors, e.g. via cosine similarity, so being off by 0.0001 will generally change the resulting similarity score.
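
To make the contrast concrete, here is a minimal sketch in plain Python. The vectors, the noise pattern, and the helper names (`cosine`, `argmax`, `perturb`) are made up for illustration; the ±0.0001 noise only stands in for a precision error and is not an actual GPTQ quantization of a model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def perturb(values, noise):
    """Add a small per-element error, standing in for precision loss."""
    return [v + n for v, n in zip(values, noise)]

# Hypothetical precision error of magnitude 0.0001 with mixed signs.
eps = 1e-4

# Generation: the logits are reduced to one discrete token via argmax.
logits = [2.0, 1.0, 0.5]
noisy_logits = perturb(logits, [-eps, eps, eps])
token_exact = argmax(logits)
token_noisy = argmax(noisy_logits)

# Embeddings: the continuous vector is compared directly to another vector.
query = [1.0, 2.0, 3.0]
doc = [3.0, 2.0, 1.0]
noisy_query = perturb(query, [eps, -eps, eps])
sim_exact = cosine(query, doc)
sim_noisy = cosine(noisy_query, doc)

print("same token:", token_exact == token_noisy)
print("similarity shift:", abs(sim_noisy - sim_exact))
```

In this toy case the chosen token is unaffected because the gap between the top two logits is far larger than the error, while the similarity score shifts by a small non-zero amount. Whether a shift of that size matters in practice depends on how close the scores being ranked are, which this sketch does not measure.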
