
[Feature]: Exllamav2 Q4, Q6, and Q8 cache #463

Open
Anthonyg5005 opened this issue May 9, 2024 · 3 comments

Comments

@Anthonyg5005

Anthonyg5005 commented May 9, 2024

🚀 The feature, motivation and pitch

I only found a discussion asking about this, but from the published evaluation it seems that Q4 cache is now better than FP8 and close to (almost equal to) the FP16 cache. I don't use this engine myself and am just looking from the outside, but I believe this may benefit users who are trying to squeeze in a bit more context without reducing overall accuracy by much.

Additional context

Here's the evaluation between the different cache types: turboderp/exllamav2/doc/qcache_eval.md
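The memory argument behind the pitch can be illustrated with some back-of-the-envelope arithmetic. This is a rough sketch, not from the issue or the linked evaluation: the model shapes below are hypothetical Llama-7B-like values, and the overhead of quantization scales/zero-points is ignored, so real savings will be somewhat smaller.

```python
# Rough KV-cache sizing sketch (hypothetical Llama-7B-like shapes:
# 32 layers, 32 KV heads, head_dim 128). Ignores quantization
# metadata overhead (scales/zeros), so treat the Q4 number as a
# lower bound on real memory use.

def kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128,
                             bits_per_element=16):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * bits_per_element / 8

fp16 = kv_cache_bytes_per_token(bits_per_element=16)
q4 = kv_cache_bytes_per_token(bits_per_element=4)

print(f"FP16 cache: {fp16 / 2**20:.3f} MiB/token")  # 0.500 MiB/token
print(f"Q4 cache:   {q4 / 2**20:.3f} MiB/token")    # 0.125 MiB/token
print(f"Context at equal memory: {fp16 / q4:.0f}x") # 4x
```

Under these assumptions a Q4 cache fits roughly 4x the context of FP16 in the same memory, which is the trade-off the evaluation linked above measures for accuracy.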

@AlpinDale
Member

It's definitely a planned feature. I believe @sgsdxzy wanted to work on it.

@Anthonyg5005
Author

Alright, feel free to close this issue when that's done.

@Anthonyg5005
Author

Anthonyg5005 commented Jun 8, 2024

Also, an update on this: FP8 cache may be removed from exllamav2 sometime in the future, and Q8 and Q6 cache are now in the master branch.

@Anthonyg5005 Anthonyg5005 changed the title [Feature]: Exllamav2 Q4 cache [Feature]: Exllamav2 Q4, Q6, and Q8 cache Jun 9, 2024