
[Feature]: Exllamav2 Q4, Q6, and Q8 cache #463

Open
Anthonyg5005 opened this issue May 9, 2024 · 3 comments

Comments

@Anthonyg5005

Anthonyg5005 commented May 9, 2024

🚀 The feature, motivation and pitch

I only found a discussion asking about this, but from the published evaluation it seems that Q4 cache is now better than FP8 and close to (almost equal to) the FP16 cache. I don't use this engine myself and am just looking from the outside, but I believe this may benefit users who are trying to squeeze in a bit more context without reducing overall accuracy by much.

Additional context

Here's the evaluation between the different cache types: turboderp/exllamav2/doc/qcache_eval.md
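The memory argument behind the pitch can be illustrated with some back-of-the-envelope arithmetic. This is a rough sketch, not from the issue or the linked evaluation: the model shapes below are hypothetical Llama-7B-like values, and the overhead of quantization scales/zero-points is ignored, so real savings will be somewhat smaller.

```python
# Rough KV-cache sizing sketch (hypothetical Llama-7B-like shapes:
# 32 layers, 32 KV heads, head_dim 128). Ignores quantization
# metadata overhead (scales/zeros), so treat the Q4 number as a
# lower bound on real memory use.

def kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128,
                             bits_per_element=16):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * bits_per_element / 8

fp16 = kv_cache_bytes_per_token(bits_per_element=16)
q4 = kv_cache_bytes_per_token(bits_per_element=4)

print(f"FP16 cache: {fp16 / 2**20:.3f} MiB/token")  # 0.500 MiB/token
print(f"Q4 cache:   {q4 / 2**20:.3f} MiB/token")    # 0.125 MiB/token
print(f"Context at equal memory: {fp16 / q4:.0f}x") # 4x
```

Under these assumptions a Q4 cache fits roughly 4x the context of FP16 in the same memory, which is the trade-off the evaluation linked above measures for accuracy.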

@AlpinDale
Member

It's definitely a planned feature. I believe @sgsdxzy wanted to work on it.

@Anthonyg5005
Author

Alright, feel free to close this issue when that's done.

@Anthonyg5005
Author

Anthonyg5005 commented Jun 8, 2024

Also, an update on this: FP8 cache may be removed from exllamav2 sometime in the future, and Q8 and Q6 cache are now in the master branch.

@Anthonyg5005 Anthonyg5005 changed the title [Feature]: Exllamav2 Q4 cache [Feature]: Exllamav2 Q4, Q6, and Q8 cache Jun 9, 2024