[Bug]: Flash attention cannot be used on v0.5.3 #468
Your current environment

🐛 Describe the bug

I just did a fresh git clone, then ran ./update-runtime.sh, then installed flash-attn with ./runtime pip install flash-attn. This still results in Aphrodite not using flash-attention, even though flash-attn is already installed.
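For anyone debugging the same thing, a quick first step is to confirm what the runtime environment actually sees. Below is a minimal sketch (not part of the repo); it assumes the ./runtime wrapper used for pip above can also run a script, e.g. ./runtime python check_fa.py, and the file name check_fa.py is just an example:

```python
# check_fa.py -- minimal sanity check (example script, not part of the repo).
# Run it through the same environment used for installation, e.g.:
#   ./runtime python check_fa.py
import importlib.util

import torch

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)

# find_spec returns None if the package is not visible in this environment.
spec = importlib.util.find_spec("flash_attn")
if spec is None:
    print("flash_attn is NOT visible in this environment")
else:
    import flash_attn
    print("flash_attn:", flash_attn.__version__, "at", spec.origin)
```

Comments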
Looks like installing flash-attn with our torch version doesn't work:
I'll look into it. Thanks for reporting.
I have flash attention installed and compiled it from source to support the new torch, but it still says it isn't found. I'll double-check it. I recompiled it again after deleting build and dist. Sadly it doesn't work on 3 GPUs, and a 5-bit 70B won't fit on 2, despite fitting in textgen.
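One way to tell whether a source build actually loads, as opposed to merely being installed, is to call the kernel directly. If the compiled CUDA extension was built against a mismatched torch, the import or the call below typically fails. A minimal smoke test, assuming an fp16-capable CUDA GPU is available:

```python
# Smoke test: call the flash-attn kernel directly on tiny tensors.
# If the compiled CUDA extension does not match the running torch,
# this usually fails at import time or on the first call.
import torch
from flash_attn import flash_attn_func

# flash_attn_func expects (batch, seqlen, nheads, headdim) in fp16/bf16 on CUDA.
q = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(1, 128, 8, 64, dtype=torch.float16, device="cuda")

out = flash_attn_func(q, k, v, causal=True)
print("flash-attn kernel OK, output shape:", tuple(out.shape))
```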
It seems to work in the new commit now.
I can use it and it works, but it's slightly slower: 9 tok/s with it activated vs. 11.5 tok/s deactivated, running inference on Llama3-70B-8bpw across 4x 3090 GPUs.
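End-to-end tok/s depends on far more than the attention kernel, but a rough kernel-level comparison can at least show whether flash-attn itself is slower than torch's built-in SDPA on a given card. A sketch using CUDA event timing; the shapes here are arbitrary examples, not the actual model's dimensions:

```python
# Rough kernel-level timing: flash-attn vs torch's built-in SDPA.
# End-to-end tok/s involves much more than this, so treat it only
# as a sanity check. Shapes below are arbitrary examples.
import torch
from flash_attn import flash_attn_func

def time_cuda(fn, iters=50):
    # Warm up, then time with CUDA events for accurate GPU timing.
    for _ in range(5):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

b, s, h, d = 1, 2048, 32, 128
q = torch.randn(b, s, h, d, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

fa_ms = time_cuda(lambda: flash_attn_func(q, k, v, causal=True))

# SDPA wants (batch, heads, seqlen, headdim), so transpose first.
qt, kt, vt = (t.transpose(1, 2) for t in (q, k, v))
sdpa_ms = time_cuda(
    lambda: torch.nn.functional.scaled_dot_product_attention(qt, kt, vt, is_causal=True)
)

print(f"flash-attn: {fa_ms:.3f} ms | torch SDPA: {sdpa_ms:.3f} ms")
```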
I thought vLLM supported a Triton-based FA for all (tensor) cards. I was hoping to try it here, but instead it used the normal FA package.
It actually stopped working again now when I try to reinstall on the latest commit. Not sure why it worked once before.
same here |