
[Bug]: Cannot use FlashAttention-2 backend because the flash_attn package is not found #4906

Open
maxin9966 opened this issue May 19, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@maxin9966

Your current environment

Driver Version: 545.23.08
CUDA Version: 12.3
python3.9
vllm 0.4.2
flash_attn 2.4.2 through 2.5.8 (I have tried various versions of flash_attn)
torch 2.3

🐛 Describe the bug

Cannot use FlashAttention-2 backend because the flash_attn package is not found. Please install it for better performance.
Using XFormers backend.
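Before switching backends, it can help to run the same import that vLLM's backend selector attempts. This is only a hedged diagnostic sketch: a flash_attn wheel built against a different torch or CUDA version than the one installed will also fail to import, which produces the same fallback message.

# Hedged diagnostic: check whether the flash_attn module that vLLM 0.4.2 probes
# is importable in the same environment that runs vllm.
python -c "import flash_attn" \
  && echo "flash_attn importable" \
  || echo "flash_attn could not be imported (missing package or torch/CUDA ABI mismatch)"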

maxin9966 added the bug label on May 19, 2024
@maxin9966
Author

GPU: 4070 Ti Super
OS: Ubuntu 22

@bbeijy

bbeijy commented May 20, 2024

I ran into the same problem.

@atineoSE

Since #4686, we can use vllm-flash-attn instead of flash-attn.

This is not yet available in the latest release (v0.4.2), but you can build a new vLLM wheel from source. Here is how I did it:

git clone git@github.com:vllm-project/vllm.git
cd vllm
sudo docker build --target build -t vllm_build .
container_id=$(sudo docker create --name vllm_temp vllm_build:latest)
sudo docker cp ${container_id}:/workspace/dist .

This builds the image up to the build stage, which contains the vLLM wheel in the /workspace/dist directory. We can then extract it with docker cp.

Then install with:

pip install vllm-flash-attn
pip install dist/vllm-0.4.2+cu124-cp310-cp310-linux_x86_64.whl

Now you can run vllm and get:

Using FlashAttention-2 backend.
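A quick way to confirm the backend after installing the rebuilt wheel is to construct an LLM and watch the selector log line. This is a minimal sketch; facebook/opt-125m is just a small placeholder model, and any model you have locally or on the Hub works.

# Minimal verification sketch (placeholder model; adjust to your setup).
python -c "from vllm import LLM; LLM(model='facebook/opt-125m')" 2>&1 | grep -i backend
# Expect a line like: INFO ... selector.py ... Using FlashAttention-2 backend.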

@maxin9966
Author

@atineoSE Thank you very much. By the way, does vllm-flash-attn support Turing architecture GPUs like the 2080ti?

@simonwei97

simonwei97 commented May 24, 2024

I have the same problem on Linux (CentOS 7).

My Env

torch                             2.3.0
xformers                          0.0.26.post1
vllm                              0.4.2
vllm-flash-attn                   2.5.8.post2
vllm_nccl_cu12                    2.18.1.0.4.0

CUDA

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:26:00.0 Off |                    0 |
| N/A   25C    P0    56W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |

Problem:

INFO 05-24 16:04:56 selector.py:81] Cannot use FlashAttention-2 backend because the flash_attn package is not found. Please install it for better performance.
INFO 05-24 16:04:56 selector.py:32] Using XFormers backend.
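If I read the thread correctly, the 0.4.2 wheel from PyPI still probes the flash_attn module, while the vllm-flash-attn package installs under the vllm_flash_attn import name, so having only vllm-flash-attn installed still triggers the XFormers fallback. A hedged check of both imports (the vllm_flash_attn module name is an assumption):

# Hedged check: test both module names in the environment that runs vllm.
python -c "import flash_attn" && echo "flash_attn importable" || echo "flash_attn missing"
python -c "import vllm_flash_attn" && echo "vllm_flash_attn importable" || echo "vllm_flash_attn missing"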

@mces89

mces89 commented May 25, 2024

@atineoSE Can you share the wheel somewhere? I cannot compile it with this Docker setup. Thanks.

@atineoSE

@mces89 You have to compile for your own architecture, so the wheel is not universal. You can use the steps above.

Alternatively, you can:

  • use the Docker version of the current release, v0.4.2, as explained here (support for FlashAttention-2 is built in); a minimal docker run sketch follows below
  • wait until the next release of the pip package, as explained here (support for vllm-flash-attn will be included)
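For the first option, a minimal sketch of running the published image (the vllm/vllm-openai image name, the v0.4.2 tag, and the model are assumptions; adjust them to your setup):

# Minimal sketch for the Docker route; image tag and model are placeholders.
# Add --ipc=host or a larger shm size if your setup needs it.
sudo docker run --gpus all -p 8000:8000 \
    vllm/vllm-openai:v0.4.2 \
    --model facebook/opt-125m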

@dymil

dymil commented May 28, 2024

There's no absolute need to go through Docker. I just looked at the instructions in the README to build from source and ran
pip install vllm@git+https://github.com/vllm-project/vllm
That seemed to get me further (I'm now dealing with an unrelated error, so I can't confirm everything works end to end).
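If the source build succeeds, a hedged follow-up check is to confirm that the build pulled in vllm-flash-attn and that the installed vllm imports cleanly:

# Hedged follow-up check after installing from source.
pip show vllm-flash-attn                         # should be listed if the build pulled it in
python -c "import vllm; print(vllm.__version__)"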
