[Bug]: Flash attention cannot be used on v0.5.3 #468

Open
Nero10578 opened this issue May 12, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@Nero10578

Your current environment

./runtime.sh python env.py
Collecting environment information...
PyTorch version: 2.3.0
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (conda-forge gcc 11.3.0-19) 11.3.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.35
Python version: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 552.22
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      39 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             16
On-line CPU(s) list:                0-15
Vendor ID:                          GenuineIntel
Model name:                         11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz
CPU family:                         6
Model:                              167
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           7007.99
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities
Virtualization:                     VT-x
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          384 KiB (8 instances)
L1i cache:                          256 KiB (8 instances)
L2 cache:                           4 MiB (8 instances)
L3 cache:                           16 MiB (1 instance)
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[conda] blas                      2.16                        mkl    conda-forge
[conda] libblas                   3.8.0                    16_mkl    conda-forge
[conda] libcblas                  3.8.0                    16_mkl    conda-forge
[conda] liblapack                 3.8.0                    16_mkl    conda-forge
[conda] liblapacke                3.8.0                    16_mkl    conda-forge
[conda] mkl                       2020.2                      256
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] pytorch                   2.3.0           py3.11_cuda12.1_cudnn8.9.2_0    pytorch
[conda] pytorch-cuda              12.1                 ha16c6d3_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchtriton               2.3.0                     py311    pytorch
ROCM Version: Could not collect
Aphrodite Version: 0.5.3
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

🐛 Describe the bug

I did a fresh git clone and ran ./update-runtime.sh, then installed flash-attn with ./runtime.sh pip install flash-attn.

Aphrodite still does not use flash-attention even though flash-attn is already installed. Full startup log below; a quick import check to verify the runtime actually sees the package follows the log.

./runtime.sh python -m aphrodite.endpoints.openai.api_server \
--model /home/owen/models/Llama-3-8B-Instruct-COT-v0.1 \
--gpu-memory-utilization 0.80 --max-model-len 8192 --port 8000 --kv-cache-dtype fp8 \
--served-model-name OwenTest --enforce-eager true --max-num-seqs 160
INFO:     Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. But it
may cause slight accuracy drop without scaling factors. FP8_E5M2 (without scaling) is only supported on cuda version
greater than 11.8. On ROCm (AMD GPU), FP8_E4M3 is instead supported for common inference criteria.
INFO:     Initializing the Aphrodite Engine (v0.5.3) with the following config:
INFO:     Model = '/home/owen/models/Llama-3-8B-Instruct-COT-v0.1'
INFO:     Speculative Config = None
INFO:     DataType = torch.bfloat16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = None
INFO:     Context Length = 8192
INFO:     Enforce Eager Mode = True
INFO:     KV Cache Data Type = fp8
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
INFO:     Guided Decoding Backend = DecodingConfig(guided_decoding_backend='outlines')
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING:  Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO:     Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better
performance.
INFO:     Using XFormers backend.
INFO:     Model weights loaded. Memory usage: 14.96 GiB x 1 = 14.96 GiB
INFO:     # GPU blocks: 3082, # CPU blocks: 4096
INFO:     Minimum concurrency: 6.02x
INFO:     Maximum sequence length allowed in the cache: 49312
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Using the default chat template
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Started server process [11788]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
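
A quick way to confirm whether the runtime environment actually sees the package (a minimal check; flash_attn is the import name the backend message above reports as missing):

# Run the import inside the same environment the server uses:
./runtime.sh python -c "import flash_attn; print(flash_attn.__version__)"

If this raises ModuleNotFoundError, the package was installed into a different environment than the one runtime.sh activates; if it raises ImportError with an undefined symbol, the installed build was compiled against a mismatched torch (see the maintainer's comment below).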
Nero10578 added the bug label May 12, 2024
@AlpinDale
Member

Looks like installing flash-attn with our torch version doesn't work:

ImportError: /home/anon/miniconda3/envs/aphrodite/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

I'll look into it. Thanks for reporting.
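
For context, the missing symbol demangles to a torch-internal function, which points to an ABI mismatch between the flash-attn binary and the installed torch rather than a missing package (a quick check, assuming c++filt from binutils is available):

# Demangle the symbol named in the traceback:
echo _ZN3c104cuda9SetDeviceEi | c++filt
# -> c10::cuda::SetDevice(int)
# i.e. the flash_attn_2_cuda extension references a c10/libtorch symbol that
# the torch build loaded at runtime does not export.

Rebuilding flash-attn against the exact torch in the environment is the usual fix (a rebuild sketch follows further down in the thread).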

@Ph0rk0z

Ph0rk0z commented May 14, 2024

I have flash attention installed and compiled from source against the new torch, but Aphrodite still says it isn't found. I'll double-check it.

I recompiled it again after deleting the build and dist directories (a sketch of that rebuild is below). Sadly it doesn't work on 3 GPUs, and a 5-bit 70B won't fit on 2 despite fitting in textgen.
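
For anyone retrying the from-source route, this is roughly the sequence (a sketch following the flash-attn README; --no-build-isolation makes pip compile against the torch already installed in the environment instead of a hermetic build env):

# runtime.sh is the wrapper already used in this thread to enter the env:
./runtime.sh pip uninstall -y flash-attn
./runtime.sh pip install flash-attn --no-build-isolation

Building this way ties the compiled extension to the exact torch in the runtime, which should avoid the undefined-symbol mismatch quoted above.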

@Nero10578
Author

> Looks like installing flash-attn with our torch version doesn't work:
>
> ImportError: /home/anon/miniconda3/envs/aphrodite/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
>
> I'll look into it. Thanks for reporting.

It seems to work in the new commit now.

@ortegaalfredo

ortegaalfredo commented May 15, 2024

I can use it and it works, but it's slightly slower: 9 tok/s with flash attention enabled vs. 11.5 tok/s with it disabled, running inference on Llama3-70B-8bpw across 4x3090 GPUs.

@Ph0rk0z

Ph0rk0z commented May 16, 2024

I thought vLLM supported a Triton-based FA for all (tensor-core) cards. I was hoping to try it here, but instead it used the normal FA package.

@Nero10578
Author

> > Looks like installing flash-attn with our torch version doesn't work:
> >
> > ImportError: /home/anon/miniconda3/envs/aphrodite/lib/python3.11/site-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
> >
> > I'll look into it. Thanks for reporting.
>
> It seems to work in the new commit now

It actually stopped working again when I tried to reinstall on the latest commit. Not sure why it worked that one time before.

@alexanderfrey

Same here.
