RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmEx(` #182

Open
cdsnow opened this issue Aug 25, 2023 · 2 comments

cdsnow commented Aug 25, 2023

Greetings!

Following the instructions, I've completed an installation and everything seemed to work, including generation of the MSA.
Specifically, I've done the recommended conda installation, the pip installation of triton, and the local download/unpack of the datasets.
Per my reading, the remainder of the instructions (e.g. Docker) seemed optional, so I jumped directly to trying inference.sh.

However, I'm hitting a repeatable Runtime CUDA error.
Since the same error occurs when I try the benchmark run, I'll paste the output for that at the bottom.
Keeping an eye on the VRAM, this does not seem to be a case of running out of memory on the GPU (an RTX 3090):
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |

(fastfold) csnow@icestorm:~/code/FastFold/benchmark$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
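
In case it helps to isolate things, here is a tiny standalone check I would expect to exercise the same bf16 GEMM path as the failing F.linear call (just a sketch; the shapes are made up and not taken from FastFold):

import torch
import torch.nn.functional as F

# small bfloat16 GEMM on the GPU, mirroring the F.linear call in fastnn/msa.py
z = torch.randn(4, 256, 128, device="cuda", dtype=torch.bfloat16)
w = torch.randn(64, 128, device="cuda", dtype=torch.bfloat16)
out = F.linear(z, w)  # dispatches to cublasGemmEx with CUDA_R_16BF inputs
print(out.shape, out.dtype)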

Any advice!?
Best wishes,
-Chris

(fastfold) csnow@icestorm:~/code/FastFold/benchmark$ torchrun --nproc_per_node=1 perf.py --msa-length 128 --res-length 256
[08/25/23 10:33:06] INFO colossalai - colossalai - INFO: /home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[08/25/23 10:33:07] INFO colossalai - colossalai - INFO: /home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/initialize.py:116 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
Traceback (most recent call last):
File "perf.py", line 187, in
main()
File "perf.py", line 152, in main
layer_inputs = attn_layers[lyr_idx].forward(*layer_inputs, node_mask, pair_mask)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/evoformer.py", line 65, in forward
m = self.msa(m, z, msa_mask)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/msa.py", line 143, in forward
node = self.MSARowAttentionWithPairBias(node, pair, node_mask_row)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/msa.py", line 63, in forward
b = F.linear(Z, self.linear_b_weights)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 429806) of binary: /home/csnow/anaconda3/envs/fastfold/bin/python
Traceback (most recent call last):
File "/home/csnow/anaconda3/envs/fastfold/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

perf.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-08-25_10:33:10
host : icestorm
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 429806)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html


addsg commented Aug 31, 2023

Hi, I ran into the same problem and eventually solved it.
I think you need to check whether your CUDA version matches this project.
The project pins torch 1.12.1, which means your CUDA version must be one of 10.2, 11.3, or 11.6.
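
For example, you can check which CUDA build your torch wheel actually carries (an illustrative check only; the conda command at the end is an assumption about how you installed):

import torch

print(torch.__version__)       # 1.12.1 in this project
print(torch.version.cuda)      # CUDA version the wheel was built against; expect 10.2, 11.3, or 11.6
print(torch.cuda.is_available())

# If this disagrees with the toolkit used to build the FastFold kernels,
# reinstall a matching pair, e.g. (conda, pytorch channel):
#   conda install pytorch==1.12.1 cudatoolkit=11.3 -c pytorch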

@bj600800

Find your cudatoolkit location with `which nvcc`.
Make sure you are calling cudatoolkit=11.3.
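
Something like this (a rough sketch) should show whether the nvcc on your PATH and your torch build agree:

import shutil
import subprocess
import torch

# nvcc picked up by your shell; it should live inside the fastfold conda env
print(shutil.which("nvcc"))
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)

# CUDA version torch was built against; it should match the toolkit above (e.g. 11.3)
print(torch.version.cuda)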
