RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmEx(` #182

Open
cdsnow opened this issue Aug 25, 2023 · 2 comments

cdsnow commented Aug 25, 2023

Greetings!

Following the instructions, I've completed an installation and everything seemed to work, including generation of the MSA.
Specifically, I've done the recommended conda installation, the pip installation of triton, and the local download/unpack of the datasets.
Per my reading, the remainder of the instructions (e.g. Docker) seemed optional, so I jumped directly to trying inference.sh.

However, I'm hitting a repeatable Runtime CUDA error.
Since the same error occurs when I try the benchmark run, I'll paste the output for that at the bottom.
Keeping an eye on the VRAM, this does not seem to be a case of running out of memory on the GPU (an RTX 3090):
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |

(fastfold) csnow@icestorm:~/code/FastFold/benchmark$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
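
In case it helps to isolate things, here is a tiny standalone check I would expect to exercise the same bf16 GEMM path as the failing F.linear call (just a sketch; the shapes are made up and not taken from FastFold):

import torch
import torch.nn.functional as F

# small bfloat16 GEMM on the GPU, mirroring the F.linear call in fastnn/msa.py
z = torch.randn(4, 256, 128, device="cuda", dtype=torch.bfloat16)
w = torch.randn(64, 128, device="cuda", dtype=torch.bfloat16)
out = F.linear(z, w)  # dispatches to cublasGemmEx with CUDA_R_16BF inputs
print(out.shape, out.dtype)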

Any advice!?
Best wishes,
-Chris

(fastfold) csnow@icestorm:~/code/FastFold/benchmark$ torchrun --nproc_per_node=1 perf.py --msa-length 128 --res-length 256
[08/25/23 10:33:06] INFO colossalai - colossalai - INFO: /home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[08/25/23 10:33:07] INFO colossalai - colossalai - INFO: /home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/colossalai/initialize.py:116 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
Traceback (most recent call last):
File "perf.py", line 187, in
main()
File "perf.py", line 152, in main
layer_inputs = attn_layers[lyr_idx].forward(*layer_inputs, node_mask, pair_mask)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/evoformer.py", line 65, in forward
m = self.msa(m, z, msa_mask)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/msa.py", line 143, in forward
node = self.MSARowAttentionWithPairBias(node, pair, node_mask_row)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/fastfold-0.2.0-py3.8-linux-x86_64.egg/fastfold/model/fastnn/msa.py", line 63, in forward
b = F.linear(Z, self.linear_b_weights)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 429806) of binary: /home/csnow/anaconda3/envs/fastfold/bin/python
Traceback (most recent call last):
File "/home/csnow/anaconda3/envs/fastfold/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/csnow/anaconda3/envs/fastfold/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

perf.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-08-25_10:33:10
host : icestorm
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 429806)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html


addsg commented Aug 31, 2023

Hi, I ran into the same problem and eventually solved it.
I think you need to check whether your CUDA version matches this project.
The project pins torch 1.12.1, which means your CUDA version must be one of 10.2, 11.3, or 11.6.
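
For example, you can check which CUDA build your torch wheel actually carries (an illustrative check only; the conda command at the end is an assumption about how you installed):

import torch

print(torch.__version__)       # 1.12.1 in this project
print(torch.version.cuda)      # CUDA version the wheel was built against; expect 10.2, 11.3, or 11.6
print(torch.cuda.is_available())

# If this disagrees with the toolkit used to build the FastFold kernels,
# reinstall a matching pair, e.g. (conda, pytorch channel):
#   conda install pytorch==1.12.1 cudatoolkit=11.3 -c pytorch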

@bj600800

Find your cudatoolkit location with `which nvcc`.
Make sure you are calling cudatoolkit=11.3.
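
Something like this (a rough sketch) should show whether the nvcc on your PATH and your torch build agree:

import shutil
import subprocess
import torch

# nvcc picked up by your shell; it should live inside the fastfold conda env
print(shutil.which("nvcc"))
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)

# CUDA version torch was built against; it should match the toolkit above (e.g. 11.3)
print(torch.version.cuda)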
