PyTorch 1.3: random "RuntimeError: CUDA error: unspecified launch failure" #27837

Closed
alexeygolyshev opened this issue Oct 14, 2019 · 87 comments
Labels: high priority, module: autograd, module: cuda, module: windows, needs reproduction, triaged

Comments

alexeygolyshev commented Oct 14, 2019

🐛 Bug

No problem in PyTorch 1.2. Archive with code and data: https://github.com/pytorch/pytorch/files/3723821/PyTorch.zip

Windows 10 (1903), Python 3.7.4, RTX 2060 (driver version 436.48)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-6-68308ed1e055> in <module>
     35         cum_loss.append(loss.item())
     36 
---> 37         loss.backward()
     38         optimizer.step()
     39 

C:\Anaconda3\envs\torch13\lib\site-packages\torch\tensor.py in backward(self, gradient, retain_graph, create_graph)
    148                 products. Defaults to ``False``.
    149         """
--> 150         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    151 
    152     def register_hook(self, hook):

C:\Anaconda3\envs\torch13\lib\site-packages\torch\autograd\__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     97     Variable._execution_engine.run_backward(
     98         tensors, grad_tensors, retain_graph, create_graph,
---> 99         allow_unreachable=True)  # allow_unreachable flag
    100 
    101 

RuntimeError: CUDA error: unspecified launch failure

cc @ezyang @gchanan @zou3519 @ssnl @albanD @gqchen @ngimel @peterjc123

vincentqb added the module: autograd and module: cuda labels on Oct 14, 2019
vincentqb (Contributor) commented

Can you provide a minimal code example to reproduce? Please also copy and paste the output from our environment collection script. You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
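
If wget is not available (it often isn't on Windows), the same file can be fetched with a couple of lines of Python instead; this is just a convenience sketch, not part of the official instructions:

    import urllib.request

    # Download the environment collection script into the current working directory.
    urllib.request.urlretrieve(
        "https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py",
        "collect_env.py",
    )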

vincentqb added the triaged label on Oct 14, 2019
alexeygolyshev (Author) commented

Hello @vincentqb,

Code example: https://github.com/pytorch/pytorch/files/3723821/PyTorch.zip

Output from the environment collection script:

Collecting environment information...
PyTorch version: 1.3.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Microsoft Windows 10 Enterprise
GCC version: Could not collect
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin\cudnn64_7.dll

Versions of relevant libraries:
[pip] numpy==1.15.4
[pip] torch==1.3.0
[pip] torchvision==0.4.1
[conda] blas                      1.0                         mkl
[conda] libblas                   3.8.0                    13_mkl    conda-forge
[conda] libcblas                  3.8.0                    13_mkl    conda-forge
[conda] liblapack                 3.8.0                    13_mkl    conda-forge
[conda] mkl                       2019.4                      245
[conda] mkl-service               2.3.0            py37hb782905_0
[conda] pytorch                   1.3.0           py3.7_cuda101_cudnn7_0    pytorch
[conda] torchvision               0.4.1                py37_cu101    pytorch

nvidia-smi:

Mon Oct 14 21:05:01 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 436.48       Driver Version: 436.48       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   63C    P2    28W /  N/A |   1103MiB /  6144MiB |     13%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      6156      C   C:\Anaconda3\envs\torch12\python.exe       N/A      |
+-----------------------------------------------------------------------------+

albanD (Collaborator) commented Oct 14, 2019

@alexeygolyshev is that a minimal example? It looks like there is a lot of code in there.
If you could reduce the size of the code, it would really help with finding the root cause, thanks!

alexeygolyshev (Author) commented

Hello @albanD,
Yes, this is a minimal example. I don't think I can greatly reduce the code. I have already deleted the data preprocessing.

alexeygolyshev (Author) commented

@albanD My inputs: [sentences, words, characters]. I have 2 varying dimensions: a different number of words in each sentence and a different number of characters in each word.

albanD (Collaborator) commented Oct 14, 2019

Unfortunately I don't have a setup with a notebook available. Could you run your code with anomaly mode enabled and post the extended stack trace here?
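
For reference, enabling anomaly mode around a training loop looks roughly like the sketch below. The tiny LSTM, the sizes, and the random data are placeholders standing in for the attached notebook; the point is only where the switch goes and what it buys you:

    import torch
    import torch.nn as nn

    # Anomaly mode records the forward-pass stack trace of the op whose backward
    # later fails, at the cost of noticeably slower training.
    torch.autograd.set_detect_anomaly(True)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True).to(device)
    head = nn.Linear(16, 1).to(device)
    optimizer = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()))

    for step in range(100):
        x = torch.randn(4, 10, 8, device=device)   # (batch, seq, features), random stand-in data
        y = torch.randn(4, 1, device=device)
        out, _ = lstm(x)
        loss = nn.functional.mse_loss(head(out[:, -1]), y)
        optimizer.zero_grad()
        loss.backward()   # if this raises, anomaly mode also prints the offending forward op
        optimizer.step()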

Huer-H commented Dec 30, 2019

(Quotes the original bug report and traceback from the top of this issue.)

Hello, my computer setup is similar to yours (Windows 10 (1903), Python 3.7.4, RTX 2060 (driver version 441.20), torch.__version__ == 1.2.0), and I ran into the same problem. Have you solved it yet?

You say there was no problem in PyTorch 1.2. Can you share the full configuration of that setup (CUDA, cuDNN, and Python versions)?

alexeygolyshev (Author) commented Dec 30, 2019

Hello @JYH9351,
I am currently using PyTorch 1.3.0 in production. I don't know why, but this helps:

with t.autograd.set_detect_anomaly(False):
    for epoch in range(epochs):
        ...

It crashes less frequently, and not in the first 2 epochs.

peterjc123 (Collaborator) commented

Does switching off the TDR setting help? https://zhuanlan.zhihu.com/p/38141415 (article in Chinese)
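
For readers who cannot access the linked article: TDR is the Windows GPU watchdog, controlled by registry values under HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers (TdrDelay, TdrDdiDelay, TdrLevel). A rough sketch of raising the timeout to 60 seconds with Python's winreg follows; it needs an elevated prompt and a reboot to take effect, and the value 60 is only an example:

    import winreg

    # Raise the GPU watchdog (TDR) timeout to 60 seconds. Run from an elevated
    # (administrator) Python prompt and reboot afterwards for the change to apply.
    key_path = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path, 0, winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "TdrDelay", 0, winreg.REG_DWORD, 60)     # timeout in seconds
        winreg.SetValueEx(key, "TdrDdiDelay", 0, winreg.REG_DWORD, 60)  # timeout for DDI calls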

alexeygolyshev (Author) commented

No. TDR = 60. I ran it 2 times; it crashed in epochs 2 and 11. This error appears randomly.
with t.autograd.set_detect_anomaly(True) increases the time per epoch by 5x. In October I waited several hours, but there was no error, so there is no extended stack trace.
Sometimes with t.autograd.set_detect_anomaly(False) seems to increase the time between errors, but I am not sure. In October I trained several networks with 2-day uptimes, but in later experiments it also crashed randomly.

# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    28    62     -    12     0     0     0  6801   960
    0    30    62     -    11     0     0     0  6801  1155
    0    32    63     -    20     7     0     0  6801  1155
    0    32    62     -    13     1     0     0  6801   960
    0    28    62     -    15     1     0     0  6801   960
    0    28    62     -    16     1     0     0  6801   960
    0    28    62     -    15     1     0     0  6801   960
    0    28    63     -    14     1     0     0  6801   960
    0    27    63     -    13     0     0     0  6801   960
    0    28    62     -    11     3     0     0  6801   960
    0    28    62     -     0     0     0     0  6801   960
    0    12    62     -     0     0     0     0   810   345
    0     5    61     -     0     0     0     0   405   345

peterjc123 (Collaborator) commented

I have to say that it is difficult to tell where the problem is without a stack trace that includes the exact crash site. But we may get one with the help of a RelWithDebInfo build and the VS debugger attached. I could build one for you if you have trouble building the project.

alexeygolyshev (Author) commented

It would be great if you could prepare the debug build. I don't have much experience with building PyTorch.

peterjc123 (Collaborator) commented

Interesting.

hendrycks commented Jan 2, 2020

I had this issue training a model from https://github.com/wgrathwohl/JEM with PyTorch 1.3. I used this command:

python train_wrn_ebm.py --lr .0001 --dataset cifar10 --optimizer adam --p_x_weight 1.0 --p_y_given_x_weight 1.0 --p_x_y_weight 0.0 --sigma .03 --width 2 --depth 40 --save_dir ./experiments --plot_uncond --warmup_iters 1000

The error happened seemingly at random in the middle of training. I am using Linux Mint, not Windows.

kice commented Jan 20, 2020

I suggest that you try again after uninstalling the GPU driver with DDU and installing the driver that comes with the CUDA toolkit.

There are too many bugs in the Nvidia GPU driver on Windows 10.

dalupus commented Jan 20, 2020

I have run into this same issue and tried @kice's suggestion of installing the driver from the CUDA toolkit, with no luck.

Yourivdzee commented

I am running into similar issues on my Windows machine. I have a simple pipeline for binary classification with an LSTM, and it shuts down at seemingly random epochs.

dalupus commented Jan 20, 2020

My issue is also with an LSTM. Interestingly, when I add torch.autograd.set_detect_anomaly(True) to get a stack trace, training takes about 20% longer but doesn't fail. I will run a few more times to see if that is consistently true.

shingyipcheung commented Jan 21, 2020

Same problem: LSTM + binary classification, with the error at a random epoch, on Windows 10 + PyTorch 1.4.

File "C:/Users/User/GoogleDrive/mad2-recommend/gnn/train.py", line 108, in main train_model(train_loader, predict_score_net, optimizer) File "C:/Users/User/GoogleDrive/mad2-recommend/gnn/train.py", line 41, in train_model loss.backward() File "C:\Users\User\Anaconda3\lib\site-packages\torch\tensor.py", line 195, in backward torch.autograd.backward(self, gradient, retain_graph, create_graph) File "C:\Users\User\Anaconda3\lib\site-packages\torch\autograd\__init__.py", line 99, in backward allow_unreachable=True) # allow_unreachable flag RuntimeError: CUDA error: unspecified launch failure

Update:
GRU has the same problem.

dalupus commented Jan 24, 2020

@shingyipcheung Are you able to replicate the error with torch.autograd.set_detect_anomaly(True) set in order to get a full stacktrace?

albanD added the module: windows and needs reproduction labels on Feb 7, 2020
rothn commented Feb 22, 2020

I'm having this issue as well (EDIT: on the latest 1.4). The network trains for a while, then at some random point the classifier halts with this exception.

It is possible to reproduce by using FastAI AWD-LSTM transfer learning for text classification on a very large dataset: https://docs.fast.ai/text.html

After this happens, further CUDA operations result in the same error until the kernel is restarted.

I suspect a lot of this simply does not get tested on Windows. Professionally, I always use Linux for machine learning tasks. It just so happens that my only personal system with a GPU runs Windows and does not have space for a Linux install. Furthermore, "Ubuntu on Windows" does not support CUDA.

MaverickDai commented

I have the same issue when I train an LSTM for classification; the error occurs at a random epoch on Windows 10 + PyTorch 1.4.
Waiting for a solution.

hadypranoto commented

Exception has occurred: RuntimeError
CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc) (gemm at ..\aten\src\ATen\cuda\CUDABlas.cpp:165)
(no backtrace available)

ReinforcedMan commented

Same error on Windows, training an LSTM on an RTX 2080 Ti. Happens with both PyTorch 1.5 and 1.6.
Very annoying, as it seems random and the training is completely broken when it happens.

Jgoldfeder commented

Switching from 1.6 to 1.5 and downgrading my Nvidia driver to 431.86 fixed the error for me.

lucas-emery commented

Same error while training an LSTM with a big batch size on Windows; I was getting random crashes after 1 to 20 epochs. Setting torch.backends.cudnn.enabled = False fixed the issue.

Pytorch 1.5.1
Cuda 10.2.89
CuDNN 7.6.5
GTX 1070 - MSI Gaming X - Driver 445.75
Windows 10 Pro 1909 build 18363.1016
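
For anyone trying the same workaround, the switch is global; it can also be scoped to just the RNN call with the torch.backends.cudnn.flags context manager. A minimal sketch follows; the LSTM sizes and the batch are made up for illustration:

    import torch
    import torch.nn as nn

    # Global switch: skip cuDNN kernels and fall back to PyTorch's native CUDA kernels.
    torch.backends.cudnn.enabled = False

    lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True).cuda()
    x = torch.randn(256, 32, 32, device="cuda")   # (batch, seq, features), made-up sizes
    out, _ = lstm(x)

    # Alternatively, limit the fallback to a single region instead of the whole program:
    with torch.backends.cudnn.flags(enabled=False):
        out, _ = lstm(x)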

mszhanyi (Collaborator) commented Sep 9, 2020

lucas-emery commented

@mszhanyi I did try extending the TDR to 60 seconds. I was able to run a 13-hour training session after setting the TDR and restarting my PC, but the backprop time was also faster (down from 1 minute to 10-15 seconds), so I guess it was just a coincidence and cuDNN chose a different algorithm.
After that I stopped the training to update a function, and when I tried to resume I couldn't get past 20 epochs without a crash, sometimes "illegal memory" and sometimes "launch failure"; the backprop time went back up to 1 minute. I reverted my changes and tried to train a new model from scratch, but it crashed between 1 and 20 epochs with the same errors. After setting torch.backends.cudnn.enabled = False, with no code changes and no reboot, it stopped crashing and the backprop time went down to 20 seconds. That training session lasted 12 hours with no errors.
I did two more 4-hour sessions without problems today.

mszhanyi (Collaborator) commented Sep 9, 2020

@lucas-emery, could you provide a simplified script with which I could reproduce it?

lucas-emery commented Sep 9, 2020

@mszhanyi I'm afraid that won't be possible; it's a very complex model on a reinforcement learning task. I'll let you know if I find anything else, and I'll try to get something reproducible after I finish.
The error started appearing after I increased my batch size to 1k with an unroll length of 32.

serg06 commented Oct 30, 2020

I'm getting this issue on my RTX 3080, and I can't even downgrade PyTorch because older versions don't support RTX 3000-series cards.

These two fixes worked for me, but both have a performance penalty:

  • os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
  • torch.backends.cudnn.enabled = False
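
A note on the first item: the CUDA runtime only reads CUDA_LAUNCH_BLOCKING when its context is created, so it has to be set before the first CUDA call, most safely before importing torch. A minimal sketch of combining both mitigations (the tensor math is just an illustration):

    import os

    # Must be set before any CUDA initialization; CUDA reads it once, at context creation.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch

    torch.backends.cudnn.enabled = False   # second mitigation, disables cuDNN kernels

    x = torch.randn(8, 8, device="cuda")
    y = x @ x   # with launch blocking, a failing kernel raises here at the real call site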

moe001 commented Nov 4, 2020

Same issue on an RTX 3090 + Windows 10 + CUDA 11 + PyTorch (stable and nightly).

These fixes worked for me, too:

  • $env:CUDA_LAUNCH_BLOCKING=1 increases the training time by 500%.
  • torch.backends.cudnn.enabled = False increases the training time by 20%.

ysx001 commented Nov 15, 2020

We are facing the same issue. Tried on Ubuntu 18.04 with Nvidia K80, M60, and V100 GPUs, all with the same PyTorch version 1.6.0 and CUDA 11.

Applying the fixes below doesn't help either... :(

    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
    torch.backends.cudnn.enabled = False

mega-optimus (Contributor) commented

Facing the same error on a 2080 Ti (sm_75) + Windows 10 + CUDA 11.1, not using PyTorch.
The same code runs without problems when compiled for and run on a 1080 Ti (sm_61).
I'm sure it's not caused by running out of global or shared memory, since I reduced the code to a case where memory usage is tiny.

jugol commented Mar 22, 2021

Same issue with a GRU + PyTorch 1.8 + single thread + CUDA 11.1 + Windows 10 + RTX 3090.

alexeygolyshev (Author) commented

Fixed in PyTorch 1.9.0 (Windows 10, CUDA 10.2, RTX 2060)

JeanKaddour commented Jan 9, 2022

I recently ran into these issues too (with PyTorch 1.10.1), even with vastly different scripts (e.g. training a GNN vs. only performing inference on a ViT model). The below did not help.

 os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
 torch.backends.cudnn.enabled = False

The internal status error (when running with CUDA_LAUNCH_BLOCKING = 1) wasn't informative to me either.
(screenshot: vit_bug)
The only pattern I noticed is that it happens when I run multiple scripts on different GPU devices on the same machine. For example, the error above occurred when I ran two scripts simultaneously (on cuda:0 and cuda:2, respectively, as verified by nvidia-smi) on a machine with 4x 3090s. I suspect some CPU/RAM issue is going on, but I haven't dug deeper into it.
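
One way to rule out cross-process interference in a setup like this is to pin each script to its own card with CUDA_VISIBLE_DEVICES before any CUDA work, so the processes cannot see each other's GPUs at all. A sketch, assuming a multi-GPU machine; the device index is illustrative:

    import os

    # Pin this process to one physical GPU; must run before CUDA is initialized.
    # Inside the process the selected card then appears as cuda:0.
    os.environ["CUDA_VISIBLE_DEVICES"] = "2"   # illustrative index on a 4-GPU machine

    import torch

    print(torch.cuda.device_count())           # 1
    x = torch.randn(4, 4, device="cuda")       # lands on physical GPU 2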

akashsharma02 commented

@JeanKaddour Were you able to get to the bottom of it? I seem to be observing similar issues with PyTorch 1.10.1 as well.

Saafke commented Jan 29, 2022

I also have this issue. PyTorch 1.9.0 + CUDA 10.2 + Python 3.7 + Ubuntu 18.04.

prabhatkumar95 commented Mar 17, 2022

I have the same issue on PyTorch 1.10/1.11/1.12 (source build) + Ubuntu 20.04 + Python 3.8/3.9 + CUDA 11.2/11.6.

A6000 / RTX 3090 GPU
AMD Threadripper Pro 3975WX

@akashsharma02 did you find any solution?
@alexeygolyshev the issue should probably be reopened.

alexeygolyshev (Author) commented

Hello @prabhatkumar95,

I can't reproduce the error from my first post in this thread. And the speed is good (on the same hardware): 6 seconds per epoch now vs 24 seconds 2 years ago.

Windows 10, Python 3.10.0, PyTorch 1.11, CUDA 11.3.1, RTX 2060

But I have reopened the issue at your request.

prabhatkumar95 commented Mar 17, 2022

Hi @alexeygolyshev, thanks. My current guess is that this issue is CPU-dependent: Intel CPUs run normally, but AMD ones hit the issue, tried with both an RTX 3090 and an A6000.
As the error is the same, I wanted to keep everything in the same thread. The issue is reported here.

ngimel (Collaborator) commented Mar 17, 2022

Closing, as there is a new issue tracking this.
