PyTorch 1.3: random "RuntimeError: CUDA error: unspecified launch failure" #27837

Closed
alexeygolyshev opened this issue Oct 14, 2019 · 87 comments
Labels: high priority, module: autograd, module: cuda, module: windows, needs reproduction, triaged

Comments

alexeygolyshev commented Oct 14, 2019

🐛 Bug

No problem in PyTorch 1.2. Archive with code and data: https://github.com/pytorch/pytorch/files/3723821/PyTorch.zip

Windows 10 (1903), Python 3.7.4, RTX 2060 (driver version 436.48)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-6-68308ed1e055> in <module>
     35         cum_loss.append(loss.item())
     36 
---> 37         loss.backward()
     38         optimizer.step()
     39 

C:\Anaconda3\envs\torch13\lib\site-packages\torch\tensor.py in backward(self, gradient, retain_graph, create_graph)
    148                 products. Defaults to ``False``.
    149         """
--> 150         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    151 
    152     def register_hook(self, hook):

C:\Anaconda3\envs\torch13\lib\site-packages\torch\autograd\__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     97     Variable._execution_engine.run_backward(
     98         tensors, grad_tensors, retain_graph, create_graph,
---> 99         allow_unreachable=True)  # allow_unreachable flag
    100 
    101 

RuntimeError: CUDA error: unspecified launch failure

cc @ezyang @gchanan @zou3519 @ssnl @albanD @gqchen @ngimel @peterjc123

vincentqb added the module: autograd and module: cuda labels on Oct 14, 2019
vincentqb (Contributor) commented

Can you provide a minimal code example to reproduce? Please also copy and paste the output from our environment collection script. You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
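
If wget is not available (it often isn't on Windows), the same file can be fetched with a couple of lines of Python instead; this is just a convenience sketch, not part of the official instructions:

    import urllib.request

    # Download the environment collection script into the current working directory.
    urllib.request.urlretrieve(
        "https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py",
        "collect_env.py",
    )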

vincentqb added the triaged label on Oct 14, 2019
alexeygolyshev (Author) commented

Hello @vincentqb,

Code example: https://github.com/pytorch/pytorch/files/3723821/PyTorch.zip

Output from the environment collection script:

Collecting environment information...
PyTorch version: 1.3.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Microsoft Windows 10 Enterprise
GCC version: Could not collect
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin\cudnn64_7.dll

Versions of relevant libraries:
[pip] numpy==1.15.4
[pip] torch==1.3.0
[pip] torchvision==0.4.1
[conda] blas                      1.0                         mkl
[conda] libblas                   3.8.0                    13_mkl    conda-forge
[conda] libcblas                  3.8.0                    13_mkl    conda-forge
[conda] liblapack                 3.8.0                    13_mkl    conda-forge
[conda] mkl                       2019.4                      245
[conda] mkl-service               2.3.0            py37hb782905_0
[conda] pytorch                   1.3.0           py3.7_cuda101_cudnn7_0    pytorch
[conda] torchvision               0.4.1                py37_cu101    pytorch

nvidia-smi:

Mon Oct 14 21:05:01 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 436.48       Driver Version: 436.48       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   63C    P2    28W /  N/A |   1103MiB /  6144MiB |     13%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      6156      C   C:\Anaconda3\envs\torch12\python.exe       N/A      |
+-----------------------------------------------------------------------------+

albanD (Collaborator) commented Oct 14, 2019

@alexeygolyshev is that a minimal example? It looks like there is a lot of code in there.
If you could reduce the size of the code, it would really help with finding the root cause, thanks!

alexeygolyshev (Author) commented

Hello @albanD,
Yes, this is a minimal example. I don't think I can greatly reduce the code. I have already deleted the data preprocessing.

alexeygolyshev (Author) commented

@albanD My inputs: [sentences, words, characters]. I have 2 varying dimensions: a different number of words in each sentence and a different number of characters in each word.

albanD (Collaborator) commented Oct 14, 2019

Unfortunately I don't have a setup with a notebook available. Could you run your code with anomaly mode enabled and post the extended stack trace here?
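
For reference, enabling anomaly mode around a training loop looks roughly like the sketch below. The tiny LSTM, the sizes, and the random data are placeholders standing in for the attached notebook; the point is only where the switch goes and what it buys you:

    import torch
    import torch.nn as nn

    # Anomaly mode records the forward-pass stack trace of the op whose backward
    # later fails, at the cost of noticeably slower training.
    torch.autograd.set_detect_anomaly(True)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True).to(device)
    head = nn.Linear(16, 1).to(device)
    optimizer = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()))

    for step in range(100):
        x = torch.randn(4, 10, 8, device=device)   # (batch, seq, features), random stand-in data
        y = torch.randn(4, 1, device=device)
        out, _ = lstm(x)
        loss = nn.functional.mse_loss(head(out[:, -1]), y)
        optimizer.zero_grad()
        loss.backward()   # if this raises, anomaly mode also prints the offending forward op
        optimizer.step()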

Huer-H commented Dec 30, 2019

(Quotes the original bug report and traceback from the top of this issue.)

Hello, my computer setup is similar to yours (Windows 10 (1903), Python 3.7.4, RTX 2060 (driver version 441.20), torch.__version__ == 1.2.0), and I ran into the same problem. Have you solved it yet?

You say there was no problem in PyTorch 1.2. Can you share the full configuration of that setup (CUDA, cuDNN, and Python versions)?

alexeygolyshev (Author) commented Dec 30, 2019

Hello @JYH9351,
I am currently using PyTorch 1.3.0 in production. I don't know why, but this helps:

with t.autograd.set_detect_anomaly(False):
    for epoch in range(epochs):
        ...

It crashes less frequently, and not in the first 2 epochs.

peterjc123 (Collaborator) commented

Does switching off the TDR setting help? https://zhuanlan.zhihu.com/p/38141415 (article in Chinese)
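
For readers who cannot access the linked article: TDR is the Windows GPU watchdog, controlled by registry values under HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers (TdrDelay, TdrDdiDelay, TdrLevel). A rough sketch of raising the timeout to 60 seconds with Python's winreg follows; it needs an elevated prompt and a reboot to take effect, and the value 60 is only an example:

    import winreg

    # Raise the GPU watchdog (TDR) timeout to 60 seconds. Run from an elevated
    # (administrator) Python prompt and reboot afterwards for the change to apply.
    key_path = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path, 0, winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "TdrDelay", 0, winreg.REG_DWORD, 60)     # timeout in seconds
        winreg.SetValueEx(key, "TdrDdiDelay", 0, winreg.REG_DWORD, 60)  # timeout for DDI calls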

alexeygolyshev (Author) commented

No. TDR = 60. I ran it 2 times; it crashed in epochs 2 and 11. This error appears randomly.
with t.autograd.set_detect_anomaly(True) increases the time per epoch by 5x. In October I waited several hours, but there was no error, so there is no extended stack trace.
Sometimes with t.autograd.set_detect_anomaly(False) seems to increase the time between errors, but I am not sure. In October I trained several networks with 2-day uptimes, but in later experiments it also crashed randomly.

# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    28    62     -    12     0     0     0  6801   960
    0    30    62     -    11     0     0     0  6801  1155
    0    32    63     -    20     7     0     0  6801  1155
    0    32    62     -    13     1     0     0  6801   960
    0    28    62     -    15     1     0     0  6801   960
    0    28    62     -    16     1     0     0  6801   960
    0    28    62     -    15     1     0     0  6801   960
    0    28    63     -    14     1     0     0  6801   960
    0    27    63     -    13     0     0     0  6801   960
    0    28    62     -    11     3     0     0  6801   960
    0    28    62     -     0     0     0     0  6801   960
    0    12    62     -     0     0     0     0   810   345
    0     5    61     -     0     0     0     0   405   345

peterjc123 (Collaborator) commented

I have to say that it is difficult to tell where the problem is without a stack trace that includes the exact crash site. But we may get one with the help of a RelWithDebInfo build and the VS debugger attached. I could build one for you if you have trouble building the project.

alexeygolyshev (Author) commented

It would be great if you could prepare the debug build. I don't have much experience with building PyTorch.

peterjc123 (Collaborator) commented

Interesting.

hendrycks commented Jan 2, 2020

I had this issue training a model from https://github.com/wgrathwohl/JEM with PyTorch 1.3. I used this command:

python train_wrn_ebm.py --lr .0001 --dataset cifar10 --optimizer adam --p_x_weight 1.0 --p_y_given_x_weight 1.0 --p_x_y_weight 0.0 --sigma .03 --width 2 --depth 40 --save_dir ./experiments --plot_uncond --warmup_iters 1000

The error happened seemingly at random in the middle of training. I am using Linux Mint, not Windows.

kice commented Jan 20, 2020

I suggest that you try again after uninstalling the GPU driver with DDU and installing the driver that comes with the CUDA toolkit.

There are too many bugs in the Nvidia GPU driver on Windows 10.

dalupus commented Jan 20, 2020

I have run into this same issue and tried @kice's suggestion of installing the driver from the CUDA toolkit, with no luck.

Yourivdzee commented

I am running into similar issues on my Windows machine. I have a simple pipeline for binary classification with an LSTM, and it shuts down at seemingly random epochs.

dalupus commented Jan 20, 2020

My issue is also with an LSTM. Interestingly, when I add torch.autograd.set_detect_anomaly(True) to get a stack trace, training takes about 20% longer but doesn't fail. I will run a few more times to see if that is consistently true.

shingyipcheung commented Jan 21, 2020

Same problem: LSTM + binary classification, with the error at a random epoch, on Windows 10 + PyTorch 1.4.

File "C:/Users/User/GoogleDrive/mad2-recommend/gnn/train.py", line 108, in main train_model(train_loader, predict_score_net, optimizer) File "C:/Users/User/GoogleDrive/mad2-recommend/gnn/train.py", line 41, in train_model loss.backward() File "C:\Users\User\Anaconda3\lib\site-packages\torch\tensor.py", line 195, in backward torch.autograd.backward(self, gradient, retain_graph, create_graph) File "C:\Users\User\Anaconda3\lib\site-packages\torch\autograd\__init__.py", line 99, in backward allow_unreachable=True) # allow_unreachable flag RuntimeError: CUDA error: unspecified launch failure

Update:
GRU has the same problem.

dalupus commented Jan 24, 2020

@shingyipcheung Are you able to replicate the error with torch.autograd.set_detect_anomaly(True) set in order to get a full stacktrace?

albanD added the module: windows and needs reproduction labels on Feb 7, 2020
rothn commented Feb 22, 2020

I'm having this issue as well (EDIT: on the latest 1.4). The network trains for a while, then at some random point the classifier halts with this exception.

It is possible to reproduce by using FastAI AWD-LSTM transfer learning for text classification on a very large dataset: https://docs.fast.ai/text.html

After this happens, further CUDA operations result in the same error until the kernel is restarted.

I suspect a lot of this simply does not get tested on Windows. Professionally, I always use Linux for machine learning tasks. It just so happens that my only personal system with a GPU runs Windows and does not have space for a Linux install. Furthermore, "Ubuntu on Windows" does not support CUDA.

MaverickDai commented

I have the same issue when I train an LSTM for classification; the error occurs at a random epoch on Windows 10 + PyTorch 1.4.
Waiting for a solution.

hadypranoto commented

Exception has occurred: RuntimeError
CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc) (gemm at ..\aten\src\ATen\cuda\CUDABlas.cpp:165)
(no backtrace available)

ReinforcedMan commented

Same error on Windows, training an LSTM on an RTX 2080 Ti. Happens with both PyTorch 1.5 and 1.6.
Very annoying, as it seems random and the training is completely broken when it happens.

Jgoldfeder commented

Switching from 1.6 to 1.5 and downgrading my Nvidia driver to 431.86 fixed the error for me.

lucas-emery commented

Same error while training an LSTM with a big batch size on Windows; I was getting random crashes after 1 to 20 epochs. Setting torch.backends.cudnn.enabled = False fixed the issue.

Pytorch 1.5.1
Cuda 10.2.89
CuDNN 7.6.5
GTX 1070 - MSI Gaming X - Driver 445.75
Windows 10 Pro 1909 build 18363.1016
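
For anyone trying the same workaround, the switch is global; it can also be scoped to just the RNN call with the torch.backends.cudnn.flags context manager. A minimal sketch follows; the LSTM sizes and the batch are made up for illustration:

    import torch
    import torch.nn as nn

    # Global switch: skip cuDNN kernels and fall back to PyTorch's native CUDA kernels.
    torch.backends.cudnn.enabled = False

    lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True).cuda()
    x = torch.randn(256, 32, 32, device="cuda")   # (batch, seq, features), made-up sizes
    out, _ = lstm(x)

    # Alternatively, limit the fallback to a single region instead of the whole program:
    with torch.backends.cudnn.flags(enabled=False):
        out, _ = lstm(x)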

mszhanyi (Collaborator) commented Sep 9, 2020

lucas-emery commented

@mszhanyi I did try extending the TDR to 60 seconds. I was able to run a 13-hour training session after setting the TDR and restarting my PC, but the backprop time was also faster (down from 1 minute to 10-15 seconds), so I guess it was just a coincidence and cuDNN chose a different algorithm.
After that I stopped the training to update a function, and when I tried to resume I couldn't get past 20 epochs without a crash, sometimes "illegal memory" and sometimes "launch failure"; the backprop time went back up to 1 minute. I reverted my changes and tried to train a new model from scratch, but it crashed between 1 and 20 epochs with the same errors. After setting torch.backends.cudnn.enabled = False, with no code changes and no reboot, it stopped crashing and the backprop time went down to 20 seconds. That training session lasted 12 hours with no errors.
I did two more 4-hour sessions without problems today.

mszhanyi (Collaborator) commented Sep 9, 2020

@lucas-emery, could you provide a simplified script with which I could reproduce it?

lucas-emery commented Sep 9, 2020

@mszhanyi I'm afraid that won't be possible; it's a very complex model on a reinforcement learning task. I'll let you know if I find anything else, and I'll try to get something reproducible after I finish.
The error started appearing after I increased my batch size to 1k with an unroll length of 32.

serg06 commented Oct 30, 2020

I'm getting this issue on my RTX 3080, and I can't even downgrade PyTorch because older versions don't support RTX 3000-series cards.

These two fixes worked for me, but both have a performance penalty:

  • os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
  • torch.backends.cudnn.enabled = False
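
A note on the first item: the CUDA runtime only reads CUDA_LAUNCH_BLOCKING when its context is created, so it has to be set before the first CUDA call, most safely before importing torch. A minimal sketch of combining both mitigations (the tensor math is just an illustration):

    import os

    # Must be set before any CUDA initialization; CUDA reads it once, at context creation.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch

    torch.backends.cudnn.enabled = False   # second mitigation, disables cuDNN kernels

    x = torch.randn(8, 8, device="cuda")
    y = x @ x   # with launch blocking, a failing kernel raises here at the real call site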

moe001 commented Nov 4, 2020

Same issue on an RTX 3090 + Windows 10 + CUDA 11 + PyTorch (stable and nightly).

These fixes worked for me, too:

  • $env:CUDA_LAUNCH_BLOCKING=1 increases the training time by 500%.
  • torch.backends.cudnn.enabled = False increases the training time by 20%.

ysx001 commented Nov 15, 2020

We are facing the same issue. Tried on Ubuntu 18.04 with Nvidia K80, M60, and V100 GPUs, all with the same PyTorch version 1.6.0 and CUDA 11.

Applying the fixes below doesn't help either... :(

    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
    torch.backends.cudnn.enabled = False

mega-optimus (Contributor) commented

Facing the same error on a 2080 Ti (sm_75) + Windows 10 + CUDA 11.1, not using PyTorch.
The same code runs without problems when compiled for and run on a 1080 Ti (sm_61).
I'm sure it's not caused by running out of global or shared memory, since I reduced the code to a case where memory usage is tiny.

jugol commented Mar 22, 2021

Same issue with a GRU + PyTorch 1.8 + single thread + CUDA 11.1 + Windows 10 + RTX 3090.

alexeygolyshev (Author) commented

Fixed in PyTorch 1.9.0 (Windows 10, CUDA 10.2, RTX 2060)

JeanKaddour commented Jan 9, 2022

I recently ran into these issues too (with PyTorch 1.10.1), even with vastly different scripts (e.g. training a GNN vs. only performing inference on a ViT model). The below did not help.

 os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
 torch.backends.cudnn.enabled = False

The internal status error (when running with CUDA_LAUNCH_BLOCKING = 1) wasn't informative to me either.
(screenshot: vit_bug)
The only pattern I noticed is that it happens when I run multiple scripts on different GPU devices on the same machine. For example, the error above occurred when I ran two scripts simultaneously (on cuda:0 and cuda:2, respectively, as verified by nvidia-smi) on a machine with 4x 3090s. I suspect some CPU/RAM issue is going on, but I haven't dug deeper into it.
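
One way to rule out cross-process interference in a setup like this is to pin each script to its own card with CUDA_VISIBLE_DEVICES before any CUDA work, so the processes cannot see each other's GPUs at all. A sketch, assuming a multi-GPU machine; the device index is illustrative:

    import os

    # Pin this process to one physical GPU; must run before CUDA is initialized.
    # Inside the process the selected card then appears as cuda:0.
    os.environ["CUDA_VISIBLE_DEVICES"] = "2"   # illustrative index on a 4-GPU machine

    import torch

    print(torch.cuda.device_count())           # 1
    x = torch.randn(4, 4, device="cuda")       # lands on physical GPU 2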

akashsharma02 commented

@JeanKaddour Were you able to get to the bottom of it? I seem to be observing similar issues with PyTorch 1.10.1 as well.

Saafke commented Jan 29, 2022

I also have this issue. PyTorch 1.9.0 + CUDA 10.2 + Python 3.7 + Ubuntu 18.04.

prabhatkumar95 commented Mar 17, 2022

I have the same issue on PyTorch 1.10/1.11/1.12 (source build) + Ubuntu 20.04 + Python 3.8/3.9 + CUDA 11.2/11.6.

A6000 / RTX 3090 GPU
AMD Threadripper Pro 3975WX

@akashsharma02 did you find any solution?
@alexeygolyshev the issue should probably be reopened.

alexeygolyshev (Author) commented

Hello @prabhatkumar95,

I can't reproduce the error from my first post in this thread. And the speed is good (on the same hardware): 6 seconds per epoch now vs 24 seconds 2 years ago.

Windows 10, Python 3.10.0, PyTorch 1.11, CUDA 11.3.1, RTX 2060

But I have reopened the issue at your request.

prabhatkumar95 commented Mar 17, 2022

Hi @alexeygolyshev, thanks. My current guess is that this issue is CPU-dependent: Intel CPUs run normally, but AMD ones hit the issue, tried with both an RTX 3090 and an A6000.
As the error is the same, I wanted to keep everything in the same thread. The issue is reported here.

ngimel (Collaborator) commented Mar 17, 2022

Closing, as there is a new issue tracking this.
