PyTorch 1.3: random "RuntimeError: CUDA error: unspecified launch failure" #27837
Comments
Can you provide a minimal code example to reproduce? Please also copy and paste the output from our environment collection script. You can get the script and run it with:
|
Hello @vincentqb, Code example: https://github.com/pytorch/pytorch/files/3723821/PyTorch.zip Output from the environment collection script:
|
@alexeygolyshev is that a minimal example? Looks like there is a lot of code in there. |
Hello @albanD, |
@albanD My inputs: [sentences, words, characters]. I have 2 varying dimensions: different number of words in a sentence and different number of characters in a word. |
Unfortunately I don't have a setup with notebook available. Could you run your code with |
Hello, my system is the same as yours (Win10 (1903), Python 3.7.4, RTX 2060 (driver version 441.20), torch.version==1.2.0). You say there is no problem in PyTorch 1.2. Can you tell me all the details of your setup for that version? |
Hello @JYH9351,
Crashes less frequently, not in the first 2 epochs. |
Does switching off the TDR settings help? https://zhuanlan.zhihu.com/p/38141415 |
No. TDR = 60. Run 2 times. Crashed in epochs 2 and 11. This error appears randomly.
|
I have to say that it is difficult to say where the problem is without the stacktrace including the exact crash site. But we may get that with the help of a RelWithDebInfo build and the attachment of the VS debugger. I could build one for you if you have trouble in building the project. |
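One lighter-weight way to narrow down the crash site before attaching a debugger is to make CUDA kernel launches synchronous, so the Python traceback points at the operation that actually failed instead of a later one. The sketch below is a minimal launcher under that assumption: `CUDA_LAUNCH_BLOCKING` is a real CUDA environment variable, but the helper names and the `train.py` script name are hypothetical, and nobody in this thread confirmed that this exposes the failure discussed here.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # With this set, each kernel launch blocks until it completes, so an
    # error is reported at the call that caused it (training runs slower).
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_launches(script_path, *script_args):
    """Run a training script (hypothetical path) in a child process with the debug env."""
    cmd = [sys.executable, script_path, *script_args]
    # check=False: a crash in the child is expected here, so return its exit
    # code instead of raising.
    return subprocess.run(cmd, env=build_debug_env(), check=False).returncode
```

For example, `run_with_blocking_launches("train.py")` would start the (hypothetical) training script with the variable set. The variable has to be in place before the process initializes CUDA, which is why the sketch sets it for a child process rather than in the middle of a running session.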
It would be great if you could prepare the debug build. I don't have much experience. |
Interesting. |
I had this issue training a model from https://github.com/wgrathwohl/JEM with PyTorch 1.3
The error happened seemingly randomly in the middle of training. I am using linux mint, not Windows. |
I suggest that you try again after uninstalling the GPU driver with DDU and installing the driver that comes with the CUDA toolkit. There are too many bugs in the Nvidia GPU driver on Windows 10. |
I have run into this same issue and tried the suggestion of @kice of installing the driver from the cuda toolkit with no luck. |
I am running into similar issues on my Windows machine. I have a simple pipeline for binary classification with an LSTM, and it crashes at seemingly random epochs. |
My issue is also with lstm. Interestingly when I add |
Same problem with LSTM + binary classification + error in random epoch on windows 10 + Pytorch 1.4
update: |
@shingyipcheung Are you able to replicate the error with torch.autograd.set_detect_anomaly(True) set in order to get a full stacktrace? |
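A minimal sketch of how anomaly detection could be wrapped around a single training step follows. `torch.autograd.detect_anomaly()` is the context-manager counterpart of `torch.autograd.set_detect_anomaly(True)`; the helper names `anomaly_guard` and `run_step` are hypothetical, and the guarded import is only there so the sketch can be loaded on a machine without PyTorch. Anomaly detection mainly reports which forward operation produced a failing backward pass, so it is an assumption that it helps with an asynchronous launch failure like this one.

```python
import contextlib

try:
    import torch
except ImportError:  # allows reading/running the sketch without PyTorch
    torch = None


def anomaly_guard(enabled=True):
    """Return a context manager that turns on autograd anomaly detection."""
    if torch is None or not enabled:
        # Nothing to enable: fall back to a no-op context manager.
        return contextlib.nullcontext()
    return torch.autograd.detect_anomaly()


def run_step(step_fn, debug=True):
    """Run one training step (any zero-argument callable) under the guard."""
    with anomaly_guard(debug):
        return step_fn()
```

In a training loop, `step_fn` would be a closure that runs the forward pass, `loss.backward()`, and the optimizer step for one batch. Anomaly detection slows training noticeably, so it is usually switched off again once the trace has been captured.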
I'm having this issue as well! (EDIT: on latest 1.4). The network will train for awhile, then at some random point, the classifier will halt with this exception. It is possible to reproduce by using FastAI AWD-LSTM transfer learning for text classification on a very large dataset: https://docs.fast.ai/text.html After this happens, further CUDA operations result in the same error until the kernel is restarted. I suspect a lot of this simply does not get tested on Windows. Professionally, I always use Linux for machine learning tasks. It just so happens that my only personal system with a GPU runs Windows and does not have space for a Linux install. Furthermore, "Ubuntu on Windows" does not support CUDA. |
I have the same issue when I train with LSTM + classification, the error occurs in random epoch on windows 10 + Pytorch 1.4 |
Exception has occurred: RuntimeError |
Same error on Windows, training an LSTM on an RTX 2080 Ti. Happens with both PyTorch 1.5 and 1.6. |
Switching from 1.6 to 1.5 and downgrading my Nvidia driver to 431.86 fixed the error for me. |
Same error while training an LSTM with a big batch size on windows, I was getting random crashes after 1 to 20 epochs. Setting Pytorch 1.5.1 |
@lucas-emery, did you try extending the TDR delay or disabling TDR, as described at https://developer.download.nvidia.com/NsightVisualStudio/2.2/Documentation/UserGuide/HTML/Content/Timeout_Detection_Recovery.htm |
@mszhanyi I did try extending the TDR to 60 seconds. I was able to run a 13-hour training session after setting the TDR and restarting my PC, but the backprop time was also faster (down from 1 min to 10-15 secs), so I guess it was just a coincidence and cuDNN chose a different algorithm. |
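To confirm which TDR values are actually in effect after a change and a reboot, the registry can be read back. The sketch below assumes the documented Windows location `HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers` and the value names `TdrLevel`, `TdrDelay`, and `TdrDdiDelay`; the function name is hypothetical. It only reads values and returns an empty dict on non-Windows systems.

```python
import sys

TDR_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
TDR_VALUES = ("TdrLevel", "TdrDelay", "TdrDdiDelay")


def read_tdr_settings():
    """Return explicitly configured TDR registry values as a dict.

    Values that were never set are simply absent; Windows then uses its
    defaults (for TdrDelay, 2 seconds).
    """
    if not sys.platform.startswith("win"):
        return {}
    import winreg  # Windows-only standard library module

    settings = {}
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TDR_KEY) as key:
            for name in TDR_VALUES:
                try:
                    value, _value_type = winreg.QueryValueEx(key, name)
                    settings[name] = value
                except FileNotFoundError:
                    # This value is not set in the registry.
                    pass
    except OSError:
        # Key missing or not readable; report nothing rather than fail.
        pass
    return settings
```

On a machine configured as described above, `read_tdr_settings()` would be expected to include a `TdrDelay` entry of 60. A `TdrLevel` of 0 means timeout detection is disabled.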
@lucas-emery, could you provide a simplified script that I could use to reproduce it? |
@mszhanyi I'm afraid it won't be possible, it's a very complex model on a reinforcement learning task. I'll let you know if I find anything else. I'll try to get something reproducible after I finish. |
I'm getting this issue on my RTX 3080, and I can't even downgrade PyTorch because older versions don't support RTX 3000. These two fixes worked for me, but both have a performance penalty:
|
Same issue on These fixes worked for me, too.
|
We are facing the same issue. We tried on Ubuntu 18.04 with Nvidia K80, M60, and V100 GPUs, all with the same PyTorch version. Applying the fix below doesn't help either... :(
|
Facing same error on 2080Ti(sm_75)+Windows10+CUDA11.1, not using Pytorch. |
Same issue in GRU + pytorch1.8 + single thread + Cuda11.1 + Windows10 + RTX3090 |
Fixed in PyTorch 1.9.0 (Windows 10, CUDA 10.2, RTX 2060) |
@JeanKaddour Were you able to get to the bottom of it? I seem to be observing similar issues with Pytorch |
I also have this issue. Pytorch 1.9.0 + CUDA10.2 + Python3.7 + Ubuntu 18.04 |
I have the same issue on PyTorch (1.10/1.11/1.12 (source build)) + Ubuntu 20.04 + Python 3.8/3.9 + CUDA 11.2/11.6 + A6000 / RTX 3090 GPU. @akashsharma02 did you find any solution? |
Hello @prabhatkumar95, I can't reproduce the error from my first post in this thread. And the speed is good (on the same hardware): 6 seconds per epoch now vs 24 seconds 2 years ago. Windows 10, Python 3.10.0, PyTorch 1.11, CUDA 11.3.1, RTX 2060 But I reopened the issue on your request. |
Hi @alexeygolyshev, thanks. My guess is that this issue is now CPU dependent: Intel CPUs run normally, while AMD CPUs hit the error. I tried with both an RTX 3090 and an A6000. As the error is the same, I wanted to keep everything in the same thread. The issue is reported here |
Closing, as there is the new issue |
🐛 Bug
No problem in PyTorch 1.2. Archive with code and data: https://github.com/pytorch/pytorch/files/3723821/PyTorch.zip
Windows 10 (1903), Python 3.7.4, RTX 2060 (driver version 436.48)
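Because the reports in this thread differ mainly in OS, Python version, GPU, and driver version, it can help to capture those details the same way every time. The sketch below is a rough stand-in that uses only the standard library and `nvidia-smi`; it is not PyTorch's environment collection script mentioned earlier in the thread, and the function name is hypothetical.

```python
import platform
import shutil
import subprocess


def collect_basic_env():
    """Collect OS, Python, and (if available) GPU name and driver version."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    nvidia_smi = shutil.which("nvidia-smi")
    if nvidia_smi is None:
        info["gpu"] = "nvidia-smi not found"
        return info
    try:
        result = subprocess.run(
            [nvidia_smi, "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        info["gpu"] = f"nvidia-smi failed: {exc}"
        return info
    # One line per GPU, e.g. "<name>, <driver version>"; fall back to stderr.
    info["gpu"] = result.stdout.strip() or result.stderr.strip()
    return info


if __name__ == "__main__":
    for key, value in collect_basic_env().items():
        print(f"{key}: {value}")
```

The PyTorch and CUDA versions still have to be added by hand or taken from the official script, since this sketch deliberately avoids importing `torch`.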
cc @ezyang @gchanan @zou3519 @ssnl @albanD @gqchen @ngimel @peterjc123