RuntimeError: CUDA error: an illegal memory access was encountered #21819
Comments
Could be the same cuDNN bug fixed in 7.6. See #16831. Could you try PyTorch 1.1?
@ssnl Thanks for your reply. I will do more trials and post the results here. This is really a weird error and very hard to debug.
@ssnl I updated the environment to PyTorch 1.1, CUDA 10.0, cuDNN 7.6, but this error still happens.
Can't reproduce with PyTorch 1.1 / CUDA 10 / cuDNN 7.6 after more than 5000 iterations (on both V100 and P100; the P100 should be similar to the Titan Xp).
Still having this problem.
@zhixuanli are you seeing the same error using the latest PyTorch release (1.3.0)?
I met the same problem with a 2080 Ti. Setting the batch size from 2 to 1 and reducing the number of gt boxes per image didn't work.
@ptrblck I tried PyTorch 1.3.0 and am still having the same problem.
Is this problem related to this one? In my case I get the same error. Any ideas on how to debug this?
@jzazo But when I set a specific GPU by
I'm getting this error as well, but it seems to depend on my batch size. I don't encounter it on smaller batch sizes.
@heiyuxiaokai @jzazo @kouohhashi @dan-nadler I still cannot reproduce the error for more than 20k iterations, so I would need (another) code snippet to reproduce this issue.
@ptrblck I am using a different script. Keeping the batch size down and moving the operations into functions seems to have solved it, though I'm staying around 80% GPU memory utilization. I had a handful of issues, though, so I'm not quite sure which change addressed which problem.
I tried this MNIST example. I added the following lines at the beginning of the script:
It's a different error than what I was getting in my own script, but the simple example still does not run. I just remembered that I followed this guide to move Xorg from being loaded on the discrete GPU to running on Intel's integrated chip. Could this change be responsible for this strange behavior?
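For reference, "lines at the beginning of the script" for pinning a process to one GPU commonly look like the sketch below; the device index "0" is an assumption for illustration, not necessarily the commenter's value:

```python
import os

# Restrict this process to a single physical GPU. This must run before CUDA
# is initialized, so keep it above the torch import. "0" is an example index.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
```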
I did the rollback and it didn't fix the issue. I once more removed the NVIDIA drivers, installed them and CUDA again, and I still get the error. I don't know how to find the source of the problem.
@dan-nadler the peak memory usage might have caused the OOM issue. @jzazo I cannot reproduce this issue by adding your provided code to the MNIST example on an 8-GPU system (rerunning with different GPU ids). What GPU are you using as GPU1? If it's the Intel integrated chip, this won't work.
I have the Intel integrated card and 2x GTX 1080 Ti on an Ubuntu 18.04 system. When I get some time I will try to narrow down the problem. I don't have a clue what's causing it.
Have you solved this problem? I met the same one recently. I can run the code correctly on one machine, but the bug arises on my own computer, even though the two machines have the same 2080 Ti card with the same driver and the same conda environment. @xiaoxiangyeyuwangye
Same problem. Ubuntu 16.04, 2080 Ti, driver version 440.33.01, CUDA version 10.2.
I'm having a potentially related issue as well. On a machine with 8 RTX 2080 Ti GPUs, one specific GPU (4) gives the CUDA illegal memory access issue when trying to copy from the GPU to the CPU:
Identical code runs fine on the other 7 GPUs but gives an error on this particular GPU after a random number of iterations.
I haven't done too much playing around, but this happens fairly repeatably (usually within 20-30 minutes of running), and only on this one particular GPU. Any developments on this issue before I start checking hardware?
@sicklife @bhaeffele Are you seeing this error using the code snippet from the first post on your setup?
Same problem here; it happens when I try to call .to(device). CUDA 9.2, torch 0.4.0, torchvision 0.2.1.
I ran the code from the first post for 1e6 iterations without any errors on my "problematic" GPU. Still getting the error with my code on that GPU only.
@knagrecha @bhaeffele Could you post a (minimal) executable code snippet to reproduce this error?
Try this: input0 = Variable(torch.randn(32, 3, 1024).cuda()), and don't forget from torch.autograd import Variable.
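For anyone trying that suggestion on a recent release, here is a self-contained version; note that Variable has been a no-op wrapper around plain tensors since PyTorch 0.4, so it can be dropped:

```python
import torch
from torch.autograd import Variable  # deprecated no-op since PyTorch 0.4

# The suggested test input: a random batch of 32 samples, 3 channels, length 1024.
input0 = Variable(torch.randn(32, 3, 1024).cuda())

# Equivalent modern form without the deprecated wrapper:
input0 = torch.randn(32, 3, 1024, device="cuda")
print(input0.shape, input0.device)
```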
@hadypranoto
Same issue for me as well; it would be nice to reopen.
Same issue for me as well; please reopen. I can fix the issue by downsizing my images (the batch size was already 1), but it seems to otherwise be leaking memory somehow.
In my case I have a batch size of 110, which consumes around 14 GB of GPU memory. But if I go a bit above this, say 120, then I hit this illegal memory access issue. Those additional 10 items are unlikely to consume the 80 GB I have in total on my A100 system...
This issue won't be reopened #21819 (comment)
Just update the CUDA version to 11.3 and the PyTorch version to the latest stable version. My problem disappears.
For me, I just used Tensor.contiguous().cuda() before feeding it to the model and this problem got fixed.
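A minimal sketch of that .contiguous().cuda() workaround; the model and shapes are placeholders chosen for illustration, not taken from the thread:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()  # placeholder model

x = torch.randn(128, 4).t()          # the transpose leaves x non-contiguous in memory
out = model(x.contiguous().cuda())   # force a contiguous layout before moving to the GPU
print(out.shape)                     # torch.Size([4, 10])
```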
I solved this issue by either:
so probably older CUDA had some bug in the convolution code
Still having this issue with PyTorch 2.0, CUDA 11.7, and NVIDIA driver 525.60.13. (@bknyaz I'm using 1x1 nn.Conv2d as well, not sure if this is the cause.)
I met this problem when increasing the batch size, and the error always occurs in nn.MaxPool1d.
Upgrading the torch version may be a solution. I solved this problem by upgrading torch==1.8.1 to torch==1.9.0.
Hi guys, I have tried upgrading the versions (PyTorch 2.0, CUDA 11.7 => 11.8), and I still met this problem in the training code of two models. The last 3 lines in the terminal:

```
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
```
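As the message says, CUDA kernel launches are asynchronous, so the reported stack trace can point at the wrong op; forcing blocking launches makes the trace land on the real culprit. A typical way to enable it from Python (it must be set before CUDA is initialized):

```python
import os

# Must be set before the first CUDA call, so keep it at the very top of the script.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import after the environment variable is set
```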
This has been an issue for me for a while. After updating to nightly (or maybe it was just a pytorch-cuda version issue), it is all good for "ddp" training. OS: AWS SageMaker ml.p2.8xlarge.
Same error on 'cuda:1'.
This is because the GPU utilization remains at 100% after the CUDA error and does not drop, so the GPU sinks a lot of power; since a laptop power supply is not very powerful, this results in power degradation of other devices/peripherals.
We were facing this problem at inference time after hundreds of iterations. The error appears in this configuration:
and it was solved by changing to this configuration:
I was having the same problem while trying to run multiple models in parallel on a Docker image (nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04).
So it looks like a compatibility bug?
This solved my problem.
In my case, it was just because the tensor output from the neural network was not contiguous; I added .contiguous() to the output tensor and everything was fine.
Hi all, in my case I just changed my batch size.
I solved it by
I got the same error using torch==2.2.0.
This worked for me as well, thanks.
Hi, everyone!
I met a strange illegal memory access error. It happens randomly without any regular pattern.
The code is really simple. It is PointNet for point cloud segmentation. I don't think there is anything wrong in the code.
After a random number of steps, the error is raised. The error report is:
When I added "os.environ['CUDA_LAUNCH_BLOCKING'] = '1'" at the top of the script, the error report changed to this:
I know that wrong indexing operations and incorrect usage of a loss function may lead to an illegal memory access error, but there is no such operation in this script.
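For comparison, an out-of-bounds index is the textbook way to trigger this class of error; the deliberately broken sketch below reproduces it on most setups (depending on the PyTorch version it may be reported as a device-side assert rather than an illegal memory access):

```python
import torch

x = torch.randn(10, device="cuda")
bad_idx = torch.tensor([42], device="cuda")  # out of range for a length-10 tensor

y = x[bad_idx]            # the kernel may launch without an immediate error
torch.cuda.synchronize()  # the asynchronous failure typically surfaces here
```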
I am quite sure this error is not caused by running out of memory, since only about 2 GB of GPU memory is used and I have 12 GB of GPU memory in total.
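A quick way to back up that kind of headroom claim from inside the script is to log allocator statistics around the failing step; a minimal sketch:

```python
import torch

# Current, reserved, and peak allocations on device 0, in GiB, to rule out OOM.
gib = 1024 ** 3
print(f"allocated {torch.cuda.memory_allocated(0) / gib:.2f} GiB, "
      f"reserved {torch.cuda.memory_reserved(0) / gib:.2f} GiB, "
      f"peak {torch.cuda.max_memory_allocated(0) / gib:.2f} GiB")
```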
This is my environment information:
I have been stuck here for a long time.
In fact, it is not only this project that hits this error; many other projects hit a similar error on my computer.
I don't think there is anything wrong with the code. It can run correctly for some steps. Maybe this error is because of the environment. I am not sure.
Does anyone have any idea about this situation? If more detailed information is needed, please let me know.
Thanks for any suggestion.