RuntimeError: CUDA error: an illegal memory access was encountered #21819
Comments
Could be the same cuDNN bug fixed in 7.6. See #16831. Could you try PyTorch 1.1?
@ssnl Thanks for your reply. I will do more trials and post the results here. This is really a weird error and very hard to debug.
@ssnl I updated the environment to PyTorch 1.1, CUDA 10.0, cuDNN 7.6, but this error still happens.
Can't reproduce with PyTorch 1.1 / CUDA 10 / cuDNN 7.6 after more than 5000 iterations (on both V100 and P100; the P100 should be similar to the Titan Xp).
Still having this problem.
@zhixuanli are you seeing the same error using the latest PyTorch release (1.3.0)?
I met the same problem with a 2080 Ti. Setting the batch size from 2 to 1 and reducing the number of gt boxes per image didn't work.
@ptrblck I tried PyTorch 1.3.0 and am still having the same problem.
Is this problem related to this one? In my case I get the same error. Any ideas on how to debug this?
@jzazo But when I set a specific GPU by
I'm getting this error as well, but it seems to depend on my batch size. I don't encounter it on smaller batch sizes.
@heiyuxiaokai @jzazo @kouohhashi @dan-nadler I still cannot reproduce the error for more than 20k iterations, so I would need (another) code snippet to reproduce this issue.
@ptrblck I am using a different script. Keeping the batch size down and moving the operations into functions seems to have solved it, though I'm staying around 80% GPU memory utilization. I had a handful of issues, though, so I'm not quite sure which change addressed which problem.
I tried this MNIST example. I added the following lines at the beginning of the script:
It's a different error than what I was getting in my own script, but the simple example still does not run. I just remembered that I followed this guide to move Xorg from being loaded on the discrete GPU to running on Intel's integrated chip. Could this change be responsible for this strange behavior?
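For reference, "lines at the beginning of the script" for pinning a process to one GPU commonly look like the sketch below; the device index "0" is an assumption for illustration, not necessarily the commenter's value:

```python
import os

# Restrict this process to a single physical GPU. This must run before CUDA
# is initialized, so keep it above the torch import. "0" is an example index.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
```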
I did the rollback and it didn't fix the issue. I once more removed the NVIDIA drivers, installed them and CUDA again, and I still get the error. I don't know how to find the source of the problem.
@dan-nadler the peak memory usage might have caused the OOM issue. @jzazo I cannot reproduce this issue by adding your provided code to the MNIST example on an 8-GPU system (rerunning with different GPU ids). What GPU are you using as GPU1? If it's the Intel integrated chip, this won't work.
I have the Intel integrated card and 2x GTX 1080 Ti on an Ubuntu 18.04 system. When I get some time I will try to narrow down the problem. I don't have a clue what's causing it.
Have you solved this problem? I met the same one recently. I can run the code correctly on one machine, but the bug arises on my own computer, even though the two machines have the same 2080 Ti card with the same driver and the same conda environment. @xiaoxiangyeyuwangye
Same problem. Ubuntu 16.04, 2080 Ti, driver version 440.33.01, CUDA version 10.2.
I'm having a potentially related issue as well. On a machine with 8 RTX 2080 Ti GPUs, one specific GPU (4) gives the CUDA illegal memory access issue when trying to copy from the GPU to the CPU:
Identical code runs fine on the other 7 GPUs but gives an error on this particular GPU after a random number of iterations.
I haven't done too much playing around, but this happens fairly repeatably (usually within 20-30 minutes of running), and only on this one particular GPU. Any developments on this issue before I start checking hardware?
@sicklife @bhaeffele Are you seeing this error using the code snippet from the first post on your setup?
Same problem here; it happens when I try to call .to(device). CUDA 9.2, torch 0.4.0, torchvision 0.2.1.
I ran the code from the first post for 1e6 iterations without any errors on my "problematic" GPU. Still getting the error with my code on that GPU only.
@knagrecha @bhaeffele Could you post a (minimal) executable code snippet to reproduce this error?
Try this: input0 = Variable(torch.randn(32, 3, 1024).cuda()), and don't forget from torch.autograd import Variable.
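For anyone trying that suggestion on a recent release, here is a self-contained version; note that Variable has been a no-op wrapper around plain tensors since PyTorch 0.4, so it can be dropped:

```python
import torch
from torch.autograd import Variable  # deprecated no-op since PyTorch 0.4

# The suggested test input: a random batch of 32 samples, 3 channels, length 1024.
input0 = Variable(torch.randn(32, 3, 1024).cuda())

# Equivalent modern form without the deprecated wrapper:
input0 = torch.randn(32, 3, 1024, device="cuda")
print(input0.shape, input0.device)
```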
@hadypranoto
Same issue for me as well; it would be nice to reopen.
Same issue for me as well; please reopen. I can fix the issue by downsizing my images (the batch size was already 1), but it seems to otherwise be leaking memory somehow.
In my case I have a batch size of 110, which consumes around 14 GB of GPU memory. But if I go a bit above this, say 120, then I hit this illegal memory access issue. Those additional 10 items are unlikely to consume the 80 GB I have in total on my A100 system...
This issue won't be reopened #21819 (comment)
Just update the CUDA version to 11.3 and the PyTorch version to the latest stable version. My problem disappears.
For me, I just used Tensor.contiguous().cuda() before feeding it to the model and this problem got fixed.
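A minimal sketch of that .contiguous().cuda() workaround; the model and shapes are placeholders chosen for illustration, not taken from the thread:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()  # placeholder model

x = torch.randn(128, 4).t()          # the transpose leaves x non-contiguous in memory
out = model(x.contiguous().cuda())   # force a contiguous layout before moving to the GPU
print(out.shape)                     # torch.Size([4, 10])
```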
I solved this issue by either:
so probably older CUDA had some bug in the convolution code
Still having this issue with PyTorch 2.0, CUDA 11.7, and NVIDIA driver 525.60.13. (@bknyaz I'm using 1x1 nn.Conv2d as well, not sure if this is the cause.)
I met this problem when increasing the batch size, and the error always occurs in nn.MaxPool1d.
Upgrading the torch version may be a solution. I solved this problem by upgrading torch==1.8.1 to torch==1.9.0.
Hi guys, I have tried upgrading the versions (PyTorch 2.0, CUDA 11.7 => 11.8), and I still met this problem in the training code of two models. The last 3 lines in the terminal:

```
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
```
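As the message says, CUDA kernel launches are asynchronous, so the reported stack trace can point at the wrong op; forcing blocking launches makes the trace land on the real culprit. A typical way to enable it from Python (it must be set before CUDA is initialized):

```python
import os

# Must be set before the first CUDA call, so keep it at the very top of the script.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import after the environment variable is set
```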
This has been an issue for me for a while. After updating to nightly (or maybe it was just a pytorch-cuda version issue), it is all good for "ddp" training. OS: AWS SageMaker ml.p2.8xlarge.
Same error on 'cuda:1'.
This is because the GPU utilization remains at 100% after the CUDA error and does not drop, so the GPU sinks a lot of power; since a laptop power supply is not very powerful, this results in power degradation of other devices/peripherals.
We were facing this problem at inference time after hundreds of iterations. The error appears in this configuration:
and it was solved by changing to this configuration:
I was having the same problem while trying to run multiple models in parallel on a Docker image (nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04).
So it looks like a compatibility bug?
This solved my problem.
In my case, it was just because the tensor output from the neural network was not contiguous; I added .contiguous() to the output tensor and everything was fine.
Hi all, in my case I just changed my batch size.
I solved it by
I got the same error using torch==2.2.0.
This worked for me as well, thanks.
Hi, everyone!
I met a strange illegal memory access error. It happens randomly without any regular pattern.
The code is really simple. It is PointNet for point cloud segmentation. I don't think there is anything wrong in the code.
After a random number of steps, the error is raised. The error report is:
When I added "os.environ['CUDA_LAUNCH_BLOCKING'] = '1'" at the top of the script, the error report changed to this:
I know that wrong indexing operations and incorrect usage of a loss function may lead to an illegal memory access error, but there is no such operation in this script.
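For comparison, an out-of-bounds index is the textbook way to trigger this class of error; the deliberately broken sketch below reproduces it on most setups (depending on the PyTorch version it may be reported as a device-side assert rather than an illegal memory access):

```python
import torch

x = torch.randn(10, device="cuda")
bad_idx = torch.tensor([42], device="cuda")  # out of range for a length-10 tensor

y = x[bad_idx]            # the kernel may launch without an immediate error
torch.cuda.synchronize()  # the asynchronous failure typically surfaces here
```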
I am quite sure this error is not caused by running out of memory, since only about 2 GB of GPU memory is used and I have 12 GB of GPU memory in total.
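A quick way to back up that kind of headroom claim from inside the script is to log allocator statistics around the failing step; a minimal sketch:

```python
import torch

# Current, reserved, and peak allocations on device 0, in GiB, to rule out OOM.
gib = 1024 ** 3
print(f"allocated {torch.cuda.memory_allocated(0) / gib:.2f} GiB, "
      f"reserved {torch.cuda.memory_reserved(0) / gib:.2f} GiB, "
      f"peak {torch.cuda.max_memory_allocated(0) / gib:.2f} GiB")
```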
This is my environment information:
I have been stuck here for a long time.
In fact, it is not only this project that hits this error; many other projects hit a similar error on my computer.
I don't think there is anything wrong with the code. It can run correctly for some steps. Maybe this error is because of the environment. I am not sure.
Does anyone have any idea about this situation? If more detailed information is needed, please let me know.
Thanks for any suggestion.