
Calling spspmm twice gives CUDA error: an illegal memory access was encountered #174

Open
patmjen opened this issue Sep 22, 2021 · 11 comments
Labels: bug (Something isn't working) · help wanted (Extra attention is needed)

Comments

patmjen commented Sep 22, 2021

Summary

Running spspmm two times with the same inputs gives RuntimeError: CUDA error: an illegal memory access was encountered.

The following snippet shows the issue for me:

import torch
from torch_sparse import spspmm

# device = torch.device('cpu')  # This works
device = torch.device('cuda')  # This will error

# Make two simple sparse matrices
A_idx = torch.tensor([[0, 1], [0, 1]])
A_val = torch.tensor([1, 1]).float()

B_idx = torch.tensor([[0, 0, 1], [0, 1, 1]])
B_val = torch.tensor([2, 3, 4]).float()

# Transfer to device
print(f'To {device}')
A_idx = A_idx.to(device)
A_val = A_val.to(device)
B_idx = B_idx.to(device)
B_val = B_val.to(device)

# Do matrix multiplies
print('spspmm 1')
spspmm(A_idx, A_val, B_idx, B_val, 2, 2, 2, coalesced=True)  # This works
print('spspmm 2')
spspmm(A_idx, A_val, B_idx, B_val, 2, 2, 2, coalesced=True)  # On CUDA, this errors

When I run the above code, I get the following error:

To cuda
spspmm 1
spspmm 2
Traceback (most recent call last):
  File "sparsebug.py", line 25, in <module>
    spspmm(A_idx, A_val, B_idx, B_val, 2, 2, 2, )  # On CUDA, this errors
  File "venv/lib/python3.8/site-packages/torch_sparse/spspmm.py", line 30, in spspmm
    C = matmul(A, B)
  File "venv/lib/python3.8/site-packages/torch_sparse/matmul.py", line 139, in matmul
    return spspmm(src, other, reduce)
  File "venv/lib/python3.8/site-packages/torch_sparse/matmul.py", line 116, in spspmm
    return spspmm_sum(src, other)
  File "venv/lib/python3.8/site-packages/torch_sparse/matmul.py", line 101, in spspmm_sum
    rowptrC, colC, valueC = torch.ops.torch_sparse.spspmm_sum(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Sorry if this is just me using the library incorrectly! Is there something I should be doing in between calls to spspmm, or any other way to fix it?
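For example (purely a guess on my part), would something like this be needed between the two calls? torch.cuda.synchronize() and .contiguous() are just the first things that came to mind:

# Hypothetical workaround attempt between the two spspmm calls (just a guess):
torch.cuda.synchronize()  # flush any pending kernels from the first call
A_idx, A_val = A_idx.contiguous(), A_val.contiguous()
B_idx, B_val = B_idx.contiguous(), B_val.contiguous()
spspmm(A_idx, A_val, B_idx, B_val, 2, 2, 2, coalesced=True)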

Environment

$ python collect_env.py
Collecting environment information...
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Scientific Linux release 7.7 (Nitrogen) (x86_64)
GCC version: (GCC) 8.3.0
Clang version: Could not collect
CMake version: version 2.8.12.2
Libc version: glibc-2.17

Python version: 3.8.4 (default, Jul 16 2020, 09:01:13)  [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-glibc2.2.5
Is CUDA available: True
CUDA runtime version: 11.1.74
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB

Nvidia driver version: 470.42.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] pytorch3d==0.5.0
[pip3] torch==1.9.0+cu111
[pip3] torch-scatter==2.0.8
[pip3] torch-sparse==0.6.12
[pip3] torchvision==0.10.0+cu111
[conda] Could not collect
rusty1s (Owner) commented Sep 23, 2021

Weird, it works for me, using CUDA 11.1. Does running with CUDA_LAUNCH_BLOCKING=1 give you a more reasonable error message? Is it possible for you to determine which call in spspmm_cuda.cu accesses illegal memory?
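For reference, a minimal way to set this up (the variable has to be set before the CUDA context is created, so either on the command line or at the very top of the script; running the script under compute-sanitizer from the CUDA toolkit is another option if it is installed):

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before torch initializes CUDA

import torch
from torch_sparse import spspmm
# ... then run the reproduction snippet from above; the failing kernel should
# now be reported at its own call site instead of at a later API call.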

patmjen (Author) commented Sep 23, 2021

Unfortunately no, adding CUDA_LAUNCH_BLOCKING=1 does not change the error (except that it no longer suggests using CUDA_LAUNCH_BLOCKING=1).

Is there a way I could determine which call performs the illegal access without recompiling? I suspect not, but there's no harm in asking.

What graphics card are you using? I once had to deal with a bug that only showed up on newer cards (despite using the same CUDA version) because they had changed how certain illegal operations were handled: the older cards silently ignored the illegal operation (so I never discovered it), while the newer ones surfaced it, and the bug popped up there. Maybe it's something similar here?

rusty1s (Owner) commented Sep 24, 2021

I think you will have to re-compile to do further debugging. I have tested it on a 1080 Ti, a 2080 Ti and a Titan RTX, and they all work fine.

@JiaxuanYou, @RexYing: Can you also check if you can reproduce this issue?

patmjen (Author) commented Sep 24, 2021

I also just tested it on an NVIDIA GeForce RTX 2070 Super on my Windows 10 machine, and the bug does not show up there. So maybe it is dependent on the card.

Unfortunately, I won't have time for further debugging in the near future. Sorry! I know this makes it hard to proceed, so feel free to close the issue if you want.

rusty1s (Owner) commented Sep 27, 2021

Thanks for reporting. I'm still leaving this issue open. If someone else has the same problem and is willing to debug, we can hopefully fix this.

@thijssnelleman

Is anybody still working on this? I ran into the same issue while deploying Graph-UNET, which relies on spspmm. I could perhaps try to debug it.

rusty1s (Owner) commented Apr 4, 2022

It would be of great help if you could try to debug it :)

RexYing pushed a commit to RexYing/pytorch_sparse that referenced this issue Apr 26, 2022

daeunni commented Apr 30, 2022

I ran into the same error. Can anyone address this issue?

rusty1s (Owner) commented May 2, 2022

Does this mean that #228 is resolved for you?

@andreimargeloiu

@thijssnelleman how did you solve the issue?

@thijssnelleman

I believe I replaced the layer that made use of this function with another layer. Not much of a solution, but it worked in my situation.
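For anyone who just needs a stopgap on small graphs, a dense fallback in place of spspmm could look roughly like this. It is an untested sketch, not related to the actual fix, and it assumes the index tensors are already coalesced (duplicate entries would overwrite rather than sum):

import torch

def dense_spspmm(A_idx, A_val, B_idx, B_val, m, k, n):
    # Materialize both operands densely, multiply, then convert back to COO.
    # Only sensible for small graphs: memory is O(m*k + k*n + m*n).
    A = torch.zeros(m, k, dtype=A_val.dtype, device=A_val.device)
    A[A_idx[0], A_idx[1]] = A_val
    B = torch.zeros(k, n, dtype=B_val.dtype, device=B_val.device)
    B[B_idx[0], B_idx[1]] = B_val
    C = A @ B
    C_idx = C.nonzero(as_tuple=False).t()
    C_val = C[C_idx[0], C_idx[1]]
    return C_idx, C_val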
