
preds are nan #28

Open
zhangzaibin opened this issue May 26, 2023 · 6 comments

Comments

@zhangzaibin

Thanks for your great work. I have an issue: in stage 2, my preds are NaN at the start of training and it then errors out. Have you ever encountered this problem?
I am training with VoxFormer-T.
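
A minimal sketch of how the NaN could be localised, assuming a standard PyTorch training loop rather than anything VoxFormer-specific: anomaly detection reports the backward op that first produces NaN/Inf, and an explicit finiteness check fails fast on the predictions and the loss before the weights are corrupted.

```python
# Sketch only (generic PyTorch, not VoxFormer-specific): localise where NaNs
# first appear instead of letting training diverge silently.
import torch

torch.autograd.set_detect_anomaly(True)  # reports the backward op that produced NaN/Inf

def check_finite(name: str, t: torch.Tensor) -> None:
    """Fail fast if a tensor already holds NaN/Inf values."""
    if not torch.isfinite(t).all():
        raise RuntimeError(f"{name} contains NaN/Inf")

# Toy usage; in stage-2 training this check would wrap the preds and the loss.
preds = torch.randn(4, 10, requires_grad=True)
check_finite("preds", preds)
loss = preds.pow(2).mean()
check_finite("loss", loss)
loss.backward()
```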

@KSonPham

I have this problem too.

@RoboticsYimingLi
Contributor

Different machines behave differently. Could you try running it a few more times?

@KSonPham

Yes, for me the problem goes away when I set the number of workers to 0 (though not always), or when I run in a Docker environment (no errors whatsoever). A separate problem is that a large number of workers, such as the default of 4, fills up my 32 GB of memory.
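
For anyone trying the same workaround, a minimal sketch of forcing single-process data loading in plain PyTorch follows; in an mmdet/mmdet3d-style config (which VoxFormer builds on) the corresponding knob is usually `workers_per_gpu`, though that key name is an assumption here rather than something taken from this repo.

```python
# Sketch only: single-process data loading (num_workers=0) to rule out
# worker-related crashes and cut host-memory use from prefetching workers.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(8, 3, 32, 32))  # stand-in for the real dataset

loader = DataLoader(
    dataset,
    batch_size=1,
    num_workers=0,   # 0 = load batches in the main process, no worker subprocesses
    pin_memory=True,
)

for (batch,) in loader:
    pass  # training step would go here
```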

@ziming-liu

Is it a CUDA memory error? what(): CUDA error: an illegal memory access was encountered

@willemeng

I ran into a similar problem:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fbabb853a22 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x10aa3 (0x7fbac4010aa3 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7fbac4012147 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fbabb83d5a4 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0xa2822a (0x7fb952a2822a in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xa282c1 (0x7fb952a282c1 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

frame #25: + 0x29d90 (0x7fbaeb029d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: __libc_start_main + 0x80 (0x7fbaeb029e40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Strangely, the error does not occur when I debug remotely, but it shows up as soon as I run from the remote server's terminal; only very occasionally does it run normally.
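
A minimal sketch of the debugging step the error message itself suggests: with `CUDA_LAUNCH_BLOCKING=1`, kernel launches become synchronous, so the stack trace points at the call that actually triggered the device-side assert (often an out-of-range index). The variable has to be set before CUDA is initialised.

```python
# Sketch only: make CUDA kernel launches synchronous so the stack trace points
# at the real faulting call. Must run before any CUDA work is done.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

x = torch.randn(4, device="cuda")
print(x.sum())  # with blocking launches, an assert/illegal access surfaces at the faulting op
```

Equivalently, the variable can be exported in the shell before launching the training script.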

@zzk785089755

zzk785089755 commented Jan 29, 2024

I also encountered this issue. Deleting the ./VoxFormer/deform_attn_3d directory and re-uploading it resolved the issue. I'm curious about the reason and hope the author can provide an explanation.
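
One plausible explanation (an assumption, not something confirmed by the authors) is a stale or partially transferred build of the custom deformable-attention extension; re-uploading the directory and rebuilding would then start from a clean state. A hedged smoke test of that hypothesis is sketched below; the module name is a placeholder, not the repo's actual API.

```python
# Hypothetical smoke test: verify the compiled extension imports, and note the
# PyTorch/CUDA versions it must have been built against. The module name below
# is a placeholder (assumption); use whatever ./VoxFormer/deform_attn_3d installs.
import torch

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)

try:
    import deform_attn_3d  # placeholder module name, not confirmed by the repo
    print("deformable-attention extension imported OK")
except ImportError as err:
    # Undefined-symbol or missing-module errors here usually indicate a stale or
    # mismatched build; rebuilding from a clean copy of the directory helps.
    print("extension import failed:", err)
```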
