
preds are nan #28

Open
zhangzaibin opened this issue May 26, 2023 · 6 comments

Comments

@zhangzaibin

Thanks for your great work. I have an issue: in stage 2, my preds are NaN at the start of training and it then errors out. Have you ever encountered this problem?
I am training with VoxFormer-T.
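
A minimal sketch of how the NaN could be localised, assuming a standard PyTorch training loop rather than anything VoxFormer-specific: anomaly detection reports the backward op that first produces NaN/Inf, and an explicit finiteness check fails fast on the predictions and the loss before the weights are corrupted.

```python
# Sketch only (generic PyTorch, not VoxFormer-specific): localise where NaNs
# first appear instead of letting training diverge silently.
import torch

torch.autograd.set_detect_anomaly(True)  # reports the backward op that produced NaN/Inf

def check_finite(name: str, t: torch.Tensor) -> None:
    """Fail fast if a tensor already holds NaN/Inf values."""
    if not torch.isfinite(t).all():
        raise RuntimeError(f"{name} contains NaN/Inf")

# Toy usage; in stage-2 training this check would wrap the preds and the loss.
preds = torch.randn(4, 10, requires_grad=True)
check_finite("preds", preds)
loss = preds.pow(2).mean()
check_finite("loss", loss)
loss.backward()
```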

@KSonPham

I have this problem too.

@RoboticsYimingLi
Contributor

Different machines behave differently. Could you try running it a few more times?

@KSonPham

Yes, for me the problem goes away when I set the number of workers to 0 (though not always), or when I run in a Docker environment (no errors whatsoever). A separate problem is that a large number of workers, such as the default of 4, fills up my 32 GB of memory.
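
For anyone trying the same workaround, a minimal sketch of forcing single-process data loading in plain PyTorch follows; in an mmdet/mmdet3d-style config (which VoxFormer builds on) the corresponding knob is usually `workers_per_gpu`, though that key name is an assumption here rather than something taken from this repo.

```python
# Sketch only: single-process data loading (num_workers=0) to rule out
# worker-related crashes and cut host-memory use from prefetching workers.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(8, 3, 32, 32))  # stand-in for the real dataset

loader = DataLoader(
    dataset,
    batch_size=1,
    num_workers=0,   # 0 = load batches in the main process, no worker subprocesses
    pin_memory=True,
)

for (batch,) in loader:
    pass  # training step would go here
```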

@ziming-liu

Is it a CUDA memory error? what(): CUDA error: an illegal memory access was encountered

@willemeng

I ran into a similar problem:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fbabb853a22 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x10aa3 (0x7fbac4010aa3 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7fbac4012147 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fbabb83d5a4 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0xa2822a (0x7fb952a2822a in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xa282c1 (0x7fb952a282c1 in /data/B221000559-XYJ/.conda/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

frame #25: + 0x29d90 (0x7fbaeb029d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: __libc_start_main + 0x80 (0x7fbaeb029e40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

Strangely, the error does not occur when I debug remotely, but it shows up as soon as I run from the remote server's terminal; only very occasionally does it run normally.
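
A minimal sketch of the debugging step the error message itself suggests: with `CUDA_LAUNCH_BLOCKING=1`, kernel launches become synchronous, so the stack trace points at the call that actually triggered the device-side assert (often an out-of-range index). The variable has to be set before CUDA is initialised.

```python
# Sketch only: make CUDA kernel launches synchronous so the stack trace points
# at the real faulting call. Must run before any CUDA work is done.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

x = torch.randn(4, device="cuda")
print(x.sum())  # with blocking launches, an assert/illegal access surfaces at the faulting op
```

Equivalently, the variable can be exported in the shell before launching the training script.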

@zzk785089755

zzk785089755 commented Jan 29, 2024

I also encountered this issue. Deleting the ./VoxFormer/deform_attn_3d directory and re-uploading it resolved the issue. I'm curious about the reason and hope the author can provide an explanation.
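
One plausible explanation (an assumption, not something confirmed by the authors) is a stale or partially transferred build of the custom deformable-attention extension; re-uploading the directory and rebuilding would then start from a clean state. A hedged smoke test of that hypothesis is sketched below; the module name is a placeholder, not the repo's actual API.

```python
# Hypothetical smoke test: verify the compiled extension imports, and note the
# PyTorch/CUDA versions it must have been built against. The module name below
# is a placeholder (assumption); use whatever ./VoxFormer/deform_attn_3d installs.
import torch

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)

try:
    import deform_attn_3d  # placeholder module name, not confirmed by the repo
    print("deformable-attention extension imported OK")
except ImportError as err:
    # Undefined-symbol or missing-module errors here usually indicate a stale or
    # mismatched build; rebuilding from a clean copy of the directory helps.
    print("extension import failed:", err)
```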
