Memory issue (?) : failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED #15

Open
ericj974 opened this issue Nov 13, 2017 · 2 comments


ericj974 commented Nov 13, 2017

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04 LTS
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.3.0
Python version: 2.7.12
CUDA/cuDNN version: 8.0/6.0.21
GPU model and memory: Nvidia Tegra X2

Describe the problem

I'm trying to run inference with ResNet-50 as the feature encoder (semantic segmentation with 2 classes). Depending on the memory load, I get the following error log sooner or later:

2017-11-10 05:10:43.484563: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: Invalid reduction dimension (-1146944963 for input with 4 dimension(s)
2017-11-10 05:10:44.646881: E tensorflow/stream_executor/cuda/cuda_driver.cc:1068] failed to synchronize the stop event: CUDA_ERROR_LAUNCH_FAILED
2017-11-10 05:10:44.646946: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Internal: error destroying CUDA event in context 0x30eb3d0: CUDA_ERROR_LAUNCH_FAILED
2017-11-10 05:10:44.646975: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Internal: error destroying CUDA event in context 0x30eb3d0: CUDA_ERROR_LAUNCH_FAILED
2017-11-10 05:10:44.647369: E tensorflow/stream_executor/cuda/cuda_blas.cc:551] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
2017-11-10 05:10:44.647478: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: slice index 1000163558 of dimension 0 out of bounds.
2017-11-10 05:10:44.647529: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: slice index 1021428837 of dimension 0 out of bounds.
2017-11-10 05:10:44.647573: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: slice index 1004492442 of dimension 0 out of bounds.

The error occurs whether or not a swapfile is in use. Once it has occurred, no further inference run is possible, even with a network that has a small memory footprint. Is this a memory issue, and if so, how can I deal with it?

For info, I get a similar error log on a TX1 (I tried both a source-compiled and a binary TensorFlow, with the same OS / TF configuration as above).
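
One quick sanity check on the log: the reduction dimension and slice indices it reports are far outside any plausible tensor shape. The snippet below is only a minimal sketch, assuming those numbers are 32-bit signed integers, that reinterprets their raw bits as float32. The helper name `int32_bits_as_float32` is made up for illustration and is not part of TensorFlow. If the results look like ordinary small floats, that would be consistent with (but not proof of) float data being read where integer indices were expected, which points toward corrupted memory rather than a bug in the model itself.

```python
import struct

# Values copied from the "Invalid reduction dimension" and "slice index"
# warnings in the log above.
suspect_values = [-1146944963, 1000163558, 1021428837, 1004492442]


def int32_bits_as_float32(value):
    """Reinterpret the 32 bits of a signed integer as an IEEE 754 float32.

    Hypothetical helper for illustration only.
    """
    return struct.unpack("<f", struct.pack("<i", value))[0]


# Map each suspicious integer to the float32 that shares its bit pattern.
reinterpreted = {v: int32_bits_as_float32(v) for v in suspect_values}

for value, as_float in reinterpreted.items():
    print("{:>12d} -> {:.6g}".format(value, as_float))
```

This only uses the standard library, so it can be run on the Jetson itself without touching the GPU.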

@LanYangXiXi

Hi Eric, I just ran into the same problem on a Jetson TX2. Have you solved this?

nvnnghia commented Apr 23, 2018

+1 @ericj974 @LanYangXiXi any update?
