Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR #1212

Closed
jmaicas opened this issue May 2, 2021 · 2 comments
jmaicas commented May 2, 2021

Hi, I keep running into the same issue that I tried to explain in #1200 (comment). I am working on Ubuntu 18.04 with an NVIDIA Quadro RTX 4000.

After trying the Docker image and getting the same issue, I am back to using the DLC-GPU environment, and I still get the same problem when I try to train.

Right now I have gone back to installing CUDA 10.0 (added to the PATH in my .bashrc file), because that is the version used by the DLC-GPU environment when I install it in miniconda.
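For reference, the .bashrc entries for this usually look something like the lines below. The /usr/local/cuda-10.0 path is the default install location and an assumption here; adjust it if CUDA was installed elsewhere.

```shell
# Hypothetical ~/.bashrc entries for CUDA 10.0 in the default install location.
# PATH makes nvcc visible; LD_LIBRARY_PATH lets TensorFlow find the CUDA libraries.
export PATH=/usr/local/cuda-10.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH
```

After editing .bashrc, run `source ~/.bashrc` (or open a new terminal) and confirm with `nvcc -V`.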

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

During training, I opted for allow_growth = True because I saw it suggested in this GitHub issue: tensorflow/tensorflow#24496

```python
deeplabcut.train_network(path_config_file, shuffle=1, allow_growth=True, gputouse=0, displayiters=50, saveiters=20000, maxiters=50000)
```

However, it still does not work. This is the output:

2021-05-02 22:16:22.447624: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 7018 MB memory) -> physical GPU (device: 0, name: NVIDIA Quadro RTX 4000, pci bus id: 0000:2d:00.0, compute capability: 7.5)
2021-05-02 22:16:22.449495: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55d6db8d6fd0 executing computations on platform CUDA. Devices:
2021-05-02 22:16:22.449516: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): NVIDIA Quadro RTX 4000, Compute Capability 7.5
2021-05-02 22:16:29.122700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2021-05-02 22:16:29.122776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-02 22:16:29.122783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2021-05-02 22:16:29.122790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2021-05-02 22:16:29.122871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7018 MB memory) -> physical GPU (device: 0, name: NVIDIA Quadro RTX 4000, pci bus id: 0000:2d:00.0, compute capability: 7.5)
2021-05-02 22:16:37.593240: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-05-02 22:16:37.598426: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

I also get this:

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node resnet_v1_50/conv1/Conv2D (defined at /home/jorge/miniconda3/envs/DLC-GPU/lib/python3.7/site-packages/deeplabcut/pose_estimation_tensorflow/nnet/pose_net.py:124) ]]
	 [[node sigmoid_cross_entropy_loss/value (defined at /home/jorge/miniconda3/envs/DLC-GPU/lib/python3.7/site-packages/deeplabcut/pose_estimation_tensorflow/nnet/pose_net.py:283) ]]

AlexEMG commented May 3, 2021

This is a TensorFlow installation issue: tensorflow/tensorflow#24828


jmaicas commented May 9, 2021

Thanks Alexander!

After going through the different options in the link you shared, this one worked for me:

Set the TF_FORCE_GPU_ALLOW_GROWTH environment variable to true. In your terminal, run:

$ export TF_FORCE_GPU_ALLOW_GROWTH=true

Curiously enough, setting allow_growth to True as a Python argument did not work for me, but setting it as a system environment variable did.

tensorflow/tensorflow#24828 (comment)
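For anyone who prefers to keep everything in Python rather than in the shell, the same environment variable can also be set from the script itself, as long as it is set before TensorFlow (and therefore deeplabcut) is imported. A minimal sketch using only the standard library:

```python
import os

# Equivalent of `export TF_FORCE_GPU_ALLOW_GROWTH=true`, set from Python.
# This must run before TensorFlow is imported, because TensorFlow reads
# the variable once during its GPU initialization.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import deeplabcut  # import only after the variable is set
```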
