
During evaluation phase of Pascal VOC dataset with DeepLabv3/xception_65, 'Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR' error is emitted #9661

Open
ssnirgudkar opened this issue Jan 23, 2021 · 2 comments
Labels: models:research, type:bug

@ssnirgudkar

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/deeplab

2. Describe the bug

As per the documentation, I am trying to run the Pascal VOC dataset on DeepLabV3, and I am getting an error during the evaluation phase (eval.py).
The error is as follows:
INFO:tensorflow:Starting evaluation at 2021-01-23-03:43:35
I0123 03:43:35.323748 140297229682496 evaluation.py:450] Starting evaluation at 2021-01-23-03:43:35
2021-01-23 03:43:36.424183: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-23 03:43:36.907355: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-01-23 03:43:36.918801: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

I do not know why this error is emitted.

3. Steps to reproduce

Comment out the initial call to train.py and then run:
sh local_test.sh (see the 'Testing the Installation' section)

4. Expected behavior

The eval.py script should execute successfully without any error.

5. Additional context

I understand that this may or may not be a DeepLabV3 code issue, but I do not know how to fix it.
While running train.py I had hit an 'out of memory' issue, which I fixed by reducing the batch size to 1. Now, however, I am not running train.py at all. I am only running eval.py, and I can see that GPU memory usage is maxing out (watched in a separate shell using nvidia-smi). How can I control the GPU memory usage? Which parameters in eval.py can be changed so that the memory footprint stays manageable?

If you think my issue is the same as tensorflow/tensorflow#24496 because of the NVIDIA GeForce RTX 2070 series, please let me know how to create a 'configuration object' and incorporate it into eval.py.
In train.py there is a configuration object near the beginning, but in eval.py there is none, and I do not know where to hook one up if I create it.
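
For illustration, this is roughly the kind of session configuration I have in mind (a TF 1.x sketch of my own, not taken from eval.py; I do not know where eval.py would accept such an object):

import tensorflow as tf

# Sketch of a TF 1.x session configuration that limits GPU memory use.
# allow_growth makes TensorFlow allocate GPU memory on demand instead of
# reserving nearly all of it when the session is created.
session_config = tf.ConfigProto(allow_soft_placement=True)
session_config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory TensorFlow may claim:
# session_config.gpu_options.per_process_gpu_memory_fraction = 0.7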

6. System information

  • OS Platform and Distribution: Linux Ubuntu 18.04
  • TensorFlow installed from (source or binary): Source
  • TensorFlow version (use command below): 1.15
  • Python version: 2.7
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source): 7.5.0
  • CUDA/cuDNN version: CUDA: 10.2, cuDNN:7.6.5.32-1
  • GPU model and memory: GeForce RTX 2070 SUPER, 8GB
  • I have created a Docker image based on nvidia/cuda:10.0-base-ubuntu18.04 and have built TF 1.15 in it.
ssnirgudkar added the models:research and type:bug labels on Jan 23, 2021
@ssnirgudkar
Author

Is there any update on this issue?

@yuxiazff

You can change line 214 in eval.py as follows:

# Build a session config that grows GPU memory on demand instead of
# pre-allocating it, then pass it to the evaluation loop.
session_config = tf.ConfigProto(
    allow_soft_placement=True, log_device_placement=False)
session_config.gpu_options.allow_growth = True

contrib_training.evaluate_repeatedly(
    checkpoint_dir=FLAGS.checkpoint_dir,
    master=FLAGS.master,
    eval_ops=list(metrics_to_updates.values()),
    max_number_of_evaluations=num_eval_iters,
    hooks=hooks,
    config=session_config,
    eval_interval_secs=FLAGS.eval_interval_secs)
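
For what it is worth, this is the same allow_growth workaround discussed in tensorflow/tensorflow#24496: by default TF 1.x pre-allocates most of the GPU memory when the session is created, and on RTX 20xx cards that can leave cuDNN unable to initialize its handle, which surfaces as CUDNN_STATUS_INTERNAL_ERROR. If editing eval.py is inconvenient, setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true before launching local_test.sh should (in TF 1.14 and later) have the same effect.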
