
During evaluation phase of Pascal VOC dataset with DeepLabv3/xception_65, 'Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR' error is emitted #9661

Open
ssnirgudkar opened this issue Jan 23, 2021 · 2 comments
Labels: models:research, type:bug

@ssnirgudkar

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/research/deeplab

2. Describe the bug

As per the documentation, I am trying to run the Pascal VOC dataset on DeepLabV3, and I am getting an error during the evaluation phase (eval.py).
The error is as follows:
INFO:tensorflow:Starting evaluation at 2021-01-23-03:43:35
I0123 03:43:35.323748 140297229682496 evaluation.py:450] Starting evaluation at 2021-01-23-03:43:35
2021-01-23 03:43:36.424183: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-23 03:43:36.907355: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-01-23 03:43:36.918801: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

I do not know why this error is emitted.

3. Steps to reproduce

Comment out the initial call to train.py and then run:
sh local_test.sh (see the 'Testing the Installation' section)

4. Expected behavior

The eval.py script should execute successfully without any error.

5. Additional context

I understand that this may or may not be a DeepLabV3 code issue, but I do not know how to fix it.
While running train.py I had hit an 'out of memory' issue, which I fixed by reducing the batch size to 1. Now, however, I am not running train.py at all. I am only running eval.py, and I can see that GPU memory usage is maxing out (watched in a separate shell using nvidia-smi). How can I control the GPU memory usage? Which parameters in eval.py can be changed so that the memory footprint stays manageable?

If you think my issue is the same as tensorflow/tensorflow#24496 because of the NVIDIA GeForce RTX 2070 series, please let me know how to create a 'configuration object' and incorporate it into eval.py.
In train.py there is a configuration object near the beginning, but in eval.py there is none, and I do not know where to hook one up if I create it.
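
For illustration, this is roughly the kind of session configuration I have in mind (a TF 1.x sketch of my own, not taken from eval.py; I do not know where eval.py would accept such an object):

import tensorflow as tf

# Sketch of a TF 1.x session configuration that limits GPU memory use.
# allow_growth makes TensorFlow allocate GPU memory on demand instead of
# reserving nearly all of it when the session is created.
session_config = tf.ConfigProto(allow_soft_placement=True)
session_config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory TensorFlow may claim:
# session_config.gpu_options.per_process_gpu_memory_fraction = 0.7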

6. System information

  • OS Platform and Distribution: Linux Ubuntu 18.04
  • TensorFlow installed from (source or binary): Source
  • TensorFlow version (use command below): 1.15
  • Python version: 2.7
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source): 7.5.0
  • CUDA/cuDNN version: CUDA: 10.2, cuDNN:7.6.5.32-1
  • GPU model and memory: GeForce RTX 2070 SUPER, 8GB
  • I have created a Docker image based on nvidia/cuda:10.0-base-ubuntu18.04 and have built TF 1.15 in it.
ssnirgudkar added the models:research and type:bug labels on Jan 23, 2021
@ssnirgudkar
Author

Is there any update on this issue?

@yuxiazff

You can change line 214 in eval.py as follows:

# Build a session config that grows GPU memory on demand instead of
# pre-allocating it, then pass it to the evaluation loop.
session_config = tf.ConfigProto(
    allow_soft_placement=True, log_device_placement=False)
session_config.gpu_options.allow_growth = True

contrib_training.evaluate_repeatedly(
    checkpoint_dir=FLAGS.checkpoint_dir,
    master=FLAGS.master,
    eval_ops=list(metrics_to_updates.values()),
    max_number_of_evaluations=num_eval_iters,
    hooks=hooks,
    config=session_config,
    eval_interval_secs=FLAGS.eval_interval_secs)
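
For what it is worth, this is the same allow_growth workaround discussed in tensorflow/tensorflow#24496: by default TF 1.x pre-allocates most of the GPU memory when the session is created, and on RTX 20xx cards that can leave cuDNN unable to initialize its handle, which surfaces as CUDNN_STATUS_INTERNAL_ERROR. If editing eval.py is inconvenient, setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true before launching local_test.sh should (in TF 1.14 and later) have the same effect.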
