CUDNN_STATUS_INTERNAL_ERROR on GTX 1660 TI #27144

moodoki · 2019-03-26T11:23:38Z

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes, simple dummy code

import tensorflow as tf

def main():
    config = tf.ConfigProto()
    
    sess = tf.Session(config=config)

    x = tf.random.uniform([16, 64, 64, 3])
    net = x
    net = tf.layers.conv2d(net, filters=16, kernel_size=3, padding='same', strides=2)
    net = tf.layers.flatten(net)
    net = tf.layers.dense(net, 1)


    sess.run(tf.global_variables_initializer())
    
    out = sess.run(net)
    print(out)

if __name__ == '__main__':
    main()

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
TensorFlow installed from (source or binary): Tested with both a source build and the binary package
TensorFlow version (use command below): v1.13.1-0-g6612da8951
Python version: Tested on both 3.5 and 3.6
Bazel version (if compiling from source): 0.19.2
GCC/Compiler version (if compiling from source): gcc-6 (Ubuntu 6.5.0-2ubuntu1~18.04) 6.5.0 20181026
CUDA/cuDNN version: Tested with 10.0/7.5 and 10.1/7.5
GPU model and memory: GTX 1660 TI 6GB

You can collect some of this information using our environment capture script
You can also obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the current behavior
Using any convolution layer fails. The same error occurs with both versions of CUDA. The ~/.nv directory was cleared between runs. The same error occurs in Docker images as well. cuDNN was verified to be working correctly with simple cuDNN programs (e.g. this)

Describe the expected behavior
Code should print an array of 16 numbers
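
To make the expected output concrete, here is a small sketch of the shape arithmetic for the toy network in this report, written in plain Python with no TensorFlow dependency. The helper names are hypothetical and exist only for illustration. It assumes NHWC layout and that padding='same' yields ceil(input / stride) along each spatial dimension.

```python
import math


def conv2d_same_output_shape(input_shape, filters, strides):
    """Shape after a 2-D convolution with padding='same' on an NHWC input."""
    batch, height, width, _channels = input_shape
    return (batch, math.ceil(height / strides), math.ceil(width / strides), filters)


def flatten_shape(shape):
    """Shape after flattening everything except the batch dimension."""
    features = 1
    for dim in shape[1:]:
        features *= dim
    return (shape[0], features)


def dense_shape(shape, units):
    """Shape after a dense layer applied to a [batch, features] input."""
    return (shape[0], units)


x = (16, 64, 64, 3)                                        # tf.random.uniform([16, 64, 64, 3])
conv = conv2d_same_output_shape(x, filters=16, strides=2)  # (16, 32, 32, 16)
flat = flatten_shape(conv)                                 # (16, 16384)
out = dense_shape(flat, units=1)                           # (16, 1)
print(conv, flat, out)
```

So a successful run prints 16 values, one per batch element, as a 16 x 1 array.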

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs

2019-03-26 11:14:43.230340: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-26 11:14:43.330837: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-26 11:14:43.331452: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3d4a090 executing computations on platform CUDA. Devices:
2019-03-26 11:14:43.331472: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1660 Ti, Compute Capability 7.5
2019-03-26 11:14:43.356898: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3400255000 Hz
2019-03-26 11:14:43.357161: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3db1d00 executing computations on platform Host. Devices:
2019-03-26 11:14:43.357190: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-03-26 11:14:43.357646: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1660 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.8
pciBusID: 0000:01:00.0
totalMemory: 5.77GiB freeMemory: 5.19GiB
2019-03-26 11:14:43.357674: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-26 11:14:43.358609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-26 11:14:43.358630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-03-26 11:14:43.358640: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-03-26 11:14:43.359045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5016 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
WARNING:tensorflow:From simpletf_test/tf_convnet.py:11: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv2d instead.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From simpletf_test/tf_convnet.py:12: flatten (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.flatten instead.
WARNING:tensorflow:From simpletf_test/tf_convnet.py:13: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
2019-03-26 11:14:43.819308: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-26 11:14:44.479283: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-03-26 11:14:44.498430: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node conv2d/Conv2D}}]]
	 [[{{node dense/BiasAdd}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "simpletf_test/tf_convnet.py", line 22, in <module>
    main()
  File "simpletf_test/tf_convnet.py", line 18, in main
    out = sess.run(net)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node conv2d/Conv2D (defined at simpletf_test/tf_convnet.py:11) ]]
	 [[node dense/BiasAdd (defined at simpletf_test/tf_convnet.py:13) ]]

Caused by op 'conv2d/Conv2D', defined at:
  File "simpletf_test/tf_convnet.py", line 22, in <module>
    main()
  File "simpletf_test/tf_convnet.py", line 11, in main
    net = tf.layers.conv2d(net, filters=16, kernel_size=3)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/convolutional.py", line 424, in conv2d
    return layer.apply(inputs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1227, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 530, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/layers/convolutional.py", line 194, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 966, in __call__
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 591, in __call__
    return self.call(inp, filter)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 208, in __call__
    name=self.name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node conv2d/Conv2D (defined at simpletf_test/tf_convnet.py:11) ]]
	 [[node dense/BiasAdd (defined at simpletf_test/tf_convnet.py:13) ]]


Atom-101 · 2019-03-27T05:51:48Z

I am having the same issue on an RTX 2080 with TF 2.0, #27141

ymodak · 2019-04-02T21:22:01Z

Duplicate #24496
Can you please try the solution that worked for other users in the above issue? Thanks!

moodoki · 2019-04-03T17:36:38Z

Looks like setting allow_growth=True allows the toy program to run. Interestingly, using TF1.12 in the docker container works without this setting.

Is this the way forward and is there a way to set this by default? Seems like it will require me to touch a large number of files in my unit tests...

Thanks!
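
One way to avoid touching many test files is to create sessions through a single shared helper. The sketch below uses the TF 1.x calls already shown in this report (tf.ConfigProto, tf.Session). The names with_allow_growth and make_session are hypothetical and not part of TensorFlow.

```python
def with_allow_growth(config):
    """Enable on-demand GPU memory allocation on a ConfigProto and return it."""
    config.gpu_options.allow_growth = True
    return config


def make_session():
    """Hypothetical helper: the one place where test code creates a session."""
    # Imported here so that merely importing this helper module does not load TensorFlow.
    import tensorflow as tf  # TF 1.x API, as used in the report

    return tf.Session(config=with_allow_growth(tf.ConfigProto()))
```

Tests would then call make_session() instead of tf.Session(), so the workaround lives in one file and can be removed later in one place.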

ymodak · 2019-04-03T21:27:26Z

I am glad the hack worked for you. Unfortunately, we currently don't have a feature to set allow_growth=True by default. I will close this issue since it's resolved. Feel free to reopen if you have any further questions. Thanks!
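
For readers on newer releases: the TensorFlow GPU guide documents an environment variable, TF_FORCE_GPU_ALLOW_GROWTH, that requests the same behavior without code changes. Whether the 1.13.1 build discussed in this thread honors it is not confirmed here and should be treated as an assumption to verify. A minimal sketch, with a hypothetical helper name, that could be placed in a shared test setup file:

```python
import os


def force_allow_growth():
    """Request on-demand GPU memory allocation via the environment.

    Must run before TensorFlow initializes the GPU. Support depends on the
    TensorFlow version (assumption: the installed version reads this variable).
    """
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return os.environ["TF_FORCE_GPU_ALLOW_GROWTH"]


force_allow_growth()
```

If the installed version ignores the variable, setting allow_growth on the session config remains the workaround confirmed in this thread.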


sloanyyc · 2019-04-27T05:10:52Z

In the Python keras package, modify the file tensorflow_backend.py, inside get_session():

def get_session():
        # (existing code unchanged)
        # added by sloan to fix CUDNN_STATUS_INTERNAL_ERROR
        config.gpu_options.allow_growth = True
        _SESSION = tf.Session(config=config)

achandraa self-assigned this Mar 28, 2019

achandraa added type:bug Bug comp:keras Keras related issues labels Mar 28, 2019

achandraa assigned ymodak and unassigned achandraa Apr 2, 2019

achandraa added comp:ops OPs related issues and removed comp:keras Keras related issues labels Apr 2, 2019

ymodak added comp:gpu GPU related issues and removed comp:ops OPs related issues labels Apr 2, 2019

ymodak added the stat:awaiting response Status - Awaiting response from author label Apr 2, 2019

ymodak closed this as completed Apr 3, 2019
