Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUDNN_STATUS_INTERNAL_ERROR on GTX 1660 TI #27144

Closed
moodoki opened this issue Mar 26, 2019 · 6 comments
Closed

CUDNN_STATUS_INTERNAL_ERROR on GTX 1660 TI #27144

moodoki opened this issue Mar 26, 2019 · 6 comments
Assignees
Labels
comp:gpu GPU related issues stat:awaiting response Status - Awaiting response from author type:bug Bug

Comments

@moodoki
Copy link

moodoki commented Mar 26, 2019

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes, simple dummy code
import tensorflow as tf

def main():
    config = tf.ConfigProto()
    
    sess = tf.Session(config=config)

    x = tf.random.uniform([16, 64, 64, 3])
    net = x
    net = tf.layers.conv2d(net, filters=16, kernel_size=3, padding='same', strides=2)
    net = tf.layers.flatten(net)
    net = tf.layers.dense(net, 1)


    sess.run(tf.global_variables_initializer())
    
    out = sess.run(net)
    print(out)

if __name__ == '__main__':
    main()
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • TensorFlow installed from (source or binary): Both source and compiled tested
  • TensorFlow version (use command below): v1.13.1-0-g6612da8951
  • Python version: Tested on both 3.5 and 3.6
  • Bazel version (if compiling from source): 0.19.2
  • GCC/Compiler version (if compiling from source): gcc-6 (Ubuntu 6.5.0-2ubuntu1~18.04) 6.5.0 20181026
  • CUDA/cuDNN version: Tested with 10.0/7.5 and 10.1/7.5
  • GPU model and memory: GTX 1660 TI 6GB

You can collect some of this information using our environment capture script
You can also obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the current behavior
Using any convolution layers fails. Same error with all both versions of CUDA. ~/.nv directory cleared between runs. Same error in docker images as well. CUDNN verified to be working correctly with simple CUDNN programs (e.g. this)

Describe the expected behavior
Code should print an array of 16 numbers

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs

2019-03-26 11:14:43.230340: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-26 11:14:43.330837: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-26 11:14:43.331452: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3d4a090 executing computations on platform CUDA. Devices:
2019-03-26 11:14:43.331472: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1660 Ti, Compute Capability 7.5
2019-03-26 11:14:43.356898: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3400255000 Hz
2019-03-26 11:14:43.357161: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3db1d00 executing computations on platform Host. Devices:
2019-03-26 11:14:43.357190: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-03-26 11:14:43.357646: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1660 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.8
pciBusID: 0000:01:00.0
totalMemory: 5.77GiB freeMemory: 5.19GiB
2019-03-26 11:14:43.357674: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-26 11:14:43.358609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-26 11:14:43.358630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-03-26 11:14:43.358640: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-03-26 11:14:43.359045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5016 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
WARNING:tensorflow:From simpletf_test/tf_convnet.py:11: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv2d instead.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From simpletf_test/tf_convnet.py:12: flatten (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.flatten instead.
WARNING:tensorflow:From simpletf_test/tf_convnet.py:13: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
2019-03-26 11:14:43.819308: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-03-26 11:14:44.479283: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-03-26 11:14:44.498430: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node conv2d/Conv2D}}]]
	 [[{{node dense/BiasAdd}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "simpletf_test/tf_convnet.py", line 22, in <module>
    main()
  File "simpletf_test/tf_convnet.py", line 18, in main
    out = sess.run(net)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node conv2d/Conv2D (defined at simpletf_test/tf_convnet.py:11) ]]
	 [[node dense/BiasAdd (defined at simpletf_test/tf_convnet.py:13) ]]

Caused by op 'conv2d/Conv2D', defined at:
  File "simpletf_test/tf_convnet.py", line 22, in <module>
    main()
  File "simpletf_test/tf_convnet.py", line 11, in main
    net = tf.layers.conv2d(net, filters=16, kernel_size=3)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/convolutional.py", line 424, in conv2d
    return layer.apply(inputs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1227, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/layers/base.py", line 530, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/layers/convolutional.py", line 194, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 966, in __call__
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 591, in __call__
    return self.call(inp, filter)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 208, in __call__
    name=self.name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node conv2d/Conv2D (defined at simpletf_test/tf_convnet.py:11) ]]
	 [[node dense/BiasAdd (defined at simpletf_test/tf_convnet.py:13) ]]

@Atom-101
Copy link

I am having the same issue on RTX 2080 and Tf 2.0, #27141

@achandraa achandraa self-assigned this Mar 28, 2019
@achandraa achandraa added type:bug Bug comp:keras Keras related issues labels Mar 28, 2019
@achandraa achandraa assigned ymodak and unassigned achandraa Apr 2, 2019
@achandraa achandraa added comp:ops OPs related issues and removed comp:keras Keras related issues labels Apr 2, 2019
@ymodak ymodak added comp:gpu GPU related issues and removed comp:ops OPs related issues labels Apr 2, 2019
@ymodak
Copy link
Contributor

ymodak commented Apr 2, 2019

Duplicate #24496
Can you please try the solution that worked for other users in the above issue? Thanks!

@ymodak ymodak added the stat:awaiting response Status - Awaiting response from author label Apr 2, 2019
@moodoki
Copy link
Author

moodoki commented Apr 3, 2019

Looks like setting allow_growth=True allows the toy program to run. Interestingly, using TF1.12 in the docker container works without this setting.

Is this the way forward and is there a way to set this by default? Seems like it will require me to touch a large number of files in my unit tests...

Thanks!

@ymodak
Copy link
Contributor

ymodak commented Apr 3, 2019

I am glad it hack worked for you. Unfortunately, Currently we don't have a feature to set allow_growth=True by default. I will close this issue since its resolved. Feel free to reopen if have any further questions. Thanks!

@ymodak ymodak closed this as completed Apr 3, 2019
@tensorflow-bot
Copy link

tensorflow-bot bot commented Apr 3, 2019

Are you satisfied with the resolution of your issue?
Yes
No

@sloanyyc
Copy link

in python keras package modify file tensorflow_backend.py
def get_session():

add by sloan fix CUDNN_STATUS_INTERNAL_ERROR

        config.gpu_options.allow_growth = True
        _SESSION = tf.Session(config=config)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:gpu GPU related issues stat:awaiting response Status - Awaiting response from author type:bug Bug
Projects
None yet
Development

No branches or pull requests

5 participants