
Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR #24496

Closed
michaelmyc opened this issue Dec 21, 2018 · 186 comments
Labels: comp:gpu, stale, stat:awaiting response, TF 2.0, type:bug

Comments

@michaelmyc

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes and No (described below)
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Manjaro
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): tf-nightly-gpu (Dec 19, r1.13)
  • TensorFlow version (use command below): 1.13.0-dev20181219
  • Python version: 3.7.1
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: CUDA 10 with cuDNN 7.4.1
  • GPU model and memory: RTX 2070 8GB

Describe the current behavior
I'm running a CNN model on MNIST. When running on the GPU, I encounter:
2018-12-20 20:09:13.644176: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

I did some digging and realized that it is a memory issue (which shouldn't be the case, as I have 32GB of RAM and 64GB of swap). I ran htop while running the model and had 20+GB free, which is more than enough to fit the 8GB of VRAM mappings.

Using gpu_options.allow_growth = True gets the model to work properly, and setting os.environ['CUDA_VISIBLE_DEVICES'] = '-1' also works. This means that I AM facing a memory issue, but I don't see how.
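
For reference, a minimal sketch of the two workarounds (TF 1.x API; the same options appear in the reproduction code below):

import os
import tensorflow as tf

# Workaround 1: allocate GPU memory on demand instead of grabbing
# almost all of it up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

# Workaround 2: hide the GPU entirely so TensorFlow falls back to the CPU
# (must be set before TensorFlow initializes its devices)
# os.environ['CUDA_VISIBLE_DEVICES'] = '-1'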

Also, using gpu_options.allow_growth = True does not fix the same issue when trying to run the tensorflow/models/official/mnist/ model, which should behave similarly to my code.

Code to reproduce the issue

import os
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import math
import time
# Killing optional CPU driver warnings
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
# os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
tf.logging.set_verbosity(tf.logging.ERROR)


class Model:

    def __init__(self, image, label):
        """
        A Model class contains a computational graph that classifies images
        to predictions. Each of its methods builds part of the graph
        on Model initialization. Do not modify the constructor, as doing so
        would break the autograder. You may, however, add class variables
        to use in your graph-building, e.g. the learning rate.

        image: the input image to the computational graph as a tensor
        label: the correct label of an image as a tensor
        prediction: the output prediction of the computational graph,
                    produced by self.forward_pass()
        optimize: the model's optimizing tensor produced by self.optimizer()
        loss: the model's loss produced by computing self.loss_function()
        accuracy: the model's prediction accuracy
        """
        self.image = image
        self.label = label

        # TO-DO: Add any class variables you want to use.

        self.prediction = self.forward_pass()
        self.loss = self.loss_function()
        self.optimize = self.optimizer()
        self.accuracy = self.accuracy_function()

    def forward_pass(self):
        """
        Predicts a label given an image using convolution layers

        :return: the prediction as a tensor
        """
        filter_1 = tf.Variable(tf.truncated_normal([3, 3, 1, 8], stddev=0.1))
        conv_1 = tf.nn.conv2d(self.image, filter_1, [1, 1, 1, 1], "SAME")

        reshaped = tf.reshape(conv_1, shape=[50, -1])

        L1 = reshaped.shape[1].value
        L2 = 500
        W1 = tf.Variable(tf.random_normal([L1, L2], mean=0, stddev=0.01))
        b1 = tf.Variable(tf.random_normal([L2], mean=0, stddev=0.01))
        relu_1 = tf.nn.relu(tf.matmul(reshaped, W1) + b1)

        W2 = tf.Variable(tf.random_normal([L2, 10], mean=0, stddev=0.01))
        b2 = tf.Variable(tf.random_normal([10], mean=0, stddev=0.01))
        logits = tf.nn.relu(tf.matmul(relu_1, W2) + b2)
        return logits

    def loss_function(self):
        """
        Calculates the model cross-entropy loss

        :return: the loss of the model as a tensor
        """
        loss = tf.losses.softmax_cross_entropy(onehot_labels=self.label, logits=self.prediction)
        return loss

    def optimizer(self):
        """
        Optimizes the model loss using a gradient descent optimizer

        :return: the optimizer as a tensor
        """
        learning_rate = 0.1
        sgd = tf.train.GradientDescentOptimizer(learning_rate)
        train = sgd.minimize(self.loss)
        return train

    def accuracy_function(self):
        """
        Calculates the model's prediction accuracy by comparing
        predictions to correct labels – no need to modify this

        :return: the accuracy of the model as a tensor
        """
        correct_prediction = tf.equal(tf.argmax(self.prediction, 1),
                                      tf.argmax(self.label, 1))
        return tf.reduce_mean(tf.cast(correct_prediction, tf.float32))


def main():
    t_start = time.time()

    mnist = input_data.read_data_sets("data/mnist/", one_hot=True)
    batch_sz = 50
    batch = 2000

    inputs = tf.placeholder(shape=[batch_sz, 28, 28, 1], dtype=tf.float32)
    labels = tf.placeholder(shape=[batch_sz, 10], dtype=tf.float32)

    model = Model(inputs, labels)

    session_config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
    sess = tf.Session(config=session_config)

    # sess = tf.Session()

    sess.run(tf.global_variables_initializer())
    for i in range(batch):
        next_image, next_label = mnist.train.next_batch(batch_sz)
        next_image = next_image.reshape((batch_sz, 28, 28, 1))
        sess.run(model.optimize, feed_dict={inputs: next_image, labels: next_label})

    acc, test_images, test_labels = 0, mnist.test.images, mnist.test.labels
    test_batch = math.ceil(len(test_images) / batch_sz)
    for i in range(test_batch):
        batch_images = test_images[i * batch_sz: (i + 1) * batch_sz]
        batch_images = batch_images.reshape((batch_sz, 28, 28, 1))
        batch_labels = test_labels[i * batch_sz: (i + 1) * batch_sz]
        acc += sess.run(model.accuracy, feed_dict={inputs: batch_images, labels: batch_labels})
    acc /= test_batch
    print(acc)

    print(time.time() - t_start, 'seconds')

    return


if __name__ == '__main__':
    main()
@va-andrew

I've been running into the same issue with the same GPU: "CUDNN_STATUS_INTERNAL_ERROR".

RTX 2070 GPU
CUDA 10
cuDNN 7.4.2
Ubuntu 18.04
tf-nightly-gpu (r1.13, Jan 13)
Python 3.6.7

2019-01-15 05:01:03.503415: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcublas.so.10.0 locally
2019-01-15 05:01:03.752563: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcudnn.so.7 locally
2019-01-15 05:01:04.905618: E tensorflow/stream_executor/cuda/cuda_dnn.cc:493] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-01-15 05:01:04.908147: E tensorflow/stream_executor/cuda/cuda_dnn.cc:493] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-01-15 05:01:04.908191: W tensorflow/core/framework/op_kernel.cc:1412] OP_REQUIRES failed at conv_ops_fused.cc:801 : Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

@dennisjay

dennisjay commented Jan 17, 2019

I have the same problem, running on:

RTX2080 GPU
CUDA 10
cudnn 7.4.2

I tried the following TF versions: tf-nightly-gpu and a self-compiled version from master (060b6e3).
I found out that it's possible to set the following environment variables to get further debug info:

CUDNN_LOGINFO_DBG=1;
CUDNN_LOGDEST_DBG=stdout
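
(From Python, a minimal sketch would be to set them before TensorFlow loads cuDNN, assuming you launch your script directly rather than exporting them in the shell:)

import os

# cuDNN reads these variables when the library is loaded,
# so set them before importing TensorFlow
os.environ['CUDNN_LOGINFO_DBG'] = '1'
os.environ['CUDNN_LOGDEST_DBG'] = 'stdout'

import tensorflow as tf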

Then I get the following error:

I0117 14:11:24.441819 140433563125568 basic_session_run_hooks.py:594] Saving checkpoints for 0 into /tmp/mnist/model.ckpt.
2019-01-17 14:11:25.916269: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcublas.so.10.0 locally

I! CuDNN (v7402) function cudnnCreate() called:
i! Time: 2019-01-17T14:11:26.079184 (0d+0h+0m+0s since start)
i! Process=29255; Thread=29356; GPU=NULL; Handle=NULL; StreamId=NULL.

2019-01-17 14:11:26.079151: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcudnn.so.7 locally

I! CuDNN (v7402) function cudnnCreate() called:
i! Time: 2019-01-17T14:11:26.571897 (0d+0h+0m+0s since start)
i! Process=29255; Thread=29356; GPU=NULL; Handle=NULL; StreamId=NULL.

2019-01-17 14:11:26.571858: E tensorflow/stream_executor/cuda/cuda_dnn.cc:493] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-01-17 14:11:26.579375: E tensorflow/stream_executor/cuda/cuda_dnn.cc:493] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

I! CuDNN (v7402) function cudnnCreate() called:
i! Time: 2019-01-17T14:11:26.579803 (0d+0h+0m+0s since start)
i! Process=29255; Thread=29356; GPU=NULL; Handle=NULL; StreamId=NULL.

2019-01-17 14:11:26.585818: E tensorflow/stream_executor/cuda/cuda_dnn.cc:493] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-01-17 14:11:26.585850: W ./tensorflow/stream_executor/stream.h:2109] attempting to perform DNN operation using StreamExecutor without DNN support
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1320, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1408, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
  [[{{node Discriminator_1/Conv/Conv2D}}]]
  [[train/discriminator_train/train_op/control_dependency/_569]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dj/projects/gan/tf_models/research/gan/mnist/train.py", line 151, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/dj/projects/gan/tf_models/research/gan/mnist/train.py", line 147, in main
    get_hooks_fn=tfgan.get_joint_train_hooks())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/gan/python/train.py", line 1200, in gan_train
    config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/training/python/training/training.py", line 546, in train
    loss = session.run(train_op, run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 693, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1188, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1287, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1272, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1336, in run
    feed_dict, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1362, in _call_hook_before_run
    request = hook.before_run(run_context)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/gan/python/train.py", line 1061, in before_run
    run_context.session.run(self._train_ops)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 930, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1153, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1329, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1349, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
  [[node Discriminator_1/Conv/Conv2D (defined at /home/dj/projects/gan/tf_models/research/gan/mnist/networks.py:152) ]]
  [[train/discriminator_train/train_op/control_dependency/_569]]

Errors may have originated from an input operation.
Input Source operations connected to node Discriminator_1/Conv/Conv2D:
inputs/batch/n (defined at /home/dj/projects/gan/tf_models/research/gan/mnist/data_provider.py:67)

Original stack trace for 'Discriminator_1/Conv/Conv2D':
  File "/home/dj/projects/gan/tf_models/research/gan/mnist/train.py", line 151, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/dj/projects/gan/tf_models/research/gan/mnist/train.py", line 87, in main
    [FLAGS.batch_size, FLAGS.noise_dims]))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/gan/python/train.py", line 118, in gan_model
    discriminator_real_outputs = discriminator_fn(real_data, generator_inputs)
  File "/home/dj/projects/gan/tf_models/research/gan/mnist/networks.py", line 176, in unconditional_discriminator
    net = _discriminator_helper(img, False, None, weight_decay)
  File "/home/dj/projects/gan/tf_models/research/gan/mnist/networks.py", line 152, in _discriminator_helper
    net = layers.conv2d(img, 64, [4, 4], stride=2)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1155, in convolution2d
    conv_dims=2)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 182, in func_with_args
    return func(*args, **current_args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/layers/python/layers/layers.py", line 1058, in convolution
    outputs = layer.apply(inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1228, in apply
    return self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/layers/base.py", line 531, in call
    outputs = super(Layer, self).call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 564, in call
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/convolutional.py", line 196, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 966, in call
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 591, in call
    return self.call(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 208, in call
    name=self.name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_ops.py", line 1578, in conv2d
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1040, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 501, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

Any ideas, anybody? I'm about to reinstall my complete environment :-(

@michaelmyc
Author

Try compiling r1.13 from source. It will take a long time, but it should fix your problem. At least it fixed mine.

@va-andrew

I did try compiling from source, but I ran into the same issue. I was finally able to fix my problem by setting config.gpu_options.allow_growth = True.

@nickovs

nickovs commented Jan 22, 2019

I've been having the same issue (on an RTX 2060, Ubuntu 18.04, Python 3.6.7, CUDA 10.0.130, cuDNN 7.4.2, Tensorflow 1.13.0-rc0 from source). Thanks to @va-andrew's suggestion I have it working with the allow_growth option set.

FWIW, in the course of searching for solutions it seems that this is a common problem with the RTX series (although it might be a general problem with CUDA 10.0, since the new cards don't support the older versions). It would be great if the defaults could be updated in the 1.13 release so that special options don't need to be set for these cards.

@newhouseb

Chiming in to say I also experienced this under the following configuration:

Tensorflow Docker GPU containers with stable releases of everything don't work either (they straight up segfault rather than report CUDNN_STATUS_INTERNAL_ERROR).

Curiously, things work fine on Windows 10 with Tensorflow v1.12!

And as others have reported, setting allow_growth allows things to run properly.

@nkdsoft

nkdsoft commented Jan 29, 2019

Same problem here.

  • RTX 2070
  • Ubuntu 18.04
  • cuDNN 7.4.2 (but I have tried other, older versions with no luck)
  • Tensorflow 1.13.0-dev20190125 (also tried Tensorflow 1.12 compiled with Cuda 10)

And as others have reported, setting allow_growth=True allows things to run.

@ymodak ymodak added the comp:gpu label Jan 31, 2019
@ymodak
Contributor

ymodak commented Jan 31, 2019

Closing this issue since it's resolved. Thanks!

@ymodak ymodak closed this as completed Jan 31, 2019
@nickovs

nickovs commented Jan 31, 2019

@ymodak Can you please reference the PR that fixed this bug?

@peterroelants

I have a similar issue with tf-nightly-gpu-2.0-preview on the RTX 2080

@hoermannpaul

Same issue with an RTX 2080; I spent two days recompiling and bug hunting until I found this fix.
(The allow_growth=true thing fixed it.)

You made my day

@oscarlinux

How do you actually set allow_growth=true? I have tf-nightly-gpu-2.0-preview and tried:

import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)

but get this error:

AttributeError                            Traceback (most recent call last)
in <module>()
      1 import tensorflow as tf
----> 2 config = tf.ConfigProto()

AttributeError: module 'tensorflow' has no attribute 'ConfigProto'

How can I set allow_growth in tensorflow 2.0?

@oscarlinux

OK, I made it work in tf-nightly-gpu-2.0-preview and an IPython notebook by adding this to my code:

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
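
(Side note: recent 2.0 nightlies also expose a native equivalent, so the compat shim isn't strictly required; a minimal sketch, assuming tf.config.experimental is available in your build:)

import tensorflow as tf

# Enable on-demand GPU memory allocation without the v1 compat layer;
# this must run before the GPUs are initialized
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)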

@sandacn

sandacn commented Mar 27, 2019

Same issue; with gpu_options.allow_growth = True the issue is fixed.

@diego898

diego898 commented Apr 1, 2019

@newhouseb how/where did you set that to true for all benchmarks? Was it an easy change?

@samhodge

samhodge commented Apr 6, 2019

Is blanket allow_growth a solution?

It is turned off by default for a reason; see
https://www.tensorflow.org/guide/using_gpu#allowing_gpu_memory_growth

In my program, memory management is important.

I would like to limit the amount of GPU memory used by TF because, in my graphics application, GPU memory will be used for other things, and keeping TF in a limited space is important to prevent out-of-memory errors.
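
For what it's worth, a minimal sketch of that kind of cap in the Python TF 1.x API (the 0.5 fraction is just a placeholder value):

import tensorflow as tf

# Cap TensorFlow at a fixed fraction of total GPU memory so the rest
# of the card stays free for other GPU work
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))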

@samhodge

samhodge commented Apr 7, 2019

I am working in C++ under Windows.

Adding the allow_growth option results in an OOM error.

Without this line of code the model runs fine on the same machine with the same card.

With OOM error

options.config.mutable_gpu_options()->set_allow_growth(true);
options.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(fraction);

Without OOM error

//options.config.mutable_gpu_options()->set_allow_growth(true);
options.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(fraction);

So trying to solve this problem by setting allow_growth results in a segfault.

@yorickvP

@ymodak This bug is not fixed. Arguably, using any sort of convnet should work in the default configuration. Either allow_growth should be true by default, the bug should be fixed so this works, or there should be a better error message than CUDNN_STATUS_INTERNAL_ERROR.

@nickovs

nickovs commented Apr 13, 2019

@ymodak It looks like this issue was closed prematurely. While there is a workaround for this issue, it involves changing application code. As a result, the example code does not work out of the box on RTX cards, and most recipes online will also need modification.

@ymodak ymodak reopened this Apr 13, 2019
@ymodak ymodak added the type:bug label Apr 13, 2019
@roebel

roebel commented Aug 21, 2020

In case your problem has the same origin as the problems treated in the present issue (which I cannot know from your report), there are a few solutions that you can easily find by reading the last 10-20 posts in this thread.

@bigboy32

I fixed it with this:

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)
sess.as_default()

@Gangadharsmg

Gangadharsmg commented Aug 24, 2020

I had this same issue with an RTX 2080. Then the following code worked for me.

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

Thanks everyone

@nikste
Contributor

nikste commented Aug 24, 2020

I think we can stop posting the allow_growth fix now :)

@drscotthawley

drscotthawley commented Oct 17, 2020

RTX 2070 here. I was getting this error, but running with TF_FORCE_GPU_ALLOW_GROWTH=true (which, as other commenters have pointed out, fixes it for them) changes the error message to an out-of-memory error (even though I've got plenty of memory):

2020-10-17 16:35:11.717658: I tensorflow/stream_executor/cuda/cuda_driver.cc:831] failed to allocate 3.87G (4159818752 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

But my GPU has 8GB, and only about 250MB were in use before I started the process. So I don't understand: why can't it allocate 3.87GB? (Lowering the batch size had no effect; the weights hdf5 file is less than 200MB.)

@TiruBokka

TF_FORCE_GPU_ALLOW_GROWTH=true worked for me.
tf.config.experimental.set_memory_growth(gpu, True) worked too.

Here is my configuration:
GPU GTX 1650
cuda-10-1 10.1.243-1
libcudnn7 7.6.5.32-1+cuda10.1
Ubuntu 18.04.5 LTS

Whoever cannot set the environment variable could try this, as suggested in https://www.tensorflow.org/guide/gpu:

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

@sachinkmohan

Typing the command mentioned in tensorflow/tfjs#671 (comment) into the terminal just worked for me.

@zzhuolun

zzhuolun commented Nov 12, 2020

Just upgrade to TensorFlow 2.3 with CUDA 11 and cuDNN 8.0. It magically solved all my problems, and I don't even need the workaround config.gpu_options.allow_growth = True now.

It seems the issue was noticed and fixed in TensorFlow 2.3.0. My previous setup:

  • CUDA 10.1
  • GPU: Quadro RTX 6000
  • Tensorflow 2.2.0
  • cudnn 7.6.5

Same problem:
tensorflow/stream_executor/cuda/cuda_dnn.cc:328] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

And the workaround allow_growth = True does not help.

After I upgraded TensorFlow to 2.3.0, the problem disappeared, even without adding the line allow_growth = True.

@duongdqq

ok, made it work in tf-nightly-gpu-2.0-preview and ipython notebook adding this to my code:

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

It works in my case.

@wojdzi1607

ok, made it work in tf-nightly-gpu-2.0-preview and ipython notebook adding this to my code:

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

It works; paste it at the start of the Python file you execute. Ubuntu 20.04, Nvidia Docker, TensorFlow 1.15, GTX 1060.

@LiUzHiAn

Hi,

The config.gpu_options.allow_growth = True option also works well with Keras. One can initialize a session and pass it to Keras, something like the following:

from tensorflow.keras import backend as K
import tensorflow as tf

session_config = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
sess = tf.Session(config=session_config)
K.set_session(sess)

Hope it helps.

@google-ml-butler

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale label Jul 26, 2021
@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.

