
Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR #1469

Closed
bobiblazeski opened this issue Mar 29, 2019 · 8 comments
@bobiblazeski

To get help from the community, we encourage using Stack Overflow and the tensorflow.js tag.

TensorFlow.js version

{ 'tfjs-core': '1.0.3',
'tfjs-data': '1.0.3',
'tfjs-layers': '1.0.3',
'tfjs-converter': '1.0.3',
tfjs: '1.0.3',
'tfjs-node': '1.0.2' }

Browser version

Running on node
Ubuntu 18.04

$ nvidia-smi
Fri Mar 29 19:25:37 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2070 Off | 00000000:01:00.0 On | N/A |
| N/A 46C P8 9W / N/A | 879MiB / 7952MiB | 3% Default |
+-------------------------------+----------------------+----------------------+

Describe the problem or feature request

I'm unable to use cuDNN convolutional layers in my model on tfjs-node-gpu. This is possibly related to known issues with the RTX series; in this tensorflow workaround there is a suggestion to use
config.gpu_options.allow_growth = True

Is there such an option in TensorFlow.js?

Code to reproduce the bug / link to feature request

const tf = require('@tensorflow/tfjs-node-gpu');
const model =  tf.sequential({
    layers: [      
      tf.layers.conv2d({
        inputShape:[32, 32, 3],
        filters: 32, 
        kernelSize: [3, 3],
        activation: 'relu',
      }),
      tf.layers.maxPooling2d({poolSize: [2, 2]}),
    ],
  });
// predict() returns a Tensor synchronously, so print it directly
model.predict(tf.randomNormal([4, 32, 32, 3])).print();

$ node index.js
2019-03-29 19:22:37.112495: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-03-29 19:22:37.249964: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-29 19:22:37.250443: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3aa4000 executing computations on platform CUDA. Devices:
2019-03-29 19:22:37.250458: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2019-03-29 19:22:37.271245: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2208000000 Hz
2019-03-29 19:22:37.271958: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3aa2750 executing computations on platform Host. Devices:
2019-03-29 19:22:37.271972: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2019-03-29 19:22:37.272241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.44
pciBusID: 0000:01:00.0
totalMemory: 7.77GiB freeMemory: 6.80GiB
2019-03-29 19:22:37.272275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-29 19:22:37.273295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-29 19:22:37.273308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-03-29 19:22:37.273314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-03-29 19:22:37.273435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6612 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2019-03-29 19:22:38.761993: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-03-29 19:22:38.763178: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/engine.js:132
throw ex;
^

Error: Invalid TF_Status: 2
Message: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
at NodeJSKernelBackend.executeSingleOutput (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-node-gpu/dist/nodejs_kernel_backend.js:192:43)
at NodeJSKernelBackend.conv2d (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-node-gpu/dist/nodejs_kernel_backend.js:700:21)
at environment_1.ENV.engine.runKernel.x (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/ops/conv.js:152:27)
at /home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/engine.js:171:26
at Engine.scopedRun (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/engine.js:126:23)
at Engine.runKernel (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/engine.js:169:14)
at conv2d_ (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/ops/conv.js:151:40)
at Object.conv2d (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/ops/operation.js:46:29)
at /home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-layers/dist/layers/convolutional.js:198:17
at /home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/engine.js:116:22

@bobiblazeski
Author

The same error happens even when there are no convolutional layers in the model.
Models

const actor = () => tf.sequential({
    layers: [
      tf.layers.inputLayer({inputShape: STATE_SIZE}),
      tf.layers.batchNormalization(),
      tf.layers.dense({units: ACTION_SIZE*2, activation:'relu'}),
      tf.layers.dense({units: ACTION_SIZE, activation:'softmax'}),
    ],
  });

  const critic = () => {
    const stateInput = tf.input({shape: [STATE_SIZE]});
    const actionInput = tf.input({shape: [ACTION_SIZE]});
    const bn = tf.layers.batchNormalization().apply(stateInput);
    const d1 = tf.layers.dense({units: ACTION_SIZE*2, activation: 'relu'})
      .apply(bn);
    const d2 = tf.layers.dense({units: ACTION_SIZE,
      activation: 'softmax'}).apply(d1);
    const concat = tf.layers.concatenate().apply([d2, actionInput]);
    const d3 = tf.layers.dense({units: ACTION_SIZE, 
      activation: 'relu'}).apply(concat);
    const output = tf.layers.dense({units: 1}).apply(d3);
    return tf.model({inputs: [stateInput, actionInput], outputs: output});
  }

$ node server/start.js
2019-04-03 20:26:24.022854: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-04-03 20:26:24.151743: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-03 20:26:24.152219: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3ac43c0 executing computations on platform CUDA. Devices:
2019-04-03 20:26:24.152233: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2019-04-03 20:26:24.171244: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2208000000 Hz
2019-04-03 20:26:24.171685: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3ac2b10 executing computations on platform Host. Devices:
2019-04-03 20:26:24.171699: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2019-04-03 20:26:24.171843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.44
pciBusID: 0000:01:00.0
totalMemory: 7.77GiB freeMemory: 6.57GiB
2019-04-03 20:26:24.171855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-03 20:26:24.172565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-03 20:26:24.172575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-03 20:26:24.172579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-03 20:26:24.172688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6389 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
Starting with random weights.
(node:20980) ExperimentalWarning: The fs.promises API is experimental
Listening on 3000
connection
2019-04-03 20:26:27.335442: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-03 20:26:27.335505: W ./tensorflow/stream_executor/stream.h:2099] attempting to perform DNN operation using StreamExecutor without DNN support
2019-04-03 20:26:27.346775: I tensorflow/stream_executor/stream.cc:2079] [stream=0x4a7f370,impl=0x4a7f410] did not wait for [stream=0x4a7ed90,impl=0x4a76260]
2019-04-03 20:26:27.346799: I tensorflow/stream_executor/stream.cc:5027] [stream=0x4a7f370,impl=0x4a7f410] did not memcpy host-to-device; source: 0x4a02a980
2019-04-03 20:26:27.346837: F tensorflow/core/common_runtime/gpu/gpu_util.cc:339] CPU->GPU Memcpy failed

@adwellj

adwellj commented Apr 9, 2019

@bobiblazeski, have you found any resolution to this issue?

I'm currently blocked by this same error.
Ubuntu 18.04
GTX 1660; Driver 418.56; CUDA 10.1 (even though I followed the instructions for 10.0...)

@bobiblazeski
Author

@adwellj Nope, I'm training on the CPU until this is resolved.

@adwellj

adwellj commented Apr 10, 2019

@bobiblazeski, I punted over to trying on Windows and finally got this working. I had to drop down to tfjs-node-gpu version 0.3.2 due to node-gyp issues.

However, once I finally got it to install, I ran into this same cuDNN issue! Fortunately, with CUDA 9.0 (needed for 0.3.2 compatibility) I got a better error message before the "This is probably because cuDNN failed to initialize..." message, stating that tfjs-node-gpu was built against cuDNN version 7.2. Once I downloaded that version, everything worked.

I haven't gone back to see if I can get it to work on the Linux install, but I'm hoping this is just a cuDNN version incompatibility that you could experiment with. Luckily, cuDNN doesn't have an install/uninstall process; it's simply a matter of copying the extracted files into a dedicated directory that you include in your system path.
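Since a version mismatch was the culprit above, a quick way to see which cuDNN headers are on a Linux box is to grep the version macros. This is a hypothetical check: the include path below is an assumption and may differ depending on where you copied the extracted files.

```shell
# Hypothetical path assumption: cuDNN headers copied under /usr/local/cuda/include.
# Older releases define the version in cudnn.h, newer ones in cudnn_version.h;
# the glob covers both. Prints "headers not found" if the path is wrong.
grep -Rh "define CUDNN_MAJOR" /usr/local/cuda/include/cudnn*.h 2>/dev/null \
  || echo "cudnn headers not found"
```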

I hope that helps give you some possible direction!

@piercus
Contributor

piercus commented Oct 1, 2019

As explained in #671 (comment)

There is a workaround: set the environment variable
export TF_FORCE_GPU_ALLOW_GROWTH=true
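As an alternative to exporting the variable in the shell, the flag can also be set from Node itself, a minimal sketch assuming it is assigned before the native TensorFlow binding loads (the flag is read when tfjs-node-gpu initializes the runtime):

```javascript
// The flag must exist before the native binding is loaded, so set it at the
// very top of the entry script, ahead of any tfjs require.
process.env.TF_FORCE_GPU_ALLOW_GROWTH = 'true';

// Only load tfjs-node-gpu after the flag is in place:
// const tf = require('@tensorflow/tfjs-node-gpu');

console.log(process.env.TF_FORCE_GPU_ALLOW_GROWTH);
```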

@3lk0k0

3lk0k0 commented Jan 30, 2020

@adwellj Nope I'm training on CPU until this is resolved.

Have you since managed to train on the GPU?

@Infinitay

Infinitay commented Feb 14, 2020

I am having this issue too, but it seems to resolve itself only when I restart my computer. This seems rather odd to me. I notice that the issue tends to happen after terminating my application(s) that utilize tfjs.

EDIT: I tried adding TF_FORCE_GPU_ALLOW_GROWTH=true as an environment variable, and it seemed to work briefly, but upon trying to run my program once more, the error started appearing again.

@rthadur
Contributor

rthadur commented Feb 14, 2020

This seems to be a duplicate of #671, so we will close this and track the issue in one place. Thank you.

@rthadur rthadur closed this as completed Feb 14, 2020