
Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR #1469

Closed
bobiblazeski opened this issue Mar 29, 2019 · 8 comments
@bobiblazeski

To get help from the community, we encourage using Stack Overflow and the tensorflow.js tag.

TensorFlow.js version

{ 'tfjs-core': '1.0.3',
'tfjs-data': '1.0.3',
'tfjs-layers': '1.0.3',
'tfjs-converter': '1.0.3',
tfjs: '1.0.3',
'tfjs-node': '1.0.2' }

Browser version

Running on node
Ubuntu 18.04

$ nvidia-smi
Fri Mar 29 19:25:37 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2070 Off | 00000000:01:00.0 On | N/A |
| N/A 46C P8 9W / N/A | 879MiB / 7952MiB | 3% Default |
+-------------------------------+----------------------+----------------------+

Describe the problem or feature request

I'm unable to use cuDNN convolutional layers in my model on tfjs-node-gpu. This is possibly related to known issues with the RTX series; in this tensorflow workaround there is a suggestion to use
config.gpu_options.allow_growth = True

Is there such an option in TensorFlow.js?

Code to reproduce the bug / link to feature request

const tf = require('@tensorflow/tfjs-node-gpu');
const model =  tf.sequential({
    layers: [      
      tf.layers.conv2d({
        inputShape:[32, 32, 3],
        filters: 32, 
        kernelSize: [3, 3],
        activation: 'relu',
      }),
      tf.layers.maxPooling2d({poolSize: [2, 2]}),
    ],
  });
// predict() returns a Tensor synchronously, so print it directly
model.predict(tf.randomNormal([4, 32, 32, 3])).print();

$ node index.js
2019-03-29 19:22:37.112495: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-03-29 19:22:37.249964: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-29 19:22:37.250443: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3aa4000 executing computations on platform CUDA. Devices:
2019-03-29 19:22:37.250458: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2019-03-29 19:22:37.271245: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2208000000 Hz
2019-03-29 19:22:37.271958: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3aa2750 executing computations on platform Host. Devices:
2019-03-29 19:22:37.271972: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2019-03-29 19:22:37.272241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.44
pciBusID: 0000:01:00.0
totalMemory: 7.77GiB freeMemory: 6.80GiB
2019-03-29 19:22:37.272275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-29 19:22:37.273295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-29 19:22:37.273308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-03-29 19:22:37.273314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-03-29 19:22:37.273435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6612 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2019-03-29 19:22:38.761993: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-03-29 19:22:38.763178: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/engine.js:132
throw ex;
^

Error: Invalid TF_Status: 2
Message: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
at NodeJSKernelBackend.executeSingleOutput (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-node-gpu/dist/nodejs_kernel_backend.js:192:43)
at NodeJSKernelBackend.conv2d (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-node-gpu/dist/nodejs_kernel_backend.js:700:21)
at environment_1.ENV.engine.runKernel.x (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/ops/conv.js:152:27)
at /home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/engine.js:171:26
at Engine.scopedRun (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/engine.js:126:23)
at Engine.runKernel (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/engine.js:169:14)
at conv2d_ (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/ops/conv.js:151:40)
at Object.conv2d (/home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/ops/operation.js:46:29)
at /home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-layers/dist/layers/convolutional.js:198:17
at /home/bobi/Desktop/cudnn/node_modules/@tensorflow/tfjs-core/dist/engine.js:116:22

@bobiblazeski
Author

The same error happens even when there are no convolutional layers in the model.
Models

const actor = () => tf.sequential({
    layers: [
      tf.layers.inputLayer({inputShape: STATE_SIZE}),
      tf.layers.batchNormalization(),
      tf.layers.dense({units: ACTION_SIZE*2, activation:'relu'}),
      tf.layers.dense({units: ACTION_SIZE, activation:'softmax'}),
    ],
  });

  const critic = () => {
    const stateInput = tf.input({shape: [STATE_SIZE]});
    const actionInput = tf.input({shape: [ACTION_SIZE]});
    const bn = tf.layers.batchNormalization().apply(stateInput);
    const d1 = tf.layers.dense({units: ACTION_SIZE*2, activation: 'relu'})
      .apply(bn);
    const d2 = tf.layers.dense({units: ACTION_SIZE,
      activation: 'softmax'}).apply(d1);
    const concat = tf.layers.concatenate().apply([d2, actionInput]);
    const d3 = tf.layers.dense({units: ACTION_SIZE, 
      activation: 'relu'}).apply(concat);
    const output = tf.layers.dense({units: 1}).apply(d3);
    return tf.model({inputs: [stateInput, actionInput], outputs: output});
  }

$ node server/start.js
2019-04-03 20:26:24.022854: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-04-03 20:26:24.151743: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-03 20:26:24.152219: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3ac43c0 executing computations on platform CUDA. Devices:
2019-04-03 20:26:24.152233: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2019-04-03 20:26:24.171244: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2208000000 Hz
2019-04-03 20:26:24.171685: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3ac2b10 executing computations on platform Host. Devices:
2019-04-03 20:26:24.171699: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2019-04-03 20:26:24.171843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.44
pciBusID: 0000:01:00.0
totalMemory: 7.77GiB freeMemory: 6.57GiB
2019-04-03 20:26:24.171855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-04-03 20:26:24.172565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-03 20:26:24.172575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-04-03 20:26:24.172579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-04-03 20:26:24.172688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6389 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
Starting with random weights.
(node:20980) ExperimentalWarning: The fs.promises API is experimental
Listening on 3000
connection
2019-04-03 20:26:27.335442: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-03 20:26:27.335505: W ./tensorflow/stream_executor/stream.h:2099] attempting to perform DNN operation using StreamExecutor without DNN support
2019-04-03 20:26:27.346775: I tensorflow/stream_executor/stream.cc:2079] [stream=0x4a7f370,impl=0x4a7f410] did not wait for [stream=0x4a7ed90,impl=0x4a76260]
2019-04-03 20:26:27.346799: I tensorflow/stream_executor/stream.cc:5027] [stream=0x4a7f370,impl=0x4a7f410] did not memcpy host-to-device; source: 0x4a02a980
2019-04-03 20:26:27.346837: F tensorflow/core/common_runtime/gpu/gpu_util.cc:339] CPU->GPU Memcpy failed

@adwellj

adwellj commented Apr 9, 2019

@bobiblazeski, have you found any resolution to this issue?

I'm currently blocked by this same error.
Ubuntu 18.04
GTX 1660; Driver 418.56; CUDA 10.1 (even though I followed the instructions for 10.0...)

@bobiblazeski
Author

@adwellj Nope, I'm training on the CPU until this is resolved.

@adwellj

adwellj commented Apr 10, 2019

@bobiblazeski, I punted over to trying on Windows and finally got this working. I had to drop down to tfjs-node-gpu version 0.3.2 due to node-gyp issues.

However, once I finally got it to install, I ran into this same cuDNN issue! Fortunately, with CUDA 9.0 (needed for 0.3.2 compatibility) I got a better error message before the "This is probably because cuDNN failed to initialize..." message, stating that tfjs-node-gpu was built against cuDNN version 7.2. Once I downloaded that version, everything worked.

I haven't gone back to see if I can get it to work on the Linux install, but I'm hoping this is just a cuDNN version incompatibility that you could experiment with. Luckily, cuDNN doesn't have an install/uninstall process; it's simply a matter of copying the extracted files into a dedicated directory that you include in your system path.
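Since a version mismatch was the culprit above, a quick way to see which cuDNN headers are on a Linux box is to grep the version macros. This is a hypothetical check: the include path below is an assumption and may differ depending on where you copied the extracted files.

```shell
# Hypothetical path assumption: cuDNN headers copied under /usr/local/cuda/include.
# Older releases define the version in cudnn.h, newer ones in cudnn_version.h;
# the glob covers both. Prints "headers not found" if the path is wrong.
grep -Rh "define CUDNN_MAJOR" /usr/local/cuda/include/cudnn*.h 2>/dev/null \
  || echo "cudnn headers not found"
```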

I hope that helps give you some possible direction!

@piercus
Contributor

piercus commented Oct 1, 2019

As explained in #671 (comment)

There is a workaround: set the environment variable
export TF_FORCE_GPU_ALLOW_GROWTH=true
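As an alternative to exporting the variable in the shell, the flag can also be set from Node itself, a minimal sketch assuming it is assigned before the native TensorFlow binding loads (the flag is read when tfjs-node-gpu initializes the runtime):

```javascript
// The flag must exist before the native binding is loaded, so set it at the
// very top of the entry script, ahead of any tfjs require.
process.env.TF_FORCE_GPU_ALLOW_GROWTH = 'true';

// Only load tfjs-node-gpu after the flag is in place:
// const tf = require('@tensorflow/tfjs-node-gpu');

console.log(process.env.TF_FORCE_GPU_ALLOW_GROWTH);
```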

@3lk0k0

3lk0k0 commented Jan 30, 2020

@adwellj Nope I'm training on CPU until this is resolved.

Have you since managed to train on the GPU?

@Infinitay

Infinitay commented Feb 14, 2020

I am having this issue too, but it seems to resolve itself only when I restart my computer. This seems rather odd to me. I notice that the issue tends to happen after terminating my application(s) that utilize tfjs.

EDIT: I tried adding TF_FORCE_GPU_ALLOW_GROWTH=true as an environment variable, and it seemed to work briefly, but upon trying to run my program once more, the error started appearing again.

@rthadur
Contributor

rthadur commented Feb 14, 2020

This seems to be a duplicate of #671, so we will close this and track the issue in one place. Thank you.

@rthadur rthadur closed this as completed Feb 14, 2020