
CUDNN_STATUS_INTERNAL_ERROR #1311

Closed

mtyrolski opened this issue Dec 16, 2020 · 1 comment

mtyrolski (Contributor) commented Dec 16, 2020

Description

I am trying to train a model on the cluster and consistently get an error as soon as training starts:

Failed to get convolution algorithm.

Convolution performance may be suboptimal.
2020-12-16 01:16:35.481299: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:349] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-12-16 01:16:35.481342: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:349] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-12-16 01:16:35.481377: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:349] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-12-16 01:16:35.481416: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:772] Failed to determine best cudnn convolution algorithm: Internal: All algorithms tried for convolution %custom-call.484 = (f32[3,512,512]{0,1,2}, u8[0]{0}) custom-call(f32[1,3072,512]{1,2,0} %add.822, f32[1,1024,512]{1,2,0} %add.16755), window={size=3 stride=3}, dim_labels=b0f_0io->b0f, custom_call_target="__cudnn$convBackwardFilter", metadata={op_type="conv_general_dilated" op_name="jit(single_device_update_fn)/conv_general_dilated[ batch_group_count=1\n                                                   dimension_numbers=ConvDimensionNumbers(lhs_spec=(2, 0, 1), rhs_spec=(2, 0, 1), out_spec=(1, 2, 0))\n                                                   feature_group_count=1\n                                                   lhs_dilation=(1,)\n                                                   lhs_shape=(1, 3072, 512)\n                                                   padding=((0, 0),)\n                                                   precision=None\n                                                   rhs_dilation=(3,)\n                                                   rhs_shape=(1, 1024, 512)\n                                                   window_strides=(1,) ]" source_file="/home/mtyrolski/vatican_trax_workspace/20201216_005356/venv/lib/python3.8/site-packages/trax/fastmath/jax.py" source_line=53}, backend_config="{\"algorithm\":\"0\",\"tensor_ops_enabled\":false,\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" failed. Falling back to default algorithm. 

I tried many of the solutions proposed in TensorFlow issues such as tensorflow/tensorflow#24496, but unfortunately none of them helped (one typical attempt is sketched after this paragraph).
Important note: the issue occurs if and only if we use a Convolution layer in our model.
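
For reference, the workarounds suggested in that TensorFlow issue mostly come down to disabling GPU memory preallocation before the process initializes CUDA. A minimal sketch of one such attempt, assuming the standard TF_FORCE_GPU_ALLOW_GROWTH and XLA_PYTHON_CLIENT_PREALLOCATE environment variables (the exact combination we tried is not listed here):

# Ask TF/XLA to grow GPU memory on demand instead of preallocating it,
# and stop JAX from grabbing most of the GPU memory up front.
export TF_FORCE_GPU_ALLOW_GROWTH=true
export XLA_PYTHON_CLIENT_PREALLOCATE=false

# Launch training as usual; the cuDNN handle is created at the first convolution.
python3 -m trax.trainer --config_file=1.gin --output_dir=./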

Environment information

We use the newest version of trax.

mesh-tensorflow==0.1.17
tensor2tensor==1.15.7
tensorboard==2.4.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.1
tensorflow-addons==0.11.2
tensorflow-datasets==4.1.0
tensorflow-estimator==2.3.0
tensorflow-gan==2.0.0
tensorflow-hub==0.10.0
tensorflow-metadata==0.25.0
tensorflow-probability==0.7.0
tensorflow-text==2.3.0
jax==0.2.5
jaxlib==0.1.57


CUDA 10.1.243
cuDNN 7.6.4
Python 3.8.2

Steps to reproduce:

...

TF_FORCE_GPU_ALLOW_GROWTH=true XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/lib/cuda pip3 install --upgrade jax jaxlib==0.1.57+cuda101 -f https://storage.googleapis.com/jax-releases/jax_releases.html
TF_FORCE_GPU_ALLOW_GROWTH=true XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/lib/cuda python3 -m trax.trainer --config_file=1.gin --output_dir=./
mtyrolski (Contributor, Author) commented:

export TF_FORCE_GPU_ALLOW_GROWTH=true
export LD_LIBRARY_PATH=/usr/local/cuda-11/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/lib/cuda/lib64:$LD_LIBRARY_PATH
XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/lib/cuda python3 -m trax.trainer

fixed the problem.
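
For anyone hitting the same error, the fix above can be wrapped into a small launcher script so the library paths and memory-growth flag are always set before trax starts. A sketch, assuming the CUDA paths from this comment (they will differ per machine) and the gin config from the reproduction steps:

#!/usr/bin/env bash
# run_trainer.sh - set GPU memory growth and CUDA library paths before launching trax
set -euo pipefail

export TF_FORCE_GPU_ALLOW_GROWTH=true
export LD_LIBRARY_PATH=/usr/local/cuda-11/lib64:${LD_LIBRARY_PATH:-}
export LD_LIBRARY_PATH=/usr/lib/cuda/lib64:$LD_LIBRARY_PATH

XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/lib/cuda \
  python3 -m trax.trainer --config_file=1.gin --output_dir=./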
