I am trying to train a model on a cluster and consistently get the following error as soon as training starts:
Failed to get convolution algorithm.
Convolution performance may be suboptimal.
2020-12-16 01:16:35.481299: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:349] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-12-16 01:16:35.481342: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:349] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-12-16 01:16:35.481377: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:349] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-12-16 01:16:35.481416: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:772] Failed to determine best cudnn convolution algorithm: Internal: All algorithms tried for convolution %custom-call.484 = (f32[3,512,512]{0,1,2}, u8[0]{0}) custom-call(f32[1,3072,512]{1,2,0} %add.822, f32[1,1024,512]{1,2,0} %add.16755), window={size=3 stride=3}, dim_labels=b0f_0io->b0f, custom_call_target="__cudnn$convBackwardFilter", metadata={op_type="conv_general_dilated" op_name="jit(single_device_update_fn)/conv_general_dilated[ batch_group_count=1\n dimension_numbers=ConvDimensionNumbers(lhs_spec=(2, 0, 1), rhs_spec=(2, 0, 1), out_spec=(1, 2, 0))\n feature_group_count=1\n lhs_dilation=(1,)\n lhs_shape=(1, 3072, 512)\n padding=((0, 0),)\n precision=None\n rhs_dilation=(3,)\n rhs_shape=(1, 1024, 512)\n window_strides=(1,) ]" source_file="/home/mtyrolski/vatican_trax_workspace/20201216_005356/venv/lib/python3.8/site-packages/trax/fastmath/jax.py" source_line=53}, backend_config="{\"algorithm\":\"0\",\"tensor_ops_enabled\":false,\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" failed. Falling back to default algorithm.
I tried many of the solutions proposed in TensorFlow issues such as tensorflow/tensorflow#24496, but unfortunately none of them helped. Important note: the issue occurs if and only if the model contains a convolution layer.
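For anyone trying to reproduce this: the workarounds in that TensorFlow thread mostly concern GPU memory preallocation, and `Could not create cudnn handle` is often reported when little GPU memory is left for cuDNN. Below is a minimal sketch of the JAX-side counterpart, assuming memory pressure is the cause (which is not confirmed for this report). The helper name `configure_gpu_memory` is hypothetical; the three environment variables are the documented JAX and TensorFlow ones, and they need to be set before `jax`, `tensorflow`, or `trax` is imported.

```python
import os


def configure_gpu_memory(preallocate=False, mem_fraction=None):
    """Hypothetical helper (not part of Trax or JAX).

    Sets environment variables that control GPU memory allocation.
    Must be called before jax / tensorflow / trax are imported.
    """
    # JAX preallocates most of the GPU memory by default; "false" disables that.
    os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "true" if preallocate else "false"
    # Ask TensorFlow (imported by Trax) to grow its allocation on demand.
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    if mem_fraction is not None:
        if not 0.0 < mem_fraction <= 1.0:
            raise ValueError("mem_fraction must be in (0, 1]")
        # Fraction of GPU memory JAX may preallocate when preallocation is on.
        os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = str(mem_fraction)
    keys = (
        "XLA_PYTHON_CLIENT_PREALLOCATE",
        "XLA_PYTHON_CLIENT_MEM_FRACTION",
        "TF_FORCE_GPU_ALLOW_GROWTH",
    )
    return {k: os.environ[k] for k in keys if k in os.environ}


settings = configure_gpu_memory(preallocate=False)
print(settings)
# Import jax / trax only after this point so the settings take effect.
```

If the error persists with these settings, memory preallocation is probably not the cause, and a mismatch between the installed CUDA/cuDNN versions and the jaxlib build would be the next thing to check.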
Environment information
We use the newest version of Trax.
Steps to reproduce:
...