Problems running code with GPU support. #10

Closed

nick-torenvliet opened this issue May 12, 2021 · 4 comments
@nick-torenvliet

Hi,

I'm in a TensorFlow 1.15.0, Python 3 Docker container. GPUs seem to be working fine for simple test tasks, e.g. TF can see four GPUs and I can load them up.

The requirements are all installed.

When I run
CUDA_VISIBLE_DEVICES=* python train.py --model_type gp-vae --data_type physionet --exp_name asdf

Everything works fine - it cycles through the calculation - but it runs on CPU only.

When I run anything else e.g.
python train.py --model_type gp-vae --data_type physionet --exp_name asdf
or
CUDA_VISIBLE_DEVICES=1 python train.py --model_type gp-vae --data_type physionet --exp_name asdf

for any of the reference parameter sets, it bails as per below.
It's not batch size... I've already played with that.

Any ideas?

2021-05-12 21:11:18.088886: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-12 21:11:18.088901: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-12 21:11:18.088915: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-12 21:11:18.088928: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-12 21:11:18.088941: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-12 21:11:18.088954: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-12 21:11:18.088970: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-12 21:11:18.089512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-12 21:11:18.089553: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-12 21:11:18.089562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-05-12 21:11:18.089570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-05-12 21:11:18.090103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 10320 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:5e:00.0, compute capability: 7.5)
GPU support: True
Training...
2021-05-12 21:11:18.097352: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x67c3e70
2021-05-12 21:11:18.097415: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-12 21:11:18.447486: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-12 21:11:18.735398: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-12 21:11:20.163447: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-05-12 21:11:20.170211: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
File "train.py", line 473, in
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "train.py", line 239, in main
trainable_vars = model.get_trainable_vars()
File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 325, in get_trainable_vars
tf.zeros(shape=(1, self.time_length, self.data_dim), dtype=tf.float32))
File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 332, in compute_loss
return self._compute_loss(x, m_mask=m_mask, return_parts=return_parts)
File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 280, in _compute_loss
qz_x = self.encode(x)
File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 220, in encode
return self.encoder(x)
File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 51, in call
mapped = self.net(x)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 898, in call
outputs = self.call(cast_inputs, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/sequential.py", line 269, in call
outputs = layer(inputs, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 898, in call
outputs = self.call(cast_inputs, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/convolutional.py", line 387, in call
return super(Conv1D, self).call(inputs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/convolutional.py", line 197, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1134, in call
return self.conv_op(inp, filter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 639, in call
return self.call(inp, filter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 238, in call
name=self.name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 227, in _conv1d
name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1681, in conv1d
name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1031, in conv2d
data_format=data_format, dilations=dilations, name=name, ctx=_ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1130, in conv2d_eager_fallback
ctx=_ctx, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D] name: sequential/conv1d/conv1d/
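
For reference, a quick way to double-check what TF actually sees inside the container (a minimal sketch, assuming the same TF 1.15 environment; not code from the repo):

import tensorflow as tf

# Physical GPUs TensorFlow can see; this respects CUDA_VISIBLE_DEVICES.
print(tf.config.experimental.list_physical_devices('GPU'))

# Older-style check that also exercises the CUDA runtime.
print(tf.test.is_gpu_available(cuda_only=True))

Note that CUDA_VISIBLE_DEVICES expects a comma-separated list of device indices, e.g. 0,1,2,3; * is not a wildcard, so the CUDA_VISIBLE_DEVICES=* run most likely hid every GPU from CUDA, which would explain why it fell back to CPU.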

@nick-torenvliet
Author

Is this due to tensorflow being bumped up, as per the closed issues?

@dbaranchuk
Member

Hi,

It sounds like the problem is due to an incompatible combination of TF version / CUDA / cuDNN / GPU model. You could try this solution: tensorflow/tensorflow#24496 (comment); it's a pretty common problem. As far as I remember, I successfully tested the current configuration about a year ago.

@nick-torenvliet
Author

Thanks for that,

Got it running on 1.15.0 with the items listed in requirements.txt, in a Docker container.

As per the link you included, I added:
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)

Just after:
tf.compat.v1.enable_eager_execution()

in train.py
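
Putting those pieces together, the top of train.py ends up looking roughly like this (a minimal sketch, assuming TF 1.15's compat.v1 API; the rest of train.py is unchanged):

import tensorflow as tf

# train.py already enables eager execution.
tf.compat.v1.enable_eager_execution()

# Workaround from tensorflow/tensorflow#24496: let the GPU memory pool grow
# on demand instead of pre-allocating the whole card, which is the usual fix
# for the CUDNN_STATUS_INTERNAL_ERROR seen above.
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)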

It appears to be running on a single GPU.

Does this model run on multiple GPUs?

@dbaranchuk
Member

I don't think so. We didn't run multi-GPU training.
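
For anyone who wants to experiment anyway, the generic route in TF 1.15 would be tf.distribute.MirroredStrategy. The sketch below only illustrates the general API on a toy Keras model; it is not an adaptation of the GP-VAE training loop in this repo, which would need its custom loss and eager training loop moved under the strategy:

import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model onto every visible GPU and splits
# each batch across the replicas.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Toy data, just to exercise the multi-GPU path.
x = np.random.randn(512, 10).astype("float32")
y = np.random.randn(512, 1).astype("float32")
model.fit(x, y, batch_size=128, epochs=1)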
