Problems running code with GPU support. #10

Closed

nick-torenvliet opened this issue May 12, 2021 · 4 comments
@nick-torenvliet

Hi,

I'm in a TensorFlow 1.15.0, Python 3 Docker container. GPUs seem to be working fine for simple test tasks, e.g. TF can see four GPUs and I can load them up.

The requirements are all installed.

When I run
CUDA_VISIBLE_DEVICES=* python train.py --model_type gp-vae --data_type physionet --exp_name asdf

Everything works fine - it cycles through the calculation - but it runs on CPU only.

When I run anything else e.g.
python train.py --model_type gp-vae --data_type physionet --exp_name asdf
or
CUDA_VISIBLE_DEVICES=1 python train.py --model_type gp-vae --data_type physionet --exp_name asdf

for any of the reference parameter sets, it bails as per below.
It's not batch size... I've already played with that.

Any ideas?

2021-05-12 21:11:18.088886: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-12 21:11:18.088901: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-12 21:11:18.088915: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-12 21:11:18.088928: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-12 21:11:18.088941: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-12 21:11:18.088954: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-12 21:11:18.088970: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-12 21:11:18.089512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-12 21:11:18.089553: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-12 21:11:18.089562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-05-12 21:11:18.089570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-05-12 21:11:18.090103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 10320 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:5e:00.0, compute capability: 7.5)
GPU support: True
Training...
2021-05-12 21:11:18.097352: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x67c3e70
2021-05-12 21:11:18.097415: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-12 21:11:18.447486: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-12 21:11:18.735398: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-12 21:11:20.163447: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-05-12 21:11:20.170211: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
File "train.py", line 473, in
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "train.py", line 239, in main
trainable_vars = model.get_trainable_vars()
File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 325, in get_trainable_vars
tf.zeros(shape=(1, self.time_length, self.data_dim), dtype=tf.float32))
File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 332, in compute_loss
return self._compute_loss(x, m_mask=m_mask, return_parts=return_parts)
File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 280, in _compute_loss
qz_x = self.encode(x)
File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 220, in encode
return self.encoder(x)
File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 51, in call
mapped = self.net(x)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 898, in call
outputs = self.call(cast_inputs, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/sequential.py", line 269, in call
outputs = layer(inputs, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 898, in call
outputs = self.call(cast_inputs, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/convolutional.py", line 387, in call
return super(Conv1D, self).call(inputs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/convolutional.py", line 197, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1134, in call
return self.conv_op(inp, filter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 639, in call
return self.call(inp, filter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 238, in call
name=self.name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 227, in _conv1d
name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1681, in conv1d
name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1031, in conv2d
data_format=data_format, dilations=dilations, name=name, ctx=_ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1130, in conv2d_eager_fallback
ctx=_ctx, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D] name: sequential/conv1d/conv1d/
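
For reference, a quick way to double-check what TF actually sees inside the container (a minimal sketch, assuming the same TF 1.15 environment; not code from the repo):

import tensorflow as tf

# Physical GPUs TensorFlow can see; this respects CUDA_VISIBLE_DEVICES.
print(tf.config.experimental.list_physical_devices('GPU'))

# Older-style check that also exercises the CUDA runtime.
print(tf.test.is_gpu_available(cuda_only=True))

Note that CUDA_VISIBLE_DEVICES expects a comma-separated list of device indices, e.g. 0,1,2,3; * is not a wildcard, so the CUDA_VISIBLE_DEVICES=* run most likely hid every GPU from CUDA, which would explain why it fell back to CPU.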

@nick-torenvliet
Author

Is this due to tensorflow being bumped up, as per the closed issues?

@dbaranchuk
Member

Hi,

It sounds like the problem is due to an incompatible combination of TF version / CUDA / cuDNN / GPU model. You could try this solution: tensorflow/tensorflow#24496 (comment); it's a pretty common problem. As far as I remember, I successfully tested the current configuration about a year ago.

@nick-torenvliet
Author

Thanks for that,

Got it running on 1.15.0 with the items listed in requirements.txt, in a Docker container.

As per the link you included, I added:
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)

Just after:
tf.compat.v1.enable_eager_execution()

in train.py
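
Putting those pieces together, the top of train.py ends up looking roughly like this (a minimal sketch, assuming TF 1.15's compat.v1 API; the rest of train.py is unchanged):

import tensorflow as tf

# train.py already enables eager execution.
tf.compat.v1.enable_eager_execution()

# Workaround from tensorflow/tensorflow#24496: let the GPU memory pool grow
# on demand instead of pre-allocating the whole card, which is the usual fix
# for the CUDNN_STATUS_INTERNAL_ERROR seen above.
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)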

It appears to be running on a single GPU.

Does this model run on multiple GPUs?

@dbaranchuk
Member

I don't think so. We didn't run multi-GPU training.
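
For anyone who wants to experiment anyway, the generic route in TF 1.15 would be tf.distribute.MirroredStrategy. The sketch below only illustrates the general API on a toy Keras model; it is not an adaptation of the GP-VAE training loop in this repo, which would need its custom loss and eager training loop moved under the strategy:

import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model onto every visible GPU and splits
# each batch across the replicas.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Toy data, just to exercise the multi-GPU path.
x = np.random.randn(512, 10).astype("float32")
y = np.random.randn(512, 1).astype("float32")
model.fit(x, y, batch_size=128, epochs=1)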
