TGAN crashing at Epoch 1 #43

Open
nabarunaguha opened this issue Oct 3, 2019 · 3 comments
Labels
pending review this issue needs to be further reviewed, so work cannot be started

Comments

nabarunaguha commented Oct 3, 2019

Hi,
I have been facing this issue for some time and have not been able to fix it.

  • Python version: 3.7
  • Operating System: Linux
  • TensorFlow version: 1.14.0
  • CUDA version: 10.0

Description

I keep getting the following warning, and then the execution crashes at epoch 1:
TGAN uses CPU

What I Did

import tensorflow as tf
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")

The output shows that TensorFlow detects and uses the GPU without problems:

2019-10-03 13:11:01.720688: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-10-03 13:11:01.768834: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2596780000 Hz
2019-10-03 13:11:01.771431: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56157647a930 executing computations on platform Host. Devices:
2019-10-03 13:11:01.771460: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-10-03 13:11:01.772877: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-10-03 13:11:04.249822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:04:00.0
2019-10-03 13:11:04.250926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:05:00.0
2019-10-03 13:11:04.251999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:09:00.0
2019-10-03 13:11:04.253103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:0a:00.0
2019-10-03 13:11:04.254193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:85:00.0
2019-10-03 13:11:04.255276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:86:00.0
2019-10-03 13:11:04.255566: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-03 13:11:04.256938: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-03 13:11:04.258142: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-03 13:11:04.258427: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-03 13:11:04.260019: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-03 13:11:04.261283: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-03 13:11:04.265096: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-03 13:11:04.277832: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2019-10-03 13:11:04.277873: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-03 13:11:04.284987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-03 13:11:04.285005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 2 3 4 5
2019-10-03 13:11:04.285013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y Y Y N N
2019-10-03 13:11:04.285018: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N Y Y N N
2019-10-03 13:11:04.285023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2:   Y Y N Y N N
2019-10-03 13:11:04.285028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3:   Y Y Y N N N
2019-10-03 13:11:04.285033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 4:   N N N N N Y
2019-10-03 13:11:04.285040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 5:   N N N N Y N
2019-10-03 13:11:04.293727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 7647 MB memory) -> physical GPU (device: 0, name: Tesla M60, pci bus id: 0000:04:00.0, compute capability: 5.2)
2019-10-03 13:11:04.296282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:1 with 7647 MB memory) -> physical GPU (device: 1, name: Tesla M60, pci bus id: 0000:05:00.0, compute capability: 5.2)
2019-10-03 13:11:04.298803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:2 with 7647 MB memory) -> physical GPU (device: 2, name: Tesla M60, pci bus id: 0000:09:00.0, compute capability: 5.2)
2019-10-03 13:11:04.301310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:3 with 7647 MB memory) -> physical GPU (device: 3, name: Tesla M60, pci bus id: 0000:0a:00.0, compute capability: 5.2)
2019-10-03 13:11:04.303979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:4 with 7647 MB memory) -> physical GPU (device: 4, name: Tesla M60, pci bus id: 0000:85:00.0, compute capability: 5.2)
2019-10-03 13:11:04.306456: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:5 with 7647 MB memory) -> physical GPU (device: 5, name: Tesla M60, pci bus id: 0000:86:00.0, compute capability: 5.2)
2019-10-03 13:11:04.310204: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56157ab4cab0 executing computations on platform CUDA. Devices:
2019-10-03 13:11:04.310223: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.310229: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.310234: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (2): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.310239: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (3): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.310244: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (4): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.310249: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (5): Tesla M60, Compute Capability 5.2
2019-10-03 13:11:04.314251: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:04:00.0
2019-10-03 13:11:04.315484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:05:00.0
2019-10-03 13:11:04.316567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:09:00.0
2019-10-03 13:11:04.317632: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:0a:00.0
2019-10-03 13:11:04.318705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:85:00.0
2019-10-03 13:11:04.319780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties:
name: Tesla M60 major: 5 minor: 2 memoryClockRate(GHz): 1.1775
pciBusID: 0000:86:00.0
2019-10-03 13:11:04.319806: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-10-03 13:11:04.319820: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-10-03 13:11:04.319833: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-10-03 13:11:04.319846: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-10-03 13:11:04.319859: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-10-03 13:11:04.319872: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-10-03 13:11:04.319885: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-03 13:11:04.332488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2019-10-03 13:11:04.332811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-03 13:11:04.332823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 2 3 4 5
2019-10-03 13:11:04.332830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y Y Y N N
2019-10-03 13:11:04.332835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N Y Y N N
2019-10-03 13:11:04.332840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2:   Y Y N Y N N
2019-10-03 13:11:04.332845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3:   Y Y Y N N N
2019-10-03 13:11:04.332850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 4:   N N N N N Y
2019-10-03 13:11:04.332856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 5:   N N N N Y N
2019-10-03 13:11:04.340711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 7647 MB memory) -> physical GPU (device: 0, name: Tesla M60, pci bus id: 0000:04:00.0, compute capability: 5.2)
2019-10-03 13:11:04.341796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:1 with 7647 MB memory) -> physical GPU (device: 1, name: Tesla M60, pci bus id: 0000:05:00.0, compute capability: 5.2)
2019-10-03 13:11:04.342889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:2 with 7647 MB memory) -> physical GPU (device: 2, name: Tesla M60, pci bus id: 0000:09:00.0, compute capability: 5.2)
2019-10-03 13:11:04.343989: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:3 with 7647 MB memory) -> physical GPU (device: 3, name: Tesla M60, pci bus id: 0000:0a:00.0, compute capability: 5.2)
2019-10-03 13:11:04.345103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:4 with 7647 MB memory) -> physical GPU (device: 4, name: Tesla M60, pci bus id: 0000:85:00.0, compute capability: 5.2)
2019-10-03 13:11:04.346189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:5 with 7647 MB memory) -> physical GPU (device: 5, name: Tesla M60, pci bus id: 0000:86:00.0, compute capability: 5.2)
Default GPU Device: /device:GPU:0

I set the gpu argument of TGANModel to '/GPU:0' and also tried '/device:GPU:0'.

However, I get the same warning, and the crash still happens while running the first epoch.

I also uninstalled and reinstalled tensorflow-gpu and TGAN to check, but it made no difference.

Regards,
Nabaruna

Contributor

csala commented Oct 4, 2019

Hi @nabarunaguha

Would you mind sharing a short code snippet that shows the exact arguments that you use when creating the TGAN instance and calling the fit and sample methods?

We will then try to reproduce the error to be able to assist you better.
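
A minimal sketch of one way to collect that information is shown here: it wraps the failing call and, on error, returns the environment details together with the full traceback as text that can be pasted into the issue. The helper name `run_and_capture` and the stand-in `failing_step` are hypothetical and not part of TGAN; in practice the wrapped call would be something like `tgan.fit(data)`.

```python
import platform
import sys
import traceback


def run_and_capture(func, *args, **kwargs):
    """Call func and return (result, report).

    report is None on success. On failure it is a string holding basic
    environment details and the full traceback.
    """
    try:
        return func(*args, **kwargs), None
    except Exception:
        report = "\n".join([
            "Python version: " + platform.python_version(),
            "Operating System: " + platform.system(),
            "Traceback:",
            traceback.format_exc(),
        ])
        return None, report


if __name__ == "__main__":
    # Hypothetical stand-in for the failing call, e.g. tgan.fit(data).
    def failing_step():
        raise RuntimeError("crash at epoch 1")

    _, report = run_and_capture(failing_step)
    print(report, file=sys.stderr)
```

This only helps when the crash surfaces as a Python exception. If the process dies without one, the standard-library `faulthandler` module (`faulthandler.enable()`) can dump the Python stack on fatal signals instead.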

Also, regarding the GPU usage, please check this other issue: #34

So, basically, the gpu argument is now being ignored, and all that matters with regard to GPU usage is whether you have installed tensorflow or tensorflow-gpu.
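
Since the installed package is what decides GPU usage here, a quick way to see which one is present may help. The sketch below is an illustration only, assuming a TensorFlow 1.14-style setup where `tensorflow` and `tensorflow-gpu` are separate pip distributions; the helper names are hypothetical, and on Python 3.7 it assumes the `importlib_metadata` backport is installed.

```python
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # backport, assumed installed on 3.7


def installed_version(dist_name):
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def describe_tensorflow_install(versions):
    """Summarise which TensorFlow pip distributions are present.

    versions maps a distribution name to its version string, or to None
    when that distribution is not installed.
    """
    cpu = versions.get("tensorflow")
    gpu = versions.get("tensorflow-gpu")
    if cpu and gpu:
        return "both tensorflow and tensorflow-gpu are installed"
    if gpu:
        return "only tensorflow-gpu is installed"
    if cpu:
        return "only tensorflow is installed"
    return "no TensorFlow package found"


if __name__ == "__main__":
    names = ("tensorflow", "tensorflow-gpu")
    print(describe_tensorflow_install({n: installed_version(n) for n in names}))
```

The lookup reads package metadata only, so it does not import TensorFlow or touch the GPU.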

csala added the "pending review" label (this issue needs to be further reviewed, so work cannot be started) on Oct 4, 2019
Author

nabarunaguha commented Oct 4, 2019

Hi @csala ,

Yeah sure, here are my arguments.
from tgan.model import TGANModel

tgan = TGANModel(
    continuous_columns, output='output', gpu='/device:GPU:0', max_epoch=5,
    steps_per_epoch=150, save_checkpoints=False, restore_session=False,
    batch_size=50, z_dim=50, noise=0.2, l2norm=0.00001, learning_rate=0.001,
    num_gen_rnn=100, num_gen_feature=100, num_dis_layers=1, num_dis_hidden=100,
    optimizer='AdamOptimizer')

tgan.fit(data)
model_path = '/home/naguha/ModelSave/ModelCheck.pkl'
num_samples = 20868
samples = tgan.sample(num_samples)
samples.to_csv(r'/home/naguha/Samples_TGAN.csv', index=False, header=True)

And I installed tensorflow-gpu==1.14

@lablebi96

Hello, is there any news on this issue?
