CPU 100%, but training never starts #109

Open
AmitMY opened this issue Jul 25, 2019 · 2 comments

@AmitMY

AmitMY commented Jul 25, 2019

I ran:

python3 train.py  \
    --X=data/tfrecords/hands_dirty.tfrecords \
    --Y=data/tfrecords/hands_clean.tfrecords \
    --image_size=368

This printed a lot of output (below) and pushed the CPU to 100%. I have been waiting for training steps to appear for an hour now, but training does not seem to start.

Is there a reason it would freeze like this?

(Not only do I see no steps in the terminal, TensorBoard is empty as well.)

2019-07-25 17:44:59.211092: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-25 17:44:59.213841: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-07-25 17:44:59.216295: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-07-25 17:44:59.216873: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-07-25 17:44:59.220049: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-07-25 17:44:59.222446: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-07-25 17:44:59.269580: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-07-25 17:44:59.288028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3
2019-07-25 17:44:59.288551: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-25 17:44:59.907510: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555ff98ec9f0 executing computations on platform CUDA. Devices:
2019-07-25 17:44:59.907590: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-07-25 17:44:59.907617: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-07-25 17:44:59.907637: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (2): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-07-25 17:44:59.907657: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (3): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-07-25 17:44:59.913152: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200045000 Hz
2019-07-25 17:44:59.919999: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555ff819b190 executing computations on platform Host. Devices:
2019-07-25 17:44:59.920055: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): , 
2019-07-25 17:44:59.926804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:02:00.0
2019-07-25 17:44:59.929084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:03:00.0
2019-07-25 17:44:59.931260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:81:00.0
2019-07-25 17:44:59.933309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:82:00.0
2019-07-25 17:44:59.933387: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-25 17:44:59.933422: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-07-25 17:44:59.933451: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-07-25 17:44:59.933480: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-07-25 17:44:59.933509: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-07-25 17:44:59.933537: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-07-25 17:44:59.933567: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-07-25 17:44:59.949195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3
2019-07-25 17:44:59.949264: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-25 17:44:59.957493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-25 17:44:59.957530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1 2 3
2019-07-25 17:44:59.957555: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y N N
2019-07-25 17:44:59.957596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N N N
2019-07-25 17:44:59.957611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2:   N N N Y
2019-07-25 17:44:59.957626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3:   N N Y N
2019-07-25 17:44:59.966648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10064 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
2019-07-25 17:44:59.969143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10481 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2019-07-25 17:44:59.971469: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10481 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:81:00.0, compute capability: 6.1)
2019-07-25 17:44:59.974262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10481 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:82:00.0, compute capability: 6.1)
2019-07-25 17:45:02.640893: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
W0725 17:45:03.616461 139655837968192 deprecation.py:323] From train.py:83: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data module.
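
The last log line shows that train.py starts its input pipeline with queue runners (`start_queue_runners`). One possible cause of a hang like this, offered as an assumption and not something confirmed in this thread, is an input TFRecord file that is empty or was not written completely: a queue-based reader could then keep cycling over the file without ever producing a batch. The minimal sketch below counts the records in a TFRecord file using only the Python standard library. It assumes the standard TFRecord framing (an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload); `count_tfrecords` and the script name `count_tfrecords.py` are hypothetical, not part of this project.

```python
import struct
import sys


def count_tfrecords(path: str) -> int:
    """Count the records in a TFRecord file by walking its framing.

    Assumed on-disk layout of each record:
      uint64 length            (little-endian)
      uint32 masked CRC32C of the length
      bytes  data[length]
      uint32 masked CRC32C of the data
    CRCs are read but not verified, so this checks structure, not integrity.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError(f"{path}: truncated length field in record {count}")
            (length,) = struct.unpack("<Q", header)
            # Length CRC (4 bytes) + payload + payload CRC (4 bytes).
            expected = 4 + length + 4
            rest = f.read(expected)
            if len(rest) < expected:
                raise ValueError(f"{path}: truncated record {count}")
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical usage:
    #   python3 count_tfrecords.py data/tfrecords/hands_dirty.tfrecords data/tfrecords/hands_clean.tfrecords
    for p in sys.argv[1:]:
        n = count_tfrecords(p)
        print(f"{p}: {n} records" + ("  <-- empty!" if n == 0 else ""))
```

Running it on both files passed as `--X` and `--Y` would at least rule out an empty or truncated input. It does not check that the records contain the features train.py expects, so a non-zero count does not by itself explain the hang.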
--

@philthestone

I ran into the same problem. Did you find a solution?

@AmitMY
Author

AmitMY commented Jul 15, 2021

Not that I recall, sorry
