Error in train.py #22

Open
chiolee opened this issue Jul 16, 2018 · 3 comments

Comments

@chiolee

chiolee commented Jul 16, 2018

Hi, thank you very much for your project and your effort!
Do you have any idea why train.py doesn't work? I have TensorFlow 1.4.0. Thank you very much!

pci bus id: 0000:00:0a.0, compute capability: 6.0)
2018-07-16 10:10:26.461140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:0b.0, compute capability: 6.0)
2018-07-16 10:10:26.461146: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:0c.0, compute capability: 6.0)
2018-07-16 10:10:26.461151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:0d.0, compute capability: 6.0)
restore and continue training!
244.201 sec
520.449 sec
Model saved in file: s3://bucket-7601/models/model.ckpt-1001
523.704 sec
Model saved in file: s3://bucket-7601/models/model.ckpt-1152
Traceback (most recent call last):
  File "train/train_tf.py", line 147, in <module>
    log_dir=args.log_path, start_lr=args.learning_rate, wd=args.weight_decay, kp=args.keep_prob)
  File "train/train_tf.py", line 125, in run_training
    coord.join(threads)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
    enqueue_callable()
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1231, in _single_operation_run
    target_list_as_strings, status, None)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.DataLossError: truncated record at 2904972900
[[Node: input/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](input/TFRecordReaderV2, input/input_producer)]]
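
The `DataLossError: truncated record` above is raised by the TFRecord reader, which usually means the input file ends in the middle of a record (for example, after an interrupted write or a partial copy). The thread does not confirm the cause, so the following is only a diagnostic sketch: it walks the TFRecord framing (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC) in plain Python and reports where the first incomplete record starts. The names `find_truncation` and `frame_record` are hypothetical helpers, the CRC fields are skipped rather than verified, and the demo data is synthetic.

```python
import io
import struct

HEADER_BYTES = 12  # uint64 payload length + uint32 CRC of the length
FOOTER_BYTES = 4   # uint32 CRC of the payload


def find_truncation(stream):
    """Return (complete_records, truncated_offset) for a seekable binary stream.

    truncated_offset is None if the stream ends exactly on a record boundary.
    Only the framing is checked; CRC values are not verified.
    """
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    offset = 0
    count = 0
    while offset < size:
        if size - offset < HEADER_BYTES:
            return count, offset  # not even a full header left
        stream.seek(offset)
        (length,) = struct.unpack("<Q", stream.read(8))
        end = offset + HEADER_BYTES + length + FOOTER_BYTES
        if end > size:
            return count, offset  # payload or footer is cut off
        offset = end
        count += 1
    return count, None


def frame_record(payload):
    """Build a TFRecord-shaped frame with placeholder (zero) CRCs, for the demo only."""
    return struct.pack("<Q", len(payload)) + b"\x00" * 4 + payload + b"\x00" * 4


# Synthetic demo: two records, then the same bytes with the tail chopped off.
good = frame_record(b"abc") + frame_record(b"hello")
print(find_truncation(io.BytesIO(good)))
print(find_truncation(io.BytesIO(good[:-3])))
```

To check a real file, pass a local copy opened with `open(path, "rb")`, where `path` is whatever TFRecord file the training script reads. If the reported offset is close to the one in the error message, regenerating or re-copying that file is a reasonable next step.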

@weiyuxingchen

Hello, has this problem been solved? I have the same problem.

@chiolee

chiolee commented Dec 12, 2018

> Hello, has this problem been solved? I have the same problem.

Sorry, I haven't solved the problem so far.

@gmt710

gmt710 commented Dec 9, 2020

Hi, how do you use multiple GPUs?
