
InvalidArgumentError (see above for traceback): Cannot assign a device to node 'save/RestoreV2_8' #11

Open
anseey opened this issue Dec 30, 2016 · 3 comments

Comments

@anseey

anseey commented Dec 30, 2016

distributed/cancer_classifier.py only works when all jobs run in a single Docker container.

It works when the ps and worker processes run in the same container:

# both on 127.17.0.3
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.3:8223 --job_name=ps --task_index=0
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.3:8223 --job_name=worker --task_index=0

But it does not work in two containers:

# ps on 127.17.0.3
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.4:8223 --job_name=ps --task_index=0
# worker on 127.17.0.4
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.4:8223 --job_name=worker --task_index=0
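
(For reference, I assume the script turns these flags into a cluster spec roughly like the sketch below; the names are placeholders, not the exact code from cancer_classifier.py. Every process has to be started with the same --ps_hosts/--worker_hosts so that all servers agree on the cluster membership.)

import tensorflow as tf

# Hypothetical sketch of how the --ps_hosts/--worker_hosts flags are usually
# parsed in distributed TensorFlow scripts; not the exact code from the repo.
ps_hosts = "127.17.0.3:8222".split(",")
worker_hosts = "127.17.0.4:8223".split(",")

cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
# This process registers itself as worker task 0 and serves on 127.17.0.4:8223.
server = tf.train.Server(cluster, job_name="worker", task_index=0)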

The error message I got from the worker:

I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job ps -> {0 -> 127.17.0.3:8222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job worker -> {0 -> localhost:8222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:211] Started server with target: grpc://localhost:8222
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py:344 in __init__.: __init__ (from tensorflow.python.training.summary_io) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.FileWriter. The interface and behavior is the same; this is just a rename.
I tensorflow/core/distributed_runtime/master_session.cc:993] Start master session 91acfc1008531f4d with config:

Traceback (most recent call last):
  File "cancer_classifier_new.py", line 241, in <module>
    tf.app.run(main=main)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "cancer_classifier_new.py", line 209, in main
    with sv.managed_session(server.target) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 974, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 802, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 963, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 720, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 227, in prepare_session
    config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 173, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1388, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device to node 'save/RestoreV2_8': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:worker/replica:0/task:0/cpu:0
	 [[Node: save/RestoreV2_8 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save/Const, save/RestoreV2_8/tensor_names, save/RestoreV2_8/shape_and_slices)]]

Caused by op u'save/RestoreV2_8', defined at:
  File "cancer_classifier_new.py", line 241, in <module>
    tf.app.run(main=main)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "cancer_classifier_new.py", line 191, in main
    saver = tf.train.Saver()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1000, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1030, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 624, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 361, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 200, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 441, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Cannot assign a device to node 'save/RestoreV2_8': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:worker/replica:0/task:0/cpu:0
	 [[Node: save/RestoreV2_8 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save/Const, save/RestoreV2_8/tensor_names, save/RestoreV2_8/shape_and_slices)]]
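
For context, I believe the script follows the usual distributed pattern sketched below (assumed variable names, not the exact code). Because replica_device_setter pins the variables to /job:ps/task:0, the Saver's RestoreV2 ops that feed those variables are pinned there too, which is exactly the device named in the error; the restore can only succeed if the session created against server.target can actually reach a ps server registered under that cluster spec.

import tensorflow as tf

# Hypothetical sketch of the usual distributed pattern; names are assumed.
cluster = tf.train.ClusterSpec({"ps": ["127.17.0.3:8222"],
                                "worker": ["127.17.0.4:8223"]})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables (and the RestoreV2 ops that restore them) are placed on the ps job.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    weights = tf.Variable(tf.zeros([10]), name="weights")
    saver = tf.train.Saver()

sv = tf.train.Supervisor(is_chief=True, logdir="/tmp/train_logs", saver=saver)
# Restoring the checkpoint requires a reachable ps server at 127.17.0.3:8222.
with sv.managed_session(server.target) as sess:
    pass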
@anseey anseey closed this as completed Dec 31, 2016
@anseey anseey reopened this Dec 31, 2016
@tobegit3hub
Owner

It may be a bug in the distributed training script with the latest TensorFlow.

I will refactor the distributed training code soon. If you want to run a distributed TensorFlow application, please try tobegit3hub/distributed_tensorflow, which is much better now.

@anseey
Author

anseey commented Jan 3, 2017

@tobegit3hub Thank you!
I have tried tobegit3hub/distributed_tensorflow, and it works!
But it still has the problem described in https://github.com/tensorflow/tensorflow/issues/5110

@tobegit3hub
Owner

Yes, that's something we're working on now.
