
Cannot merge devices with incompatible jobs #16542

Closed
wangshuaizs opened this issue Jan 29, 2018 · 13 comments

wangshuaizs commented Jan 29, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04 (kernel 4.10)
  • TensorFlow installed from (source or binary): source code
  • TensorFlow version (use command below): r1.4.0
  • Python version: 2.7
  • Bazel version (if compiling from source): 0.5.4
  • GCC/Compiler version (if compiling from source): 5.4.0
  • CUDA/cuDNN version: No CUDA
  • GPU model and memory: No GPU
  • Exact command to reproduce: var_rep = tf.Variable(var_concat, name=var_name, collections=[tf.GraphKeys.LOCAL_VARIABLES], trainable=False)

Describe the problem

I am running distributed TensorFlow with 2 parameter servers and 2 workers. I created some partitioned variables using "tf.create_partitioned_variables()", so the different parts of a variable are placed on different parameter servers. Then I want to concatenate them with "tf.concat()" and assign the result of tf.concat() to an untrainable variable that lives on a worker. But the following error occurs:

Traceback (most recent call last):
  File "distributed_vgg19.py", line 230, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "distributed_vgg19.py", line 188, in main
    sess = sv.prepare_or_wait_for_session(server.target, config=sess_config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 726, in prepare_or_wait_for_session
    max_wait_secs=max_wait_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 400, in wait_for_session
    sess)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 483, in _try_run_local_init_op
    sess.run(self._local_init_op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot colocate nodes 'fc8/fc8_biases_1/fc8/fc8_biases/part_1/read_fc8/fc8_biases_1_0' and 'fc8/fc8_biases/part_1: Cannot merge devices with incompatible jobs: '/job:ps/task:0' and '/job:worker/task:1'
         [[Node: fc8/fc8_biases_1/fc8/fc8_biases/part_1/read_fc8/fc8_biases_1_0 = Identity[T=DT_FLOAT, _class=["loc:@fc8/fc8_biases/part_1"], _device="/job:worker/task:1"](fc8/fc8_biases_1/cond_1/Merge)]]

Caused by op u'fc8/fc8_biases_1/fc8/fc8_biases/part_1/read_fc8/fc8_biases_1_0', defined at:
  File "distributed_vgg19.py", line 230, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "distributed_vgg19.py", line 136, in main
    vgg.build(x, train_mode)
  File "/home/shuai/distributed_vgg19_part/vgg19.py", line 96, in build
    self.fc8 = self.fc_layer(self.relu7, 4096, 1000, "fc8")
  File "/home/shuai/distributed_vgg19_part/vgg19.py", line 120, in fc_layer
    weights, biases = self.get_fc_var(in_size, out_size, name)
  File "/home/shuai/distributed_vgg19_part/vgg19.py", line 143, in get_fc_var
    biases = self.get_var(initial_value, name, 1, name + "_biases")
  File "/home/shuai/distributed_vgg19_part/vgg19.py", line 171, in get_var
    var_rep = tf.Variable(var_concat, name=var_name, collections=[tf.GraphKeys.LOCAL_VARIABLES], trainable=False)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 229, in __init__
    constraint=constraint)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 378, in _init_from_args
    self._build_initializer_expr(self._initial_value),
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 811, in _build_initializer_expr
    new_op = self._build_initializer_expr(initial_value.op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 830, in _build_initializer_expr
    new_tensor = self._build_initializer_expr(tensor)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 811, in _build_initializer_expr
    new_op = self._build_initializer_expr(initial_value.op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 847, in _build_initializer_expr
    attrs=initial_value.node_def.attr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3042, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1521, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot colocate nodes 'fc8/fc8_biases_1/fc8/fc8_biases/part_1/read_fc8/fc8_biases_1_0' and 'fc8/fc8_biases/part_1: Cannot merge devices with incompatible jobs: '/job:ps/task:0' and '/job:worker/task:1'
         [[Node: fc8/fc8_biases_1/fc8/fc8_biases/part_1/read_fc8/fc8_biases_1_0 = Identity[T=DT_FLOAT, _class=["loc:@fc8/fc8_biases/part_1"], _device="/job:worker/task:1"](fc8/fc8_biases_1/cond_1/Merge)]]

Here is the relevant code:

            slice_list = []
            for dim_index in range(value.get_shape().ndims):
                if dim_index == 0:
                    slice_list.append(self.num_of_ps)
                else:
                    slice_list.append(1)
            var_list = tf.create_partitioned_variables(shape=value.get_shape(), 
                                                       slicing=slice_list, 
                                                       initializer=value,
                                                       name=var_name)
            var_concat = var_list[0]
            for ps_index in range(self.num_of_ps - 1):
                var_concat = tf.concat([var_concat, var_list[ps_index + 1]], 0)
            with tf.device('/job:worker/task:%d' % self.task_index):
                var = tf.Variable(var_concat, name=var_name, collections=[tf.GraphKeys.LOCAL_VARIABLES], trainable=False)

Does anybody know why this error happens? Is it caused by a bug? If not, what can I do to assign the value of a variable that lives on a ps to another variable that lives on a worker?
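
To make the intent concrete, this is roughly the copy I am after (just a sketch of the goal; var_local is a hypothetical name, everything else follows the snippet above):

    # Sketch only: a worker-local, untrainable variable that should end up
    # holding the concatenated value of the ps-hosted partitions.
    with tf.device('/job:worker/task:%d' % self.task_index):
        var_local = tf.Variable(tf.zeros(var_concat.get_shape()),
                                name=var_name,
                                collections=[tf.GraphKeys.LOCAL_VARIABLES],
                                trainable=False)
    # An explicit assign op would then pull the value onto the worker when run.
    copy_op = tf.assign(var_local, var_concat)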
Thanks a lot!

@wangshuaizs (Author)

BTW, the result of the concatenation, var_concat, is a tensor that lives on the worker, so I think this assign op has nothing to do with '/job:ps/task:0'. Is that right?

drpngx (Contributor) commented Jan 30, 2018

@alextp does that ring a bell?

alextp (Contributor) commented Jan 30, 2018

I'm confused: why are you creating a variable to store the result of the concatenation of other variables? Why not just use partitioned_variable.as_tensor(), which returns a tensor with the concatenated value of all the partitions?
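
Something like this (a rough sketch, assuming the partitions are created with tf.get_variable and a partitioner rather than tf.create_partitioned_variables; the shape, name, and shard count below are placeholders):

    # Create the variable partitioned across the parameter servers, then read
    # the concatenated value directly -- no extra tf.Variable is needed.
    biases = tf.get_variable(
        "fc8_biases",                                          # placeholder name
        shape=[1000],                                          # placeholder shape
        initializer=tf.zeros_initializer(),
        partitioner=tf.fixed_size_partitioner(num_shards=2))   # one shard per ps

    merged = biases.as_tensor()  # tensor with the concatenation of all partitions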

@tensorflowbutler (Member)

Nagging Assignee: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

alextp added the stat:awaiting response (Status - Awaiting response from author) label on Feb 14, 2018
@tensorflowbutler (Member)

Nagging Assignee @alextp: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

1 similar comment

@tensorflowbutler (Member)

It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue?

1 similar comment

wangshuaizs (Author) commented Apr 16, 2018 via email

tensorflowbutler removed the stat:awaiting response (Status - Awaiting response from author) label on Apr 17, 2018
@fengrussell

@wangshuaizs Could you share your solution?

@felixhao28

@fengrussell
I followed the instructions here to add allow_soft_placement=True. Not sure what the side effects are.
#2285
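
Concretely, that just means building the session config with soft placement enabled (a minimal sketch; sess_config matches the variable name used in the traceback above):

    # Let TensorFlow fall back to a compatible device when a node cannot be
    # placed exactly as requested (e.g. a colocation constraint that spans
    # /job:ps and /job:worker).
    sess_config = tf.ConfigProto(allow_soft_placement=True)

    # Then, as in the original script:
    # sess = sv.prepare_or_wait_for_session(server.target, config=sess_config)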

@tensorflowbutler (Member)

Nagging Assignee @alextp: It has been 15 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@tensorflowbutler (Member)

Nagging Assignee @alextp: It has been 30 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

Labels: None yet
Projects: None yet
Development: No branches or pull requests
6 participants