Does TensorFlow 1.x support async-training? #404

Open
jiahuiyang opened this issue Jul 27, 2021 · 2 comments

Comments

@jiahuiyang

Dear All,
Does TensorFlow 1.x support async-training?
I tried BytePS async-training with the TensorFlow MNIST example. After one batch update with the server, the weights become zeros in the worker.
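
For reference, this is roughly the setup (a minimal sketch, not my exact script; launching with BYTEPS_ENABLE_ASYNC=1 to turn on async mode is my assumption based on the _enable_async flag in the optimizer):

    import tensorflow as tf
    import byteps.tensorflow as bps

    # Sketch of the worker script in the usual BytePS TF1 session flow.
    # Async mode is assumed to be enabled at launch time (e.g. via the
    # BYTEPS_ENABLE_ASYNC=1 environment variable), not in this script.
    bps.init()

    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.placeholder(tf.int64, [None])
    logits = tf.layers.dense(x, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

    opt = tf.train.GradientDescentOptimizer(0.01)
    opt = bps.DistributedOptimizer(opt)
    train_op = opt.minimize(loss)

    hooks = [bps.BroadcastGlobalVariablesHook(0)]
    with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
        # feed one MNIST batch per step; after the first step the weights
        # read back on the worker are already all zeros
        ...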

@eric-haibin-lin (Collaborator)

@ymjiang

@jiahuiyang (Author) commented Jul 29, 2021

@ymjiang
Hi haibin and yimin,
I have two problems with async-training.
The first one is that the delta_w sent to the servers is all zeros, every step. It seems old_tensors changes as vars changes in tensorflow/__init__.py.

    def apply_gradients(self, *args, **kwargs):
        """Calls this same method on the underlying optimizer."""
        if self._enable_async:  # async training
            grads_and_vars = args[0]
            _, vars = zip(*grads_and_vars)
            old_tensors = []
            for var in vars:
                old_tensors.append(tf.convert_to_tensor(var))
            apply_ops = self._optimizer.apply_gradients(*args, **kwargs)
            with tf.control_dependencies([apply_ops]):
                # get the delta
                for i, var in enumerate(vars):
                    old_tensors[i] = tf.subtract(var, old_tensors[i])

                # reuse the _push_pull_grads(), but it is transferring parameters
                updated_tensors = self._push_pull_grads(old_tensors)

                # copy the updated variables back
                assign_op_list = []
                for i, tensor in enumerate(updated_tensors):
                    assign_op_list.append(tf.assign(vars[i], tensor))

            return control_flow_ops.group(*assign_op_list)
        else:
            return self._optimizer.apply_gradients(*args, **kwargs)
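
To show what I mean about the snapshot, here is a small standalone toy (my own sketch, not BytePS code): with a resource variable and two explicit read_value() calls ordered around the update by control dependencies, the delta comes out as the real change instead of zero.

    import tensorflow as tf  # TF 1.x graph mode

    # Toy illustration of the snapshot/ordering problem (a sketch, not library
    # code).  Reading the variable through one shared node for both "old" and
    # "new" makes the difference collapse to zero; two separate reads, ordered
    # around the update, give the real delta.
    v = tf.get_variable("v", initializer=1.0, use_resource=True)

    old = v.read_value()                    # read before the update
    with tf.control_dependencies([old]):
        update = tf.assign_add(v, 5.0)
    with tf.control_dependencies([update]):
        new = v.read_value()                # read after the update
        delta = new - old                   # expect 5.0, not 0.0

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(delta))              # 5.0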

The second one is that the tensor's full declared name differs between the broadcast section and the training section. It seems the weight and delta_weight won't be summed because they are declared under different keys. Please check def _push_pull(tensor, scope='', name=None) and def broadcast(tensor, root_rank, scope='', name=None, is_variable=True) in ops.py.
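
One way to double-check this on the worker (a sketch; the exact op types/names BytePS uses are my assumption, hence the loose filter) is to list the BytePS-related ops after both the broadcast and training parts of the graph have been built and compare the names they were declared under:

    import tensorflow as tf

    # After the graph is fully built (broadcast section + training section),
    # dump anything that looks BytePS-related and compare the declared names.
    # The filter strings are guesses; adjust to whatever ops.py actually uses.
    for op in tf.get_default_graph().get_operations():
        if "byteps" in op.type.lower() or "byteps" in op.name.lower():
            print(op.type, op.name)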

If I misunderstood something, please shed some light on it. Thanks!
