[WIP] distributed training #1334

ZiyueHuang · 2020-09-01T10:08:59Z

Description

Based on this branch (https://github.com/ZiyueHuang/byteps/tree/mx2), we can perform distributed training for the electra model (and other models). Tested both on a single worker and two workers, each has multiple GPU cards. However, there are two issues:

We should first call trainer._init_params() before the first iteration (forward/backward) to synchronize the parameters across all workers, otherwise trainer will call _init_params inside allreduce_grads (see mxnet/python/mxnet/gluon/trainer.py), thus the synchronization of the parameters actually happen after the first forward/backward computation, meaning that in the first iteration the gradients on different workers are computed w.r.t. different parameters. But in practice this may not be a severe problem, as only the first gradient descent step is not totally correct.
When the model in gluon-nlp confirms to the new coding standard (removing defer_init for all parameters @sxjscience ), then we can directly call trainer._init_params after model.initialize.
Actually I am a little confused about the semantic meaning of defer_init in the numpy mode, because shape with zeros (such as (0, 5)) in the numpy mode will be treated as scalar tensors' shape and zero-size tensors' shape, instead of as unknown in the legacy mode.
I found that we have to build MXNet and BytePS from source on our target machine. Using pip install mxnet (I have tried several wheels in August) will not work for BytePS, either core dump immediately after import bps.mxnet due to undefined symbols (seems because some symbols are not exported into binary, fixed after 0820 wheel), or mysterious segfault (maybe due to c++ ABI issues or other compiler issues, as the mxnet wheels and BytePS (which is setup on our target machine) may not be compiled using the same version of gcc. Besides, there are no related compiler flags such as D_GLIBCXX_USE_CXX11_ABI for MXNet in BytePS's setup.py).

Checklist

Essentials

PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

cc @dmlc/gluon-nlp-team

zheyuye

Looks good and building MXNet from source will work for distributed training

codecov · 2020-09-01T11:46:41Z

Codecov Report

Merging #1334 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #1334   +/-   ##
=======================================
  Coverage   67.34%   67.34%           
=======================================
  Files           2        2           
  Lines          98       98           
=======================================
  Hits           66       66           
  Misses         32       32

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c076093...ec793d5. Read the comment docs.

ZiyueHuang · 2020-09-01T12:03:50Z

Just found that the API behavior in BytePS master branch changes recently...... Not sure if that is intended or bugs. Tracked here (bytedance/byteps#292).

sxjscience · 2020-09-01T17:40:10Z

@ZiyueHuang You may merge the upstream/master since we recently fixed the CI.

szha · 2020-09-01T19:17:18Z

undefined symbols

What are they?

szhengac · 2020-09-01T21:06:03Z

Horovod also has similar issue that the actual broadcast happens after the first iteration. See here

sxjscience · 2020-09-01T21:09:29Z

To avoid the problems of horovod, in GluonNLP v1, the convention is to not rely on deferred initialization when implementing the model. For example, for dense layer, we should always give the in_units.

ZiyueHuang · 2020-09-03T04:12:10Z

Sorry for the late reply.

@szha core dump due to undefined symbols is fixed after 0820 wheel, and I didn't record the undefined symbols. For the segfault, below is the stack trace

Fatal Error: Segmentation fault
Stack trace:
Segmentation fault (core dumped)
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/bin/bpslaunch", line 4, in <module>
    __import__('pkg_resources').run_script('byteps==0.2.4', 'bpslaunch')
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1464, in run_script
    exec(code, namespace, namespace)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 213, in <module>
    launch_bps()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 206, in launch_bps
    t[i].join()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 34, in join
    raise self.exc
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 27, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/byteps-0.2.4-py3.7-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 177, in worker
    stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/home/ubuntu/anaconda3/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python ./test.py' returned non-zero exit status 139.

@sxjscience @szhengac Horovod doesn't have similar issue, since in hvd.broadcast_parameters when there are defer_init parameters, it will register a hook to param.init_impl (see https://github.com/horovod/horovod/blob/9464e20ab2565f25105d5d7f6194ca90d6fca9e5/horovod/mxnet/__init__.py#L121-L127), thus immediately after the actual initialization, Horovod will broadcast the parameters, then forward/backward. However, there are no similar mechanisms in BytePS to handle the logic of defer_init.

…train

leezu · 2020-09-04T16:23:11Z

Actually I am a little confused about the semantic meaning of defer_init in the numpy mode, because shape with zeros (such as (0, 5)) in the numpy mode will be treated as scalar tensors' shape and zero-size tensors' shape, instead of as unknown in the legacy mode.

You can report it as a bug in MXNet issue tracker

distributed training

b96c07d

ZiyueHuang requested a review from a team as a code owner September 1, 2020 10:08

zheyuye reviewed Sep 1, 2020

View reviewed changes

Merge branch 'master' of https://github.com/dmlc/gluon-nlp into dist_…

ec793d5

…train

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] distributed training #1334

[WIP] distributed training #1334

ZiyueHuang commented Sep 1, 2020 •

edited

zheyuye left a comment •

edited

codecov bot commented Sep 1, 2020 •

edited

ZiyueHuang commented Sep 1, 2020

sxjscience commented Sep 1, 2020

szha commented Sep 1, 2020

szhengac commented Sep 1, 2020

sxjscience commented Sep 1, 2020

ZiyueHuang commented Sep 3, 2020 •

edited

leezu commented Sep 4, 2020

[WIP] distributed training #1334

Are you sure you want to change the base?

[WIP] distributed training #1334

Conversation

ZiyueHuang commented Sep 1, 2020 • edited

Description

Checklist

Essentials

Changes

Comments

zheyuye left a comment • edited

Choose a reason for hiding this comment

codecov bot commented Sep 1, 2020 • edited

Codecov Report

ZiyueHuang commented Sep 1, 2020

sxjscience commented Sep 1, 2020

szha commented Sep 1, 2020

szhengac commented Sep 1, 2020

sxjscience commented Sep 1, 2020

ZiyueHuang commented Sep 3, 2020 • edited

leezu commented Sep 4, 2020

ZiyueHuang commented Sep 1, 2020 •

edited

zheyuye left a comment •

edited

codecov bot commented Sep 1, 2020 •

edited

ZiyueHuang commented Sep 3, 2020 •

edited