benchmark with cross barrier error #427

Open
panpanli521 opened this issue Feb 8, 2022 · 0 comments

I benchmarked the performance of BytePS with cross barrier using the script in /example/pytorch/benchmark_cross_barrier_byteps.py.

The complete commands are as follows:

  • scheduler:
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=scheduler
    export DMLC_NUM_SERVER=2
    export DMLC_PS_ROOT_URI=ip1
    export DMLC_PS_ROOT_PORT=1234
    export DMLC_INTERFACE=xgbe1
    export DMLC_NODE_HOST=ip1
    bpslaunch

  • server1:
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=server
    export DMLC_NUM_SERVER=2
    export DMLC_PS_ROOT_URI=ip1
    export DMLC_PS_ROOT_PORT=1234
    export DMLC_INTERFACE=xgbe1
    export DMLC_NODE_HOST=ip1
    bpslaunch

  • server2:
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=server
    export DMLC_NUM_SERVER=2
    export DMLC_PS_ROOT_URI=ip1
    export DMLC_PS_ROOT_PORT=1234
    export DMLC_INTERFACE=xgbe1
    export DMLC_NODE_HOST=ip2
    bpslaunch

  • worker1:
    export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    export DMLC_WORKER_ID=0
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=worker
    export DMLC_NUM_SERVER=2
    export DMLC_PS_ROOT_URI=ip1
    export DMLC_PS_ROOT_PORT=1234 # the scheduler port
    export DMLC_INTERFACE=xgbe1
    export DMLC_NODE_HOST=ip3
    bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_cross_barrier_byteps.py --model resnet50 --batch-size 64 --num-iters 500

  • worker2:
    export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    export DMLC_WORKER_ID=1
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=worker
    export DMLC_NUM_SERVER=2
    export DMLC_PS_ROOT_URI=ip1
    export DMLC_PS_ROOT_PORT=1234
    export DMLC_INTERFACE=xgbe1
    export DMLC_NODE_HOST=ip4
    bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_cross_barrier_byteps.py --model resnet50 --batch-size 64 --num-iters 500
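
For anyone reproducing this, the five command sets above differ only in role, node host, and (for workers) worker ID and visible GPUs. A minimal sketch of a helper that sets the shared values in one place might look like the following. The function name `byteps_env` is hypothetical and not part of BytePS; the variable names and values are copied from the commands above.

```shell
# Hypothetical helper (not part of BytePS): export the environment used in this issue.
# Usage: byteps_env ROLE NODE_HOST [WORKER_ID]
byteps_env() {
    # Values that are identical on every node in the commands above.
    export DMLC_NUM_WORKER=2
    export DMLC_NUM_SERVER=2
    export DMLC_PS_ROOT_URI=ip1
    export DMLC_PS_ROOT_PORT=1234   # the scheduler port
    export DMLC_INTERFACE=xgbe1

    # Values that differ per node.
    export DMLC_ROLE="$1"
    export DMLC_NODE_HOST="$2"

    # Workers additionally need an ID and the list of visible GPUs.
    if [ "$1" = "worker" ]; then
        if [ -z "${3:-}" ]; then
            echo "byteps_env: the worker role needs a WORKER_ID" >&2
            return 1
        fi
        export DMLC_WORKER_ID="$3"
        export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    fi
}

# Example (commented out so that sourcing this file launches nothing):
# byteps_env worker ip4 1 && bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_cross_barrier_byteps.py --model resnet50 --batch-size 64 --num-iters 500
```

With this, worker2 would be configured by `byteps_env worker ip4 1` and server2 by `byteps_env server ip2`, which makes it harder for one node to end up with a stale or mistyped value.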

After executing the commands, worker1 prints throughput, but worker2 hangs:
image

Finished:
image
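
When comparing the two workers, it can help to confirm that both really run with the environment listed above. The sketch below uses only standard `env`, `grep`, and `sort`; the function name `dump_byteps_env` is hypothetical. It prints the relevant variables in a stable order so the output from worker1 and worker2 can be compared with `diff`. It only rules out configuration drift and says nothing about the cause of the hang.

```shell
# Hypothetical helper: print the DMLC_* variables and NVIDIA_VISIBLE_DEVICES
# from the current shell, sorted so two dumps can be diffed line by line.
dump_byteps_env() {
    env | grep -E '^(DMLC_|NVIDIA_VISIBLE_DEVICES=)' | sort
}

# Example: run on each worker, save the output, then diff the two files.
# dump_byteps_env > "byteps_env_$(hostname).txt"
```

Between worker1 and worker2 as configured above, only `DMLC_WORKER_ID` and `DMLC_NODE_HOST` are expected to differ.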
