
Turning on async (BYTEPS_ENABLE_ASYNC) crashes the bps server #357

Open
ruipeterpan opened this issue Jan 18, 2021 · 25 comments · May be fixed by #359


@ruipeterpan

Describe the bug

Turning on asynchronous training (export BYTEPS_ENABLE_ASYNC=1) crashes the bps server (during SendPullResponse in byteps/server/server.cc)
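For context, the failed check at byteps/server/server.cc:52 ("Check failed: updates.merged.tensor init ... first", visible in the trace below) fires when a pull request reaches the server before the buffer for that key has been initialized by a push. A minimal sketch of that guard, with hypothetical names (a plain dict stands in for the server's per-key update buffers):

```python
# Minimal sketch of the failing check (hypothetical names; a dict stands in
# for the BytePS server's per-key update buffers).
updates = {}  # key -> merged buffer, created by the first push for that key

def handle_push(key, tensor):
    # the first push for a key initializes its merged buffer
    updates.setdefault(key, list(tensor))

def handle_pull(key):
    # in async mode the server answers pulls straight from the buffer; a pull
    # that races ahead of the first push hits the "init ... first" check
    if key not in updates:
        raise RuntimeError(f"Check failed: init {key} first")
    return updates[key]
```
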

Expected behavior

The expected behavior is for training to complete without errors, just as it does in synchronous mode.

Stack trace from the crashed server

This trace was produced with BYTEPS_ENABLE_GDB turned on, BYTEPS_LOG_LEVEL set to INFO, and PS_VERBOSE set to 2.
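For completeness, the debug settings above can be set in the server container before running bpslaunch (variable names are from this report; values shown are the ones used here):

```shell
# Debug settings used to capture the server-side trace below
export BYTEPS_ENABLE_GDB=1    # wrap the server process in gdb for a backtrace
export BYTEPS_LOG_LEVEL=INFO  # BytePS log verbosity
export PS_VERBOSE=2           # ps-lite message-level tracing
```
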

root@025575bc0d1646e18ff8810e88261f03000000:/# bpslaunch
BytePS launching server
Command: gdb -ex 'run' -ex 'bt' -batch --args python3.8 -c 'import byteps.server'

warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[2021-01-18 03:00:55.980567: I byteps/common/compressor/compressor_registry.cc:28] dithering_compressor compressor is registered
[2021-01-18 03:00:55.980615: I byteps/common/compressor/compressor_registry.cc:28] onebit_compressor compressor is registered
[2021-01-18 03:00:55.980635: I byteps/common/compressor/compressor_registry.cc:28] randomk_compressor compressor is registered
[2021-01-18 03:00:55.980650: I byteps/common/compressor/compressor_registry.cc:28] topk_compressor compressor is registered
[2021-01-18 03:00:55.980663: I byteps/common/compressor/compressor_registry.cc:28] vanilla_ef compressor is registered
[03:00:55] byteps/server/server.cc:419: BytePS server is enabled asynchronous training
[03:00:55] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[New Thread 0x7f0da1300700 (LWP 27)]
[03:00:55] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[03:00:55] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[New Thread 0x7f0da0aff700 (LWP 28)]
[New Thread 0x7f0da02fe700 (LWP 29)]
[New Thread 0x7f0d9fafd700 (LWP 30)]
[New Thread 0x7f0d9f2fc700 (LWP 31)]
[New Thread 0x7f0d9eafb700 (LWP 32)]
[New Thread 0x7f0d9e2fa700 (LWP 33)]
[[03:00:55] src/van.cc:389: Bind to role=server, ip=10.0.0.4, port=33261, is_recovery=0
03:00:55] src/./zmq_van.h:287: Start ZMQ recv thread
[New Thread 0x7f0d9daf9700 (LWP 34)]
[03:00:55] src/van.cc:478: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.0.0.4, port=33261, is_recovery=0 } }. THIS IS NOT DATA MSG!
[03:01:26] src/van.cc:503: 1 => 2147483647. Meta: request=0, timestamp=4, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=10.0.0.8, port=60029, is_recovery=0 role=worker, id=11, ip=10.0.0.6, port=39183, is_recovery=0 role=worker, id=13, ip=10.0.0.5, port=60011, is_recovery=0 role=server, id=8, ip=10.0.0.4, port=33261, is_recovery=0 role=worker, id=15, ip=10.0.0.4, port=52559, is_recovery=0 role=scheduler, id=1, ip=10.0.0.4, port=12345, is_recovery=0 } }. THIS IS NOT DATA MSG!
[03:01:26] src/van.cc:338: S[8] is connected to others
[New Thread 0x7f0d9d2f8700 (LWP 35)]
[03:01:26] src/van.cc:478: ? => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[Thread 0x7f0d9d2f8700 (LWP 35) exited]
[03:01:27] src/van.cc:503: 1 => 8. Meta: request=0, timestamp=10, control={ cmd=BARRIER, barrier_group=0 }. THIS IS NOT DATA MSG!
[03:01:27] src/van.cc:478: ? => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[03:01:27] src/van.cc:503: 15 => 8. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140364908230080, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:503: 9 => 8. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=139792355468736, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:503: 11 => 8. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140285773407680, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:503: 13 => 8. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=139663003149760, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[New Thread 0x7f0d9caf7700 (LWP 36)]
[New Thread 0x7f0d7ffff700 (LWP 37)]
[New Thread 0x7f0d7f7fe700 (LWP 38)]
[03:01:27] src/van.cc:478: ? => 15. Meta: request=0, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140364908230080
[03:01:27] src/van.cc:478: ? => 9. Meta: request=0, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=139792355468736
[03:01:27] src/van.cc:478: ? => 11. Meta: request=0, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140285773407680
[03:01:27] src/van.cc:478: ? => 13. Meta: request=0, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=139663003149760
[03:01:27] src/van.cc:503: 11 => 8. Meta: request=1, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:503: 15 => 8. Meta: request=1, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:478: ? => 11. Meta: request=0, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0
[03:01:27] src/van.cc:478: ? => 15. Meta: request=0, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0
[03:01:27] src/van.cc:503: 13 => 8. Meta: request=1, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:503: 9 => 8. Meta: request=1, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:478: ? => 13. Meta: request=0, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0
[03:01:27] src/van.cc:478: ? => 9. Meta: request=0, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0
[03:01:27] src/van.cc:503: 15 => 8. Meta: request=1, timestamp=2, app_id=0, customer_id=0, simple_app=0, push=0, head=0, key=0, data_type={ UINT64 OTHER } Body: data_size=8 data_size=0
[03:01:27] src/van.cc:503: 11 => 8. Meta: request=1, timestamp=2, app_id=0, customer_id=0, simple_app=0, push=0, head=0, key=0, data_type={ UINT64 OTHER } Body: data_size=8 data_size=0
[03:01:27] 3rdparty/ps-lite/include/dmlc/logging.h:276: [03:01:27] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 524288 first

Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x229fc) [0x7f0da25bc9fc]
[bt] (1) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x22e3d) [0x7f0da25bce3d]
[bt] (2) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7f0da25b6472]
[bt] (3) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x8f2) [0x7f0da25b8562]
[bt] (4) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x24b8e) [0x7f0da25beb8e]
[bt] (5) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x41d89) [0x7f0da25dbd89]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f0da1c9a6df]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f0da39046db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f0da3c3da3f]


[terminate called after throwing an instance of '03:01:27] src/van.cc:dmlc::Error'
503:   what():  13 => 8. Meta: request=1, timestamp=2, app_id=0, customer_id=0, simple_app=0, push=0, head=0, key=0, data_type={ UINT64 OTHER } Body: data_size=8 data_size=0
[03:01:27] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 524288 first

Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x229fc) [0x7f0da25bc9fc]
[bt] (1) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x22e3d) [0x7f0da25bce3d]
[bt] (2) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7f0da25b6472]
[bt] (3) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x8f2) [0x7f0da25b8562]
[bt] (4) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x24b8e) [0x7f0da25beb8e]
[bt] (5) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x41d89) [0x7f0da25dbd89]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f0da1c9a6df]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f0da39046db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f0da3c3da3f]


[03:01:27] src/van.cc:503: 9 => 8. Meta: request=1, timestamp=2, app_id=0, customer_id=0, simple_app=0, push=0, head=0, key=0, data_type={ UINT64 OTHER } Body: data_size=8 data_size=0

Thread 2 "python3.8" received signal SIGABRT, Aborted.
[Switching to Thread 0x7f0da1300700 (LWP 27)]
0x00007f0da3b5af47 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x00007f0da3b5af47 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f0da3b5c8b1 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f0da1c69957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f0da1c6fae6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f0da1c6fb21 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f0da1c6fd54 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f0da25bcf27 in dmlc::LogMessageFatal::~LogMessageFatal (this=0x7f0da12ff310, __in_chrg=<optimized out>) at 3rdparty/ps-lite/include/dmlc/logging.h:277
#7  0x00007f0da25b6472 in byteps::server::SendPullResponse (type=..., type@entry=..., key=<optimized out>, req_meta=..., server=server@entry=0x2018b60) at byteps/server/server.cc:52
#8  0x00007f0da25b8562 in byteps::server::BytePSHandler (req_meta=..., req_data=..., server=0x2018b60) at byteps/server/server.cc:376
#9  0x00007f0da25beb8e in std::function<void (ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)>::operator()(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*) const (__args#2=<optimized out>, __args#1=..., __args#0=..., this=0x2018dc8) at /usr/include/c++/7/bits/std_function.h:706
#10 ps::KVServer<char>::Process (this=0x2018b60, msg=...) at 3rdparty/ps-lite/include/ps/kv_app.h:516
#11 0x00007f0da25dbd89 in std::function<void (ps::Message const&)>::operator()(ps::Message const&) const (__args#0=..., this=0x2018e88) at /usr/include/c++/7/bits/std_function.h:706
#12 ps::Customer::Receiving (this=0x2018e80) at src/customer.cc:64
#13 0x00007f0da1c9a6df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007f0da39046db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007f0da3c3da3f in clone () from /lib/x86_64-linux-gnu/libc.so.6
root@025575bc0d1646e18ff8810e88261f03000000:/#

To Reproduce

Steps to reproduce the behavior:

  1. Build the Docker image from a local Dockerfile. The Dockerfile is modified from the one provided in the repo. The BytePS used in the Dockerfile is vanilla BytePS (69a3d), with only some extra print statements added.
  2. Set up a 4-node cluster. In my case, I used Azure ML (STANDARD_NC24, 4 x NVIDIA Tesla K80).
  3. Start 4 worker containers (one on each node), plus 1 server container and 1 scheduler container on node 0. Then run the training scripts. These steps are exactly the same as those in [Question] Does replacing torch.distributed.all_reduce with BytePS impact the training curve? #356, except that the async training knob (BYTEPS_ENABLE_ASYNC) is turned on. The exact steps are below.

building the docker image

# cd into the directory with Dockerfile and do
export BYTEPS_WITHOUT_MXNET=1; export BYTEPS_WITHOUT_TENSORFLOW=1; sudo docker build -t bytepsimage/pytorch . -f Dockerfile --build-arg FRAMEWORK=pytorch

byteps set up

# scheduler on node 0
sudo docker run -dt --net=host --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=scheduler --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=1 --name scheduler bytepsimage/pytorch

# server on node 0
sudo docker run -dt --net=host --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=server --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=1 --name server bytepsimage/pytorch

# 4 workers on node 0-3
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=0 --env CUDA_VISIBLE_DEVICES=0 --env DMLC_WORKER_ID=0 --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=worker --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=1 --name worker0 bytepsimage/pytorch
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=1 --env CUDA_VISIBLE_DEVICES=0 --env DMLC_WORKER_ID=1 --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=worker --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=1 --name worker1 bytepsimage/pytorch
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=2 --env CUDA_VISIBLE_DEVICES=0 --env DMLC_WORKER_ID=2 --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=worker --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=1 --name worker2 bytepsimage/pytorch
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=3 --env CUDA_VISIBLE_DEVICES=0 --env DMLC_WORKER_ID=3 --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=worker --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=1 --name worker3 bytepsimage/pytorch

# for each node, go inside the container (server, scheduler, worker 0, ..., worker 3)
sudo docker exec -it {replace_me_with_name_of_container} bash

# use bpslaunch to start the server/scheduler. For the workers, run the corresponding command:
bpslaunch python3.8 bps_issue.py --bps -pindex 0
bpslaunch python3.8 bps_issue.py --bps -pindex 1
bpslaunch python3.8 bps_issue.py --bps -pindex 2
bpslaunch python3.8 bps_issue.py --bps -pindex 3

Environment (please complete the following information):

  • OS: Linux 4.15.0-1100-azure x86_64
  • GCC version: Ubuntu 5.4.0-6ubuntu1~16.04.12
  • CUDA and NCCL version: CUDA 11.0, NCCL 2.7.8-1
  • Framework (TF, PyTorch, MXNet): PyTorch

A few other things

  • Thank you in advance for your time to look at this issue, and thank you for open-sourcing this and being so responsive in the issues section!
@ymjiang
Member

ymjiang commented Jan 18, 2021

For now, the asynchronous mode for PyTorch is supported only when you use the DistributedOptimizer approach, like this example. Your code now uses the DDP wrapper, for which we haven't implemented the async-mode yet.

@ruipeterpan
Author

Hey @ymjiang, thanks for the info! However, after switching to benchmark_byteps.py, the issue is still there.

FYI:

  • Here's the updated Dockerfile; the only differences from the official one are the cuDNN version (8.0.5.39-1+cuda11.0) and NCCL version (2.7.8-1+cuda11.0)
  • I used bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 1000000 on the workers
  • The stack trace from the server:
root@536f767106da4e48b0f29957b21b64da000000:/# bpslaunch
BytePS launching server
Command: python3 -c 'import byteps.server'

[06:22:03] byteps/server/server.cc:419: BytePS server is enabled asynchronous training
[06:22:03] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[06:22:03] src/postoffice.cc:25: Creating Van: zmq
[06:22:03] src/./zmq_van.h:299: Start ZMQ recv thread
[06:22:15] 3rdparty/ps-lite/include/dmlc/logging.h:276: [06:22:15] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first

Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2299c) [0x7fe2ba4d699c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ddd) [0x7fe2ba4d6ddd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7fe2ba4d01b2]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x912) [0x7fe2ba4d22d2]
[bt] (4) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x24b9e) [0x7fe2ba4d8b9e]
[bt] (5) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x41781) [0x7fe2ba4f5781]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7fe2b9bb46df]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fe2bb7e56db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fe2bbb1ea3f]


terminate called after throwing an instance of 'dmlc::Error'
  what():  [06:22:15] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first

Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2299c) [0x7fe2ba4d699c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ddd) [0x7fe2ba4d6ddd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7fe2ba4d01b2]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x912) [0x7fe2ba4d22d2]
[bt] (4) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x24b9e) [0x7fe2ba4d8b9e]
[bt] (5) /usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x41781) [0x7fe2ba4f5781]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7fe2b9bb46df]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fe2bb7e56db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fe2bbb1ea3f]


Aborted (core dumped)
Traceback (most recent call last):
  File "/usr/local/bin/bpslaunch", line 4, in <module>
    __import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 658, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1438, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 220, in <module>
    launch_bps()
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 216, in launch_bps
    stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 -c 'import byteps.server'' returned non-zero exit status 134.

@jasperzhong
Contributor

jasperzhong commented Jan 18, 2021

This error indicates that you did not kill the previous bpslaunch process. @ruipeterpan

pkill bpslaunch; pkill python3 

@ruipeterpan
Author

@vycezhong Thanks for your help. I double-checked to make sure no bps-related processes are alive (both inside and outside of all containers) before launching the server, yet it still crashes. Am I doing something dumb?
(screenshot attached)

@jasperzhong
Contributor

@ruipeterpan Did you launch any worker in your example above? The server should do nothing but wait when there is no worker.

@ruipeterpan
Author

ruipeterpan commented Jan 19, 2021

@vycezhong If only the server gets launched, it starts the ZMQ recv thread and waits without error. As soon as the workers are launched, the server crashes.

@jasperzhong
Contributor

OK I can reproduce it. I will look at it.

@ymjiang
Member

ymjiang commented Jan 19, 2021

@vycezhong Could this problem be related to #225?

My experience is that v0.2.4 worked well (e.g., see #271).

@jasperzhong
Contributor

> @vycezhong Could this problem be related to #225?
>
> My experience is that v0.2.4 worked well (e.g., see #271).

Yes. It is because I used update_buf for pulling. I will fix it.

@jasperzhong jasperzhong linked a pull request Jan 19, 2021 that will close this issue
@ymjiang
Member

ymjiang commented Jan 20, 2021

@ruipeterpan Could you please test if #359 fixes your issue?

@ruipeterpan
Author

ruipeterpan commented Jan 20, 2021

@vycezhong thanks for the fix! The server-crashing issue is resolved by #359, but I'm seeing some weird behavior for the training loss curve after applying the changes in the PR. I'll spend some time double-checking to make sure it's not a problem on my end.
Screen Shot 2021-01-20 at 10 51 44 AM

@jasperzhong
Contributor

@ruipeterpan You also need to enable async for workers.

@ruipeterpan
Author

> @ruipeterpan You also need to enable async for workers.

I had already toggled BYTEPS_ENABLE_ASYNC for all workers, servers & the scheduler for both async mode and sync mode.

@jasperzhong
Contributor

jasperzhong commented Jan 21, 2021

@ymjiang I think it should be Parameter here? AsyncParam in servers will be initialized with random values.

handle = byteps_push_pull(p, average=False, name="AsyncParam."+name)

@ymjiang
Member

ymjiang commented Jan 21, 2021

> @ymjiang I think it should be Parameter here? AsyncParam in servers will be initialized with random values.
>
> handle = byteps_push_pull(p, average=False, name="AsyncParam."+name)

I do not get it. The stored buffer will be initialized with the first incoming recv, which contains the value of the parameters.

bps_reducer_->copy(stored->tensor, recved,

@ymjiang
Member

ymjiang commented Jan 21, 2021

@ruipeterpan Would you test v0.2.4? My loss curve with v0.2.4 seems fine (it is at least decreasing).

@jasperzhong
Contributor

The first incoming recv should be random values.

char *data = const_cast<char *>(static_cast<const char *>(context.cpubuff));

@ruipeterpan
Author

@ymjiang Here's the loss curve I got for both sync and async using v0.2.4 (809ef20)

Screen Shot 2021-01-21 at 10 10 53 PM

@jasperzhong
Contributor

@ruipeterpan Please try this commit. 7ac1dc7

@ruipeterpan
Author

ruipeterpan commented Jan 22, 2021

@vycezhong Here's what I got using 7ac1dc7 and the original scripts (bps_issue.py) I provided:

Screen Shot 2021-01-21 at 10 57 17 PM

Then I commented out a metric_average() on the training loss after each epoch (this part), and here's what I got:

Screen Shot 2021-01-21 at 11 04 35 PM

Let me know if I can help with some more tests and I'll respond ASAP. Thanks for the help!

@ymjiang
Member

ymjiang commented Jan 22, 2021

@ruipeterpan I think you may need to reduce the learning rate for the async-mode.

I am not sure what value is appropriate, but could you try lr/N, where N is the number of workers?
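The suggested scaling can be written down directly as a hypothetical helper (the numbers in this thread: a default lr of 0.05 and 4 workers):

```python
# Hypothetical helper for the lr/N suggestion above: start async training
# from the synchronous learning rate divided by the number of workers N.
def async_lr(base_lr: float, num_workers: int) -> float:
    assert num_workers > 0
    return base_lr / num_workers

# e.g. the default lr of 0.05 with 4 workers gives 0.0125
```
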

@ruipeterpan
Author

@ymjiang Here's what I got using 7ac1dc7. The default lr is 0.05, and the loss curve was still going up after setting the lr to 0.0125. I also tried out some other learning rates; in general, the larger the learning rate, the faster the loss curve goes up.

Screen Shot 2021-01-22 at 1 54 30 AM

@jasperzhong
Contributor

jasperzhong commented Jan 23, 2021

@ymjiang
It is because parameter broadcasting also becomes asynchronous. The buffer is initialized with random values as shown in the figure below.
(screenshot attached)

I suggest removing the copy and initializing the buffer with zeros.

bps_reducer_->copy(stored->tensor, recved,
len); // we may not need this copy

memset(stored->tensor, 0, stored->len);

But this does not completely solve the problem. Parameter broadcasting should be synchronous; currently we rely on the time delay of non-root workers, together with p.fill_(0), to get the right result.

if rank() != root_rank:
    p.fill_(0)
# Remember to disable averaging because we are doing broadcast
if name:
    handle = byteps_push_pull(p, average=False, name=prefix+name)
else:
    handle = byteps_push_pull(p, average=False)
synchronize(handle)
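The quoted snippet works because push_pull sums tensors across workers, so zero-filling every non-root copy leaves exactly the root's values. A minimal sketch under that assumption (lists stand in for tensors; push_pull_sum is a hypothetical stand-in):

```python
# Hypothetical stand-in for a summing push_pull across workers
def push_pull_sum(worker_tensors):
    # elementwise sum of one tensor per worker
    return [sum(vals) for vals in zip(*worker_tensors)]

root = [0.5, -1.0, 2.0]          # root worker's initialized parameter
others = [[0.0, 0.0, 0.0]] * 3   # non-root workers run p.fill_(0) first
result = push_pull_sum([root] + others)
assert result == root            # every worker receives the root's values
```
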

@ruipeterpan Please try this commit. 18699f8

@ruipeterpan
Author

@vycezhong Here's what I got using 18699f8 with 4 workers + 1 server. I don't know if this is related, but I should note that in the first epoch of the first run, worker 0 got a loss of ~9 while all other workers got ~2.3. In subsequent runs of the scripts, this issue was gone; the following screenshots are from those subsequent runs.

Screen Shot 2021-01-23 at 12 20 59 PM

Here's what I got using 18699f8 with 4 workers + 4 servers.

Screen Shot 2021-01-23 at 12 37 18 PM

Thank you!

@tonyzhao-jt

tonyzhao-jt commented Jun 22, 2022

Hi, I encountered the same problem in the current version of BytePS.

[05:05:15] byteps/server/server.cc:419: BytePS server is enabled asynchronous training
[05:05:15] byteps/server/server.cc:430: BytePS server engine uses 8 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[05:05:15] src/postoffice.cc:25: Creating Van: zmq
[05:05:15] src/./zmq_van.h:66: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[05:05:15] src/./zmq_van.h:71: BYTEPS_ZMQ_NTHREADS set to 4
[05:05:15] src/van.cc:441: Bind to [[role=server, ip=172.31.41.74, port=38517, is_recovery=0, aux_id=-1]
05:05:15] src/./zmq_van.h:299: Start ZMQ recv thread
[05:05:29] src/van.cc:387: S[8] is connected to others
[05:05:29] 3rdparty/ps-lite/include/dmlc/logging.h:276: [05:05:29] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first

Stack trace returned 9 entries:
[bt] (0) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2268c) [0x7fa77e84d68c]
[bt] (1) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x22acd) [0x7fa77e84dacd]
[bt] (2) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7fa77e846ea2]
[bt] (3) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x912) [0x7fa77e848fc2]
[bt] (4) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2472e) [0x7fa77e84f72e]
[bt] (5) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x41491) [0x7fa77e86c491]
[bt] (6) /opt/conda/lib/libstdc++.so.6(+0xc9039) [0x7fa77e77f039]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fa77ee116db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa77eb3a61f]


terminate called after throwing an instance of 'dmlc::Error'
  what():  [05:05:29] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 10551296 first

Stack trace returned 9 entries:
[bt] (0) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2268c) [0x7fa77e84d68c]
[bt] (1) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x22acd) [0x7fa77e84dacd]
[bt] (2) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7fa77e846ea2]
[bt] (3) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x912) [0x7fa77e848fc2]
[bt] (4) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x2472e) [0x7fa77e84f72e]
[bt] (5) /opt/conda/lib/python3.7/site-packages/byteps/server/c_lib.cpython-37m-x86_64-linux-gnu.so(+0x41491) [0x7fa77e86c491]
[bt] (6) /opt/conda/lib/libstdc++.so.6(+0xc9039) [0x7fa77e77f039]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fa77ee116db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa77eb3a61f]


Aborted (core dumped)
Traceback (most recent call last):
  File "/opt/conda/bin/bpslaunch", line 253, in <module>
    launch_bps()
  File "/opt/conda/bin/bpslaunch", line 249, in launch_bps
    stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 -c 'import byteps.server'' returned non-zero exit status 134.

I also used the example provided in the repo (benchmark_byteps.py).
FYI, here is the bps setup.
Scheduler:

export DMLC_NUM_WORKER=2
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=<my ip>
export DMLC_PS_ROOT_PORT=1234  # the scheduler port
export PS_VERBOSE=1
export BYTEPS_ENABLE_ASYNC=1
bpslaunch

Server

export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=<my ip>
export DMLC_PS_ROOT_PORT=1234  # the scheduler port
export BYTEPS_SERVER_ENGINE_THREAD=8
export PS_VERBOSE=1
export BYTEPS_ENABLE_ASYNC=1
bpslaunch

Worker 0

export NVIDIA_VISIBLE_DEVICES=0,1,2,3
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=<my ip>
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
export PS_VERBOSE=1
export BYTEPS_ENABLE_ASYNC=1
bpslaunch python3 benchmark_byteps.py

Worker1

export NVIDIA_VISIBLE_DEVICES=0,1,2,3
export DMLC_WORKER_ID=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=<my ip>
export DMLC_PS_ROOT_PORT=1234 # the scheduler port
export PS_VERBOSE=1
export BYTEPS_ENABLE_ASYNC=1
bpslaunch python3 benchmark_byteps.py

I can run the synchronous version without this bug, and other synchronous training code is also fine. Is there a new bug related to asynchronous training?
Also worth mentioning: there is currently a bug in the sample code.

_init__.py", line 398, in broadcast_optimizer_state
76: Stopping W[9]
    p = torch.Tensor([p]).cuda()
TypeError: must be real number, not NoneType
[05:17:18] src/van.cc:104: W[9] is stopped
    p = torch.Tensor([p]).cuda()
TypeError: must be real number, not NoneType
[05:17:18] src/./zmq_van.h:81: W all threads joined and destroyed
Traceback (most recent call last):
  File "/opt/conda/bin/bpslaunch", line 253, in <module>
    launch_bps()
  File "/opt/conda/bin/bpslaunch", line 239, in launch_bps
    t[i].join()
  File "/opt/conda/bin/bpslaunch", line 34, in join
    raise self.exc
  File "/opt/conda/bin/bpslaunch", line 27, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/opt/conda/bin/bpslaunch", line 193, in worker
    stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
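The TypeError interleaved in the trace above comes from torch.Tensor([p]) being handed p = None (an optimizer state entry that was never populated). A hedged sketch of a possible guard, with wrap_scalar as a hypothetical name and a list standing in for torch.Tensor:

```python
def wrap_scalar(p):
    # broadcast_optimizer_state wraps scalar state as torch.Tensor([p]).cuda();
    # that raises "must be real number, not NoneType" when p is None, so skip
    # entries that were never initialized instead of wrapping them
    if p is None:
        return None      # caller should skip broadcasting this entry
    return [float(p)]    # stand-in for torch.Tensor([p])
```
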

Thanks for your attention.
