
Giving the error munmap_chunk(): invalid pointer in BytePS when DMLC_NUM_WORKER changed from 1 to 2 #398

Open
udaykiran009 opened this issue May 30, 2021 · 1 comment

Comments

@udaykiran009

Hello,
I was following the Step-by-Step Tutorial and built BytePS from source.

Single-machine training with DMLC_NUM_WORKER=1 and multiple GPUs (up to 8) runs fine, but distributed training breaks when the only change I make is

DMLC_NUM_WORKER=1
to
DMLC_NUM_WORKER=2

I launched everything on a single node; this node has 8 GPUs.

The run aborts with the following error:
src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
munmap_chunk(): invalid pointer
Aborted (core dumped)

I launched the processes in the following order on the same node (a sketch of this ordering follows):
Worker-0 -> Server -> Worker-1 -> Scheduler
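
A minimal sketch of that launch order as one wrapper script, assuming each export block below is saved under the file name given in the comments (the file names are placeholders of mine, not part of the actual setup):

# hypothetical wrapper reproducing the launch order described above
bash worker0.sh &      # Worker-0 export block
bash server.sh &       # Server export block
bash worker1.sh &      # Worker-1 export block
bash scheduler.sh &    # Scheduler export block
wait                   # keep the shell alive until all four processes exit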

Bash script for launching Worker-0:

export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_WORKER_ID=0
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error 
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch python3 /byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 100

Bash script for launching Worker-1:

export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export NVIDIA_VISIBLE_DEVICES=0
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_WORKER_ID=1
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error 
export DMLC_ROLE=worker
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch python3 /byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 100

Bash script for launching the Server:

export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error 
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch

Bash script for launching the Scheduler:

export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=ibverbs
export DMLC_NUM_WORKER=2 #if this number is 1 then there is no error 
export DMLC_ROLE=scheduler
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ib0
export DMLC_PS_ROOT_URI=A.B.C.D # scheduler's RDMA interface IP
export DMLC_PS_ROOT_PORT=1234
bpslaunch

Environment:

  • OS: Ubuntu 18.04.5 LTS
  • GCC version: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
  • CUDA and NCCL version: CUDA 11.0 & NCCL 2.7.8
  • Framework (TF, PyTorch, MXNet): PyTorch

Can you please help me solve this error? Thank you.

@ymjiang
Member

ymjiang commented May 31, 2021

Can you use gdb and paste the backtrace here?
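
A minimal sketch of how such a backtrace could be captured, assuming core dumps are enabled and the abort happens in the worker's python3 process (the core file name and location are assumptions and depend on the system's core_pattern):

ulimit -c unlimited     # allow the aborted process to write a core dump
bpslaunch python3 /byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 100
# after the crash, open the core file with gdb and print the backtrace
gdb python3 core        # core file name depends on /proc/sys/kernel/core_pattern
(gdb) bt                # paste this output into the issue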
