RDMA_CM_EVENT_ADDR_ERROR raised when running distributed training with PyTorch #394

Open
anj-s opened this issue Apr 30, 2021 · 0 comments

Comments


anj-s commented Apr 30, 2021

Describe the bug
I am unable to get distributed training running with the PyTorch backend; it consistently fails with RDMA_CM_EVENT_ADDR_ERROR. Can someone take a look and let me know if I am missing something?

Run setup: 2 nodes
node 0: worker 0
node 1: worker 1, server, scheduler

scheduler_hostname = IP of the RDMA interface

The ib_write_bw perf test works.
Single-node training works.
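
For reference, a minimal sketch of one way to verify that scheduler_hostname is the IPv4 address bound to the RDMA interface (Linux-only, via the SIOCGIFADDR ioctl). The interface name "front0" is just the DMLC_INTERFACE value used below and may not be the ibverbs-capable device, which is part of my question:

import fcntl
import socket
import struct

def interface_ipv4(ifname):
    # Return the IPv4 address currently bound to ifname (SIOCGIFADDR ioctl).
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    packed = fcntl.ioctl(
        s.fileno(),
        0x8915,  # SIOCGIFADDR
        struct.pack("256s", ifname[:15].encode()),
    )
    return socket.inet_ntoa(packed[20:24])

scheduler_hostname = interface_ipv4("front0")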

To Reproduce
General env vars that are set on workers, scheduler and server
os.environ["DMLC_ENABLE_RDMA"] = "ibverbs"
os.environ["DMLC_INTERFACE"] = "front0"
os.environ["ENABLE_RDMA_LOG"] = "1"
os.environ["PS_VERBOSE"] = "1"
os.environ["BYTEPS_LOG_LEVEL"] = "TRACE"
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_SHM_DISABLE"] = "1"
os.environ["BYTEPS_ENABLE_GDB"] = "0"
os.environ["BYTEPS_RDMA_RX_DEPTH"]="128"
os.environ["BYTEPS_RDMA_START_DEPTH"]="16"

server env vars
os.environ["DMLC_ROLE"] = "server"
os.environ["DMLC_NUM_WORKER"] = 2
os.environ["DMLC_NUM_SERVER"] = 1
os.environ["DMLC_PS_ROOT_URI"] = scheduler_hostname
os.environ["DMLC_PS_ROOT_PORT"] = SCHEDULER_PORT

scheduler env vars
os.environ["DMLC_ROLE"] = "scheduler"
os.environ["DMLC_NUM_WORKER"] = "2"
os.environ["DMLC_NUM_SERVER"] = "1"
os.environ["DMLC_PS_ROOT_URI"] = scheduler_hostname
os.environ["DMLC_PS_ROOT_PORT"] = SCHEDULER_PORT

worker env vars
os.environ["DMLC_ROLE"] = "worker"
os.environ["DMLC_WORKER_ID"] = str(worker_id)
os.environ["DMLC_NUM_WORKER"] = "2"
os.environ["DMLC_NUM_SERVER"] = "1"
os.environ["DMLC_PS_ROOT_URI"] = scheduler_hostname
os.environ["DMLC_PS_ROOT_PORT"] = SCHEDULER_PORT
os.environ["BYTEPS_LOCAL_RANK"] = "0"
os.environ["BYTEPS_LOCAL_SIZE"] = "1'

Expected behavior
Able to run:

command = "python /private/home/anj/.conda/envs/fairscale/bin/bpslaunch python
                       /private/home/anj/byteps_repro/byteps/example/pytorch/train_mnist_byteps.py"
subprocess.check_call(command,
                      stdout=sys.stdout, stderr=sys.stderr, shell=True)
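
Since node 1 hosts the worker, server and scheduler together, here is a rough sketch of how the three processes could be spawned with isolated per-role environments. An assumption on my part is that bpslaunch starts the parameter-server process itself when DMLC_ROLE is "server" or "scheduler"; the common DMLC_* variables above are assumed to already be in os.environ:

import os
import subprocess
import sys

BPSLAUNCH = "python /private/home/anj/.conda/envs/fairscale/bin/bpslaunch"
TRAIN = "python /private/home/anj/byteps_repro/byteps/example/pytorch/train_mnist_byteps.py"

def launch(role, worker_id=None):
    # Give each co-located role its own copy of the environment so that
    # one role's DMLC_* settings do not leak into another process.
    env = dict(os.environ, DMLC_ROLE=role)
    if role == "worker":
        env["DMLC_WORKER_ID"] = str(worker_id)
    cmd = BPSLAUNCH if role != "worker" else f"{BPSLAUNCH} {TRAIN}"
    return subprocess.Popen(cmd, shell=True, env=env,
                            stdout=sys.stdout, stderr=sys.stderr)

procs = [launch("scheduler"), launch("server"), launch("worker", worker_id=1)]
for p in procs:
    p.wait()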

stack trace: https://gist.github.com/anj-s/6c808731287e9a504cb63c6f8013fad0

Environment (please complete the following information):
OS: Ubuntu
GCC version: gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
CUDA and NCCL version:
CUDA: 11.0
NCCL: 2.7.8
Framework (TF, PyTorch, MXNet): PyTorch 1.8
