Using RDMA capable nodes #34

Open
msalvaris opened this issue Oct 3, 2019 · 3 comments

@msalvaris

Is there a reason for using Standard_NC24s_v3 rather than the RDMA capable Standard_NC24rs_v3?

@usuyama

usuyama commented Jan 2, 2020

I also noticed NCCL_IB_DISABLE (env variable) is set to 1 by the pretrain AML environment (or maybe by the Docker image)

NCCL_IB_DISABLE
The NCCL_IB_DISABLE variable disables the IB/RoCE transport that is to be used by NCCL. Instead, NCCL will fallback to using IP sockets.

https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html
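A minimal sketch of how one might check what the job actually sees, assuming the variable is read from the process environment as described in the quoted NCCL documentation. The helper name `nccl_ib_status` is hypothetical and not part of NCCL, PyTorch, or AzureML:

```python
import os


def nccl_ib_status(env=None):
    """Describe the IB/RoCE setting implied by NCCL_IB_DISABLE (hypothetical helper)."""
    env = os.environ if env is None else env
    value = env.get("NCCL_IB_DISABLE")
    if value is None:
        return "unset: IB/RoCE transport not disabled"
    if value == "1":
        return "1: IB/RoCE transport disabled, NCCL falls back to IP sockets"
    if value == "0":
        return "0: IB/RoCE transport not disabled"
    # Anything else is outside the documented 0/1 values, so just report it.
    return f"{value}: unexpected value"


if __name__ == "__main__":
    print("NCCL_IB_DISABLE ->", nccl_ib_status())
```

Running this inside the training container should show whether the value comes through as 1, regardless of whether the AML environment or the Docker image sets it.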

I wonder whether the authors hit any blocking issues using InfiniBand/RDMA. @aashna

@usuyama

usuyama commented Jan 2, 2020

When I tried the pretraining on ND24rs (RDMA/infiniband), I got the following error:

RuntimeError: NCCL error in: ... /torch/lib/c10d/ProcessGroupNCCL.cpp:290, unhandled system error

I think NCCL_IB_DISABLE should be set to 0 (or unset), but haven't tried yet.

@usuyama

usuyama commented Jan 8, 2020

After checking with AzureML folks, it turned out I have to use Intel MPI as the backend when I use nodes without SR-IOV support.

SR-IOV stands for “single root input/output virtualization” which optimizes sharing of PCI Express devices in a system with virtual machines. In Azure, SR-IOV for InfiniBand enables near bare-metal performance for any MPI library.

Accelerating Distributed Training in Azure Machine Learning service using SR-IOV

If you have access to NCv3 or NDv2, then you can take advantage of the faster GPU interconnect. SR-IOV support should come to NCv2 and NDv1 later in 2020.

Without SR-IOV, for NCCL, we need to set "NCCL_IB_DISABLE": "1" to disable InfiniBand on RDMA capable VMs (e.g., ND24rs), so that NCCL falls back to IP sockets.
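For reference, a rough sketch of applying this in a training script. Per the NCCL documentation quoted earlier in the thread, a value of 1 is what disables the IB/RoCE transport. The function name `configure_nccl_ib` and the `sriov_supported` flag are hypothetical. The sketch assumes you already know whether your VM size has SR-IOV support, and that the variable is set before the NCCL process group is initialized:

```python
import os


def configure_nccl_ib(sriov_supported, env=None):
    """Set NCCL_IB_DISABLE from a caller-supplied SR-IOV flag (hypothetical helper).

    Assumption from this thread: without SR-IOV, NCCL should not use
    InfiniBand, so the IB/RoCE transport is disabled with the value "1".
    """
    env = os.environ if env is None else env
    env["NCCL_IB_DISABLE"] = "0" if sriov_supported else "1"
    return env["NCCL_IB_DISABLE"]


if __name__ == "__main__":
    # Example: an RDMA capable VM without SR-IOV support (e.g., ND24rs).
    value = configure_nccl_ib(sriov_supported=False)
    print("NCCL_IB_DISABLE =", value)
    # Initialize the distributed backend only after this point.
```

The same key/value pair could instead be supplied through the job's environment variable settings, so that the training script itself stays unchanged.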
