
How can I identify level1 nvswitch and level2 nvswitch in NCCL #1286

Open
Ryan201802 opened this issue May 14, 2024 · 11 comments
Comments

@Ryan201802

The different NVSwitches are not visible to NCCL. Not for NVLink communication, not for NVLink SHARP. Traffic is spread on all switches transparently.

Originally posted by @sjeaugey in #1006 (comment)

I am confused about how to use two levels of NVSwitch to do an allreduce in NCCL.

@AddyLaddy
Collaborator

It's all handled in the NVSwitch HW and Fabric manager and is opaque to the NCCL and CUDA SW stack.

Also, there are no multi-level NVLink switch system products available from Nvidia currently.

@Ryan201802
Author

Thanks for your help. But @AddyLaddy, how should I understand "inter-node NVLink SHARP since 2.18" mentioned in #895 (comment)?
How are the nodes connected in that case?

@Ryan201802
Author

Ryan201802 commented May 15, 2024

Additionally, I found this architecture in the GH200 white paper. It has two levels of NVSwitch.
[image: two-level NVSwitch topology from the GH200 white paper]
In this architecture, can the second level do NVLink SHARP? If it can, how do I use it from NCCL?

@AddyLaddy
Collaborator

We have developed so-called Multi-Node NVLink (MNNVL) systems, and the first publicly available system will be called GB200 (sometimes referred to as NVL72).
It will have 72 Blackwell-generation GPUs connected to a single NVLink domain.

@AddyLaddy
Collaborator

The full features of NVLink will be available to all GPUs in such a system, including NVLink SHARP.
NCCL already supports NVLink SHARP on 8x H100 systems, so it's just the same but with a larger NVLink domain.
It's all opaque to NCCL; it just sees 72 GPUs, all accessible via NVLink.
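
For a sense of what that looks like from the application side, here is a minimal single-process sketch. It assumes one node with all GPUs visible to the process; the buffer size and device handling are illustrative, and forcing NCCL_ALGO=NVLS is only an experiment to make the algorithm choice visible, since NCCL normally picks NVLS on its own when the platform supports it.

```c
// Minimal single-process allreduce sketch. NCCL selects NVLS (NVLink SHARP)
// automatically on capable systems; NCCL_DEBUG/NCCL_ALGO below just make
// that choice visible or force it for experimentation.
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdlib.h>
#include <stdio.h>

#define MAX_GPUS 64

int main(void) {
    setenv("NCCL_DEBUG", "INFO", 1);  // log what NCCL sets up
    setenv("NCCL_ALGO",  "NVLS", 1);  // assumption: restrict to NVLink SHARP for the test

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > MAX_GPUS) ndev = MAX_GPUS;

    int devs[MAX_GPUS];
    ncclComm_t comms[MAX_GPUS];
    cudaStream_t streams[MAX_GPUS];
    float *sendbuf[MAX_GPUS], *recvbuf[MAX_GPUS];
    const size_t count = 1 << 20;     // 1M floats per GPU (illustrative)

    for (int i = 0; i < ndev; ++i) devs[i] = i;
    ncclCommInitAll(comms, ndev, devs);

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
        cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
        cudaMemset(sendbuf[i], 0, count * sizeof(float));
    }

    // One allreduce per GPU, grouped so NCCL launches them as a single collective.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("allreduce completed on %d GPUs\n", ndev);
    return 0;
}
```

With NCCL_DEBUG=INFO the log should indicate whether NVLS resources were set up; on a platform without NVLink SHARP support, forcing NCCL_ALGO=NVLS may leave NCCL with no usable algorithm, so it is only useful as an experiment.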

@Ryan201802
Author

Thanks, I understand.
Now I only have one question about the GH200 architecture. How does the second-level NVSwitch work, and how can I use its SHARP units?

@AddyLaddy
Collaborator

I don't believe we've announced the NVLink topology of GB200 yet.
But NVLink SHARP works in both single-level and two-level NVSwitch networks.
Again, it's all opaque to NCCL. It just works like any other NVLink-connected machine that is NVLink SHARP capable.

@hennry205

Hi @AddyLaddy, I have a question. How does the multimem.ld_reduce instruction in a kernel trigger the NVSwitch to perform the reduce operation? I don't see any direct control of the NVSwitch in the kernel. Thanks.

@Ryan201802
Author

Yes @AddyLaddy, but the topology I mentioned is GH200, not GB200.

@AddyLaddy
Collaborator

There are no publicly released Nvidia products that use multi-level NVSwitches currently. The first publicly available product will be GB200.
NVLink SHARP works with both single-level and multi-level NVLink fabrics.

@AddyLaddy
Collaborator

Hi @AddyLaddy, I have a question. How does the multimem.ld_reduce instruction in a kernel trigger the NVSwitch to perform the reduce operation? I don't see any direct control of the NVSwitch in the kernel. Thanks.

The NVLink SHARP implementation is based in HW and is configured by mapping special "Multicast"-enabled buffer addresses into the GPU virtual address space. See CUDA Multicast.
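
Concretely, the flow at the driver level looks roughly like the sketch below, assuming the CUDA 12.1+ multicast object API, a single process that owns all GPUs, and illustrative sizes (error checking omitted). Each GPU joins a multicast object, binds local device memory to it, and maps the multicast handle into its virtual address space; the resulting pointer is what a kernel can target with multimem.* instructions such as multimem.ld_reduce, and the reduction happens in the fabric rather than under explicit kernel control.

```c
// Sketch of setting up a multicast (NVLink SHARP capable) mapping with the
// CUDA driver API. Device count, sizes, and single-process layout are
// illustrative assumptions; error checking is omitted for brevity.
#include <cuda.h>
#include <stdio.h>

int main(void) {
    cuInit(0);

    int ndev = 8;                 // assumed number of GPUs in the NVLink domain
    size_t size = 1 << 22;        // 4 MiB payload, rounded up to granularity below

    // Describe the multicast object that all GPUs will share.
    CUmulticastObjectProp mcProp = {0};
    mcProp.numDevices  = ndev;
    mcProp.size        = size;
    mcProp.handleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

    size_t gran = 0;
    cuMulticastGetGranularity(&gran, &mcProp, CU_MULTICAST_GRANULARITY_RECOMMENDED);
    mcProp.size = ((size + gran - 1) / gran) * gran;

    CUmemGenericAllocationHandle mcHandle;
    cuMulticastCreate(&mcHandle, &mcProp);

    // Every participating GPU joins the multicast object.
    for (int d = 0; d < ndev; ++d) {
        CUdevice dev;
        cuDeviceGet(&dev, d);
        cuMulticastAddDevice(mcHandle, dev);
    }

    // Each GPU then binds local physical memory to the object and maps the
    // multicast handle into its own virtual address space.
    for (int d = 0; d < ndev; ++d) {
        CUdevice dev;  CUcontext ctx;
        cuDeviceGet(&dev, d);
        cuDevicePrimaryCtxRetain(&ctx, dev);
        cuCtxSetCurrent(ctx);

        CUmemAllocationProp prop = {0};
        prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id   = d;
        prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

        CUmemGenericAllocationHandle memHandle;
        cuMemCreate(&memHandle, mcProp.size, &prop, 0);
        cuMulticastBindMem(mcHandle, 0, memHandle, 0, mcProp.size, 0);

        CUdeviceptr mcPtr;
        cuMemAddressReserve(&mcPtr, mcProp.size, gran, 0, 0);
        cuMemMap(mcPtr, mcProp.size, 0, mcHandle, 0);

        CUmemAccessDesc access = {0};
        access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        access.location.id   = d;
        access.flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
        cuMemSetAccess(mcPtr, mcProp.size, &access, 1);

        // A kernel given mcPtr can now use inline PTX such as
        //   multimem.ld_reduce.relaxed.sys.global.add.f32  %val, [mcPtr];
        // (sm_90+) and receives the value reduced across all bound copies.
        printf("GPU %d multicast VA: %p\n", d, (void *)mcPtr);
    }
    return 0;
}
```

This is presumably the kind of setup NCCL performs internally when it enables its NVLS algorithm, which is why applications that just call NCCL never see these steps.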
