How can I identify level1 nvswitch and level2 nvswitch in NCCL #1286
Comments
It's all handled in the NVSwitch hardware and the Fabric Manager, and it is opaque to the NCCL and CUDA software stack. Also, there are no multi-level NVLink switch system products available from NVIDIA currently.
Thanks for your help. But @AddyLaddy, how should I understand "inter-node NVLink SHARP since 2.18" mentioned at #895 (comment)?
We have developed so-called Multi-Node NVLink (MNNVL) systems, and the first publicly available system will be called GB200 (sometimes referred to as NVL72).
The full features of NVLink will be available to all GPUs in such a system, including NVLink SHARP.
Thanks, I understand.
I don't believe we've announced the NVLink topology of GB200 yet. |
Hi @AddyLaddy, I have a question. How does the multimem.ld_reduce instruction in a kernel trigger the NVSwitch to perform the reduce operation? I don't see any direct control of the NVSwitch in the kernel. Thanks.
Yes @AddyLaddy, but the topology I mentioned is GH200, not GB200.
There are no publicly released NVIDIA products that use multi-level NVSwitches currently. The first publicly available product will be GB200.
The NVLink SHARP implementation is based in hardware and is configured by mapping special "multicast"-enabled buffer addresses into the GPU virtual address space. See CUDA Multicast.
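To make the comment above concrete, here is a hedged sketch of that flow using the CUDA 12 driver API's multicast object functions plus an inline-PTX `multimem.ld_reduce`. The device count, handle type, and omitted error handling are illustrative assumptions, and this only runs on a multicast-capable (e.g. Hopper-class, NVSwitch-connected) system, so treat it as a sketch rather than a drop-in implementation:

```cuda
// Sketch: bind memory to a multicast object, then reduce through it.
// Assumes CUDA 12+ and multicast-capable GPUs; error checks omitted.
#include <cuda.h>

// Host side: create a multicast object, add participating devices,
// bind each GPU's local allocation, and map the multicast handle
// into the GPUs' virtual address spaces (cuMemMap/cuMemSetAccess).
void setup_multicast(CUdevice dev, size_t size) {
    CUmulticastObjectProp prop = {};
    prop.numDevices  = 2;       // illustrative: two GPUs in the team
    prop.size        = size;
    prop.handleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

    CUmemGenericAllocationHandle mcHandle;
    cuMulticastCreate(&mcHandle, &prop);
    cuMulticastAddDevice(mcHandle, dev);   // repeat per participating GPU
    // ...cuMulticastBindMem(...) per GPU, then cuMemMap the multicast
    // handle into each GPU's VA space so kernels can address it.
}

// Device side: a load through the multicast-mapped address with
// multimem.ld_reduce asks the fabric to return the reduced value.
// There is no explicit "switch command" in the kernel; the special
// virtual address alone routes the operation through the NVSwitch.
__device__ float multimem_ld_reduce_add(const float *mc_ptr) {
    float v;
    asm volatile("multimem.ld_reduce.relaxed.sys.global.add.f32 %0, [%1];"
                 : "=f"(v) : "l"(mc_ptr) : "memory");
    return v;
}
```

This matches the answer to the question above: the kernel never controls the NVSwitch directly; the reduction is triggered purely by issuing memory operations against a virtual address that was mapped from a multicast object.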
Originally posted by @sjeaugey in #1006 (comment)
I am confused about how to use two levels of NVSwitch to do allreduce in NCCL.