Question: "NCCL needs all GPUs of a host to be part of a collective in other to reliably use NVLinks"? #1248

mkarrmann · 2024-04-08T05:02:16Z

This comment says "NCCL needs all GPUs of a host to be part of a collective in other to reliably use NVLinks".

First, I'm not sure whether "collective" is being used informally here, whether it means "communicator", or whether it means something else. Either way, if there's any truth to the statement, I'd like to understand it better. After a fair bit of searching, I haven't found anything else that suggests something along these lines.

Could I please get some clarity on whether there is any truth to this? Even if the statement isn't entirely true, is it more efficient to make all of the GPUs of a node part of a single communicator than to split them across multiple communicators?

Thank you!
