Is this description of NCCL_BUFFSIZE's role appropriate? #1252
Comments
Hello @sjeaugey @kwen2501, sorry for tagging you without permission. I was looking at the answers in the issues and thought you might be able to answer my question. Are there any additional features of NCCL_BUFFSIZE? Here's an image @sjeaugey had attached (#157 (comment), #353 (comment)). Pardon the rudeness. Best Regards
Like for any MPI implementation, send operations may block. If NCCL has enough buffering and has sent everything to the destination GPU, it may exit once the data is in flight, but this is not guaranteed and can change in the future, so you should not base your implementation on that assumption. For non-blocking behavior to be guaranteed, we'd need to provide something like an MPI_Bsend.
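The distinction above can be sketched with a toy model (plain Python; the class and method names here are illustrative, not NCCL or MPI APIs): an ordinary send may return early when transport buffering happens to absorb the message but would block when the buffer is full, whereas an MPI_Bsend-style call copies into a caller-attached buffer so it always completes locally, failing loudly if that buffer is exhausted.

```python
from collections import deque

class Channel:
    """Toy transport with limited internal buffering (illustrative, not NCCL)."""
    def __init__(self, buf_slots):
        self.buf = deque()
        self.buf_slots = buf_slots

    def send(self, msg):
        # May complete early: returns True once the message fits in the
        # in-flight buffer; returns False ("would block") when it is full.
        if len(self.buf) < self.buf_slots:
            self.buf.append(msg)
            return True
        return False

    def bsend(self, msg, user_buf, capacity):
        # MPI_Bsend-style: copy into a caller-attached buffer so the call
        # always completes locally; error out if the user buffer is exhausted.
        if len(user_buf) >= capacity:
            raise RuntimeError("attached buffer exhausted")
        user_buf.append(msg)

    def drain(self):
        # Receiver side: consume one buffered message, freeing a slot.
        return self.buf.popleft() if self.buf else None

ch = Channel(buf_slots=2)
print(ch.send("a"), ch.send("b"))  # → True True: both fit in flight
print(ch.send("c"))                # → False: buffer full, a real send would block
ch.drain()                         # receiver makes progress, frees a slot
print(ch.send("c"))                # → True: slot available again
```

This is exactly the "not guaranteed" part of the answer: whether `send` returns immediately depends on buffer occupancy at call time, which the caller does not control.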
I deeply appreciate your generous response even though I tagged you without permission. You mean that NCCL can behave in a non-blocking manner if it has enough buffering, correct? Can I ask a few more questions?

I have a distributed scenario where multiple collective communications can collide in a simple dumbbell topology (RoCEv2, every link is 40 Gbps). AllReduce and (AllGather, ReduceScatter) might collide, and I confirmed two overlapping NCCL kernels with Nsight. I thought this would cause congestion because DCQCN, the RDMA NIC's rate control algorithm, always starts sending data at the line rate. At first, I couldn't observe any congestion signals (e.g., PFC PAUSE, CNPs (congestion notification packets)). However, when I changed some NCCL environment variables, congestion could be observed.
Referring to some information about the protocols, I can understand why the LL protocol doesn't cause congestion. Here are my questions:

1a) Does NCCL change the algorithm (e.g., Ring, Tree) and protocol during runtime, or does it stay static with the optimal ALGO/PROTO probed before running?
2a) Larger …
2b) Does NCCL engage the NIC's rate control algorithm (i.e., DCQCN)? If not, how could …
2c) In #157 (comment), you mentioned that …

Sorry to bother you with so many questions. If you can share your knowledge, it would be a great help to me. Best Regards,
Hi @sjeaugey, any updates?
Hi All
Until now, I understood NCCL_BUFFSIZE's role to be setting the chunk size (#157 (comment), #353 (comment)).
However, Section 5.1 of https://www.usenix.org/system/files/atc23-choi.pdf describes that increasing NCCL_BUFFSIZE can make NCCL become non-blocking:

> NCCL's default blocking p2p communication can be easily changed to non-blocking communication by increasing the NCCL buffer size with the NCCL_BUFFSIZE environment variable. The NCCL buffer is used when communicating data between pairs of GPUs. A p2p send operation fills up the target GPU's buffer and the target GPU fetches data from the buffer in FIFO order for another send operation to fill the buffer. If the NCCL buffer is full, the send operation should wait until the buffer of the target GPU has free space. If the NCCL buffer has enough free space, a p2p send operation can complete without waiting for the p2p recv operation to be called on the target GPU. ENVPIPE makes sure that the NCCL buffer size is enough to handle all activations and gradients to be communicated in a non-blocking way to effectively use bubbles.
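For anyone wanting to reproduce this experiment: NCCL_BUFFSIZE is given in bytes and is read when NCCL initializes, so it must be in the environment before the first communicator is created. The 8 MiB value below is an arbitrary example for illustration, not a recommendation; check your NCCL version's documentation for the current default.

```python
import os

# NCCL_BUFFSIZE is read at NCCL init time, so set it before the first
# communicator is created (e.g., before torch.distributed.init_process_group).
# Value is in bytes; 8 MiB here is an arbitrary example, not a recommendation.
os.environ["NCCL_BUFFSIZE"] = str(8 * 1024 * 1024)

print(os.environ["NCCL_BUFFSIZE"])  # → 8388608
```

Equivalently, `NCCL_BUFFSIZE=8388608` can be exported in the shell before launching the job.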
Is this interpretation also correct?
Then how can the sender GPU recognize when the target GPU's buffer has free space, and wait until it does?
Best Regards
Taekyoung
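As a mental model only (NCCL's actual implementation may differ; all names here are hypothetical), this kind of flow control is commonly built from a ring of buffer slots plus a pair of monotonically increasing head/tail counters that the peers expose to each other: the sender may fill a slot only while its tail has not run a full buffer's worth ahead of the receiver's head, and otherwise it waits for the head to advance.

```python
from dataclasses import dataclass

@dataclass
class FifoState:
    """Illustrative flow-control state; field names are hypothetical, not NCCL's."""
    slots: int      # number of buffer slots (buffer size / slot size)
    tail: int = 0   # sender side: total slots filled so far
    head: int = 0   # receiver side: total slots consumed so far

def sender_can_post(st: FifoState) -> bool:
    # The sender reads the receiver's head (conceptually written back over
    # the interconnect) and may fill a slot only while tail - head < slots.
    return st.tail - st.head < st.slots

st = FifoState(slots=2)
for _ in range(2):
    assert sender_can_post(st)
    st.tail += 1                # fill a slot without waiting ("non-blocking")
assert not sender_can_post(st)  # buffer full: a real sender would wait here
st.head += 1                    # receiver consumes a slot, advances head
assert sender_can_post(st)      # sender observes the new head and proceeds
```

Under this model, "increasing NCCL_BUFFSIZE makes sends non-blocking" just means making `slots` large enough that `tail - head < slots` stays true for the whole message, so the wait branch is never taken.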