
Is this description of NCCL_BUFFSIZE's role appropriate? #1252

Open
taekyounghan opened this issue Apr 14, 2024 · 4 comments

Comments


taekyounghan commented Apr 14, 2024

Hi All

Until now, I understood NCCL_BUFFSIZE's role to be setting the chunk size. (#157 (comment), #353 (comment))

However, Section 5.1 of https://www.usenix.org/system/files/atc23-choi.pdf describes that increasing NCCL_BUFFSIZE can make NCCL non-blocking:

NCCL’s default blocking p2p communication can be easily changed to non-blocking communication by increasing the NCCL buffer size with NCCL_BUFFSIZE environment variable. NCCL buffer is used when communicating data between pairs of GPUs. P2p send operation fills up the target GPU’s buffer and the target GPU fetches data from the buffer in FIFO for another send operation to fill the buffer. If the NCCL buffer is full, send operation should wait until the buffer of target GPU has free space. If the NCCL buffer has enough free space, p2p send operation can complete without waiting for p2p recv operation to be called from target GPU. ENVPIPE makes sure that NCCL buffer size is enough to handle all activations and gradients to be communicated in a non-blocking way to effectively use bubbles.

Is this interpretation also correct?

Then how does the sender GPU know when the target GPU's buffer has free space, and wait for it?
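
For concreteness, here is a minimal sketch of the chunked-FIFO handshake the paper describes. This is illustrative only, not NCCL's actual code; the names fifo, head, tail, and NSLOTS are made up. The sender publishes chunks by advancing head and spins whenever head minus tail reaches the slot count, which is where a "blocking" send waits for the receiver to free space by advancing tail.

```c
/* Illustrative sketch of a chunked send/recv FIFO -- NOT NCCL's actual code.
 * The receiver advances `tail` after consuming a slot; the sender may only
 * write a new slot while fewer than NSLOTS chunks are outstanding, so it
 * spins (blocks) when the buffer is full. */
#include <stdint.h>
#include <string.h>

#define NSLOTS     8              /* buffer split into NSLOTS chunks        */
#define SLOT_BYTES (512 * 1024)   /* chunk size ~ buffer size / NSLOTS      */

struct fifo {
    volatile uint64_t head;       /* written by sender: chunks produced     */
    volatile uint64_t tail;       /* written by receiver: chunks consumed   */
    char slots[NSLOTS][SLOT_BYTES];
};

/* Sender side: copy one chunk into the ring, waiting while it is full. */
static void send_chunk(struct fifo *f, const void *src, size_t bytes)
{
    while (f->head - f->tail >= NSLOTS) {
        /* buffer full: this is exactly where a blocking send stalls until
         * the receiver frees a slot by advancing tail */
    }
    memcpy(f->slots[f->head % NSLOTS], src, bytes);
    __sync_synchronize();         /* make the data visible before publishing */
    f->head = f->head + 1;        /* publish the new chunk                    */
}
```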

Best Regards
Taekyoung

@taekyounghan taekyounghan changed the title Is the description of NCCL_BUFFSIZE's role appropriate? Is this description of NCCL_BUFFSIZE's role appropriate? Apr 14, 2024
@taekyounghan (Author)

Hello @sjeaugey @kwen2501, sorry for tagging you without permission.

I was looking at the answers in the issues and thought you guys might be able to answer my question.

Are there any additional effects of NCCL_BUFFSIZE, such as making communication non-blocking?

Here's an image @sjeaugey had attached:

[image from the linked comments]

(#157 (comment), #353 (comment))

Pardon the rudeness

Best Regards
Taekyoung

@sjeaugey (Member)

Like for any MPI implementation, send operations may block. If NCCL has enough buffering and has sent everything to the destination GPU, it may exit once the data is in flight, but this is not guaranteed and can change in the future, so you should not base your implementation on that assumption.

For non-blocking behavior to be guaranteed, we'd need to provide something like an MPI_Bsend.
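
For reference, a short sketch in plain MPI of what "something like MPI_Bsend" means: the message is copied into a user-attached buffer and the call returns without waiting for the receiver. This is standard MPI, not an existing NCCL API.

```c
/* Standard MPI illustration of a buffered send: MPI_Bsend copies the message
 * into a user-attached buffer and returns immediately, whereas MPI_Send is
 * allowed to block until the receiver is ready. NCCL has no equivalent. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, n = 1 << 20;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *data = malloc(n * sizeof(float));

    if (rank == 0) {
        /* Attach a buffer large enough for the message plus MPI overhead. */
        int bufsize = n * sizeof(float) + MPI_BSEND_OVERHEAD;
        void *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);

        /* Returns as soon as the data is copied into `buf`, whether or not
         * rank 1 has posted its receive yet. */
        MPI_Bsend(data, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);

        MPI_Buffer_detach(&buf, &bufsize);  /* waits until the data is gone */
        free(buf);
    } else if (rank == 1) {
        MPI_Recv(data, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(data);
    MPI_Finalize();
    return 0;
}
```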

taekyounghan (Author) commented Apr 16, 2024

> Like for any MPI implementation, send operations may block. If NCCL has enough buffering and has sent everything to the destination GPU, it may exit once the data is in flight, but this is not guaranteed and can change in the future, so you should not base your implementation on that assumption.
>
> For non-blocking behavior to be guaranteed, we'd need to provide something like an MPI_Bsend.

Hello @sjeaugey,

I deeply appreciate your generous response even though I tagged you without permission.

You mean that NCCL can behave in a non-blocking manner if NCCL_BUFFSIZE is large, but this is not guaranteed, right?

Can I ask a few more questions?

I have a distributed scenario where multiple collective communications can collide in a simple dumbbell topology (RoCEv2, every link is 40 Gbps).

[image: dumbbell topology]

AllReduce and (AllGather, ReduceScatter) might collide, and I confirmed two overlapping NCCL kernels with Nsight.

[image: Nsight trace showing overlapping NCCL kernels]

I thought this would cause congestion because DCQCN, the RDMA NIC's rate control algorithm, always starts sending data at the line rate.

At first, I couldn't observe any congestion signals (e.g., PFC PAUSE or CNPs (congestion notification packets)).

However, when I changed some NCCL environment variables, congestion could be observed:

| Try | NCCL_BUFFSIZE | NCCL_PROTO | PFC PAUSE or CNP appears |
| --- | --- | --- | --- |
| 1 | 4 MiB | not set | X |
| 2 | 8 MiB | not set | O |
| 3 | 16 MiB | not set | O |
| 4 | 8 MiB | LL | X |
| 5 | 8 MiB | LL128 | X |
| 6 | 8 MiB | Simple | O |
| 7 | 4 MiB | Simple | X |
| 8 | 16 MiB | LL128 | X |
| 9 | 64 MiB | LL128 | X |

(O = congestion signal observed, X = not observed)
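
For reproducibility, the trials above only change environment variables, which can be exported in the job script or, as far as I understand, set from the process itself before the first NCCL call, since NCCL reads them during communicator initialization. A minimal sketch, with values taken from one row of the table:

```c
/* Illustration only: the trials above were driven purely by environment
 * variables. They are normally exported in the job script, but can also be
 * set from the process as long as this happens before the first NCCL call.
 * The values below correspond to the 8 MiB / Simple row of the table. */
#include <stdlib.h>

void configure_nccl_before_init(void)
{
    setenv("NCCL_BUFFSIZE", "8388608", 1);  /* 8 MiB (default is 4194304)   */
    setenv("NCCL_PROTO",    "Simple",  1);  /* LL / LL128 / Simple          */
    /* ... then call ncclCommInitRank() / ncclCommInitAll() as usual ...    */
}
```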


Referring to some information about the protocols, I can understand why the LL protocol doesn't cause congestion.
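
For anyone reading along, here is a rough sketch of the line formats usually described for the LL and LL128 protocols (illustrative structs, not NCCL's headers): LL interleaves a 4-byte flag with every 4 bytes of payload, so only about half of the traffic is payload, while LL128 amortizes an 8-byte flag over a 128-byte line, giving roughly 95% efficiency.

```c
/* Illustrative structs only, not NCCL's actual headers.
 * LL:    every 4 bytes of payload travel with a 4-byte flag,
 *        so payload efficiency is capped around 50%.
 * LL128: one 8-byte flag per 128-byte line (120 bytes payload),
 *        so efficiency is roughly 95%.
 * The receiver polls the flag to detect that the line has arrived,
 * avoiding a separate completion/fence. */
#include <stdint.h>

struct ll_line {          /* LL: 8-byte line                    */
    uint32_t data;        /* 4 bytes of payload                 */
    uint32_t flag;        /* 4-byte flag polled by the receiver */
};

struct ll128_line {       /* LL128: 128-byte line               */
    uint64_t data[15];    /* 15 * 8 = 120 bytes of payload      */
    uint64_t flag;        /* single 8-byte flag per line        */
};
```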

Here are my questions:

1a) Does NCCL change the algorithm (e.g., Ring, Tree) and protocol at runtime, or does it stay static with the optimal ALGO/PROTO determined before running?

2a) A larger NCCL_BUFFSIZE tends to increase the amount of congestion. For example, a 4 MiB buffer size with Simple doesn't cause congestion, but 8 MiB with Simple does. How can I understand this? Can NCCL_BUFFSIZE affect bandwidth the way protocols do?

2b) Does NCCL engage the NIC's rate control algorithm (i.e., DCQCN)? If not, how can LL128 with 64 MiB avoid congestion at 95% bandwidth (close to Simple's 100%)? If more collectives exist on the same bottleneck link, won't LL128 also suffer from congestion?

2c) In #157 (comment), you mentioned that NCCL_BUFFSIZE determines the chunk size. Is that value different from NCCL_P2P_NET_CHUNKSIZE?

Sorry to bother you with so many questions.

Any knowledge you can share would be a great help to me.

Best Regards,
Taekyoung

@taekyounghan (Author)

> Like for any MPI implementation, send operations may block. If NCCL has enough buffering and has sent everything to the destination GPU, it may exit once the data is in flight, but this is not guaranteed and can change in the future, so you should not base your implementation on that assumption.
>
> For non-blocking behavior to be guaranteed, we'd need to provide something like an MPI_Bsend.

Hi @sjeaugey, any updates?
