
Is this description of NCCL_BUFFSIZE's role appropriate? #1252

Open
taekyounghan opened this issue Apr 14, 2024 · 4 comments

Comments


taekyounghan commented Apr 14, 2024

Hi All

Until now, I understood NCCL_BUFFSIZE's role to be setting the chunk size. (#157 (comment), #353 (comment))

However, Section 5.1 of https://www.usenix.org/system/files/atc23-choi.pdf describes that increasing NCCL_BUFFSIZE can make NCCL non-blocking:

NCCL’s default blocking p2p communication can be easily changed to non-blocking communication by increasing the NCCL buffer size with NCCL_BUFFSIZE environment variable. NCCL buffer is used when communicating data between pairs of GPUs. P2p send operation fills up the target GPU’s buffer and the target GPU fetches data from the buffer in FIFO for another send operation to fill the buffer. If the NCCL buffer is full, send operation should wait until the buffer of target GPU has free space. If the NCCL buffer has enough free space, p2p send operation can complete without waiting for p2p recv operation to be called from target GPU. ENVPIPE makes sure that NCCL buffer size is enough to handle all activations and gradients to be communicated in a non-blocking way to effectively use bubbles.

Is this interpretation also correct?

Then how does the sender GPU know when the target GPU's buffer has free space, and wait for it?
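
For concreteness, here is a minimal sketch of the chunked-FIFO handshake the paper describes. This is illustrative only, not NCCL's actual code; the names fifo, head, tail, and NSLOTS are made up. The sender publishes chunks by advancing head and spins whenever head minus tail reaches the slot count, which is where a "blocking" send waits for the receiver to free space by advancing tail.

```c
/* Illustrative sketch of a chunked send/recv FIFO -- NOT NCCL's actual code.
 * The receiver advances `tail` after consuming a slot; the sender may only
 * write a new slot while fewer than NSLOTS chunks are outstanding, so it
 * spins (blocks) when the buffer is full. */
#include <stdint.h>
#include <string.h>

#define NSLOTS     8              /* buffer split into NSLOTS chunks        */
#define SLOT_BYTES (512 * 1024)   /* chunk size ~ buffer size / NSLOTS      */

struct fifo {
    volatile uint64_t head;       /* written by sender: chunks produced     */
    volatile uint64_t tail;       /* written by receiver: chunks consumed   */
    char slots[NSLOTS][SLOT_BYTES];
};

/* Sender side: copy one chunk into the ring, waiting while it is full. */
static void send_chunk(struct fifo *f, const void *src, size_t bytes)
{
    while (f->head - f->tail >= NSLOTS) {
        /* buffer full: this is exactly where a blocking send stalls until
         * the receiver frees a slot by advancing tail */
    }
    memcpy(f->slots[f->head % NSLOTS], src, bytes);
    __sync_synchronize();         /* make the data visible before publishing */
    f->head = f->head + 1;        /* publish the new chunk                    */
}
```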

Best Regards
Taekyoung

@taekyounghan taekyounghan changed the title Is the description of NCCL_BUFFSIZE's role appropriate? Is this description of NCCL_BUFFSIZE's role appropriate? Apr 14, 2024
@taekyounghan (Author)

Hello @sjeaugey @kwen2501, sorry for tagging you without permission.

I was looking at the answers in the issues and thought you guys might be able to answer my question.

Are there any additional effects of NCCL_BUFFSIZE, such as making communication non-blocking?

Here's an image @sjeaugey had attached:

[image from the linked comments]

(#157 (comment), #353 (comment))

Pardon the rudeness

Best Regards
Taekyoung

@sjeaugey (Member)

Like for any MPI implementation, send operations may block. If NCCL has enough buffering and has sent everything to the destination GPU, it may exit once the data is in flight, but this is not guaranteed and can change in the future, so you should not base your implementation on that assumption.

For non-blocking behavior to be guaranteed, we'd need to provide something like an MPI_Bsend.
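
For reference, a short sketch in plain MPI of what "something like MPI_Bsend" means: the message is copied into a user-attached buffer and the call returns without waiting for the receiver. This is standard MPI, not an existing NCCL API.

```c
/* Standard MPI illustration of a buffered send: MPI_Bsend copies the message
 * into a user-attached buffer and returns immediately, whereas MPI_Send is
 * allowed to block until the receiver is ready. NCCL has no equivalent. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, n = 1 << 20;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *data = malloc(n * sizeof(float));

    if (rank == 0) {
        /* Attach a buffer large enough for the message plus MPI overhead. */
        int bufsize = n * sizeof(float) + MPI_BSEND_OVERHEAD;
        void *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);

        /* Returns as soon as the data is copied into `buf`, whether or not
         * rank 1 has posted its receive yet. */
        MPI_Bsend(data, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);

        MPI_Buffer_detach(&buf, &bufsize);  /* waits until the data is gone */
        free(buf);
    } else if (rank == 1) {
        MPI_Recv(data, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(data);
    MPI_Finalize();
    return 0;
}
```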

taekyounghan (Author) commented Apr 16, 2024

> Like for any MPI implementation, send operations may block. If NCCL has enough buffering and has sent everything to the destination GPU, it may exit once the data is in flight, but this is not guaranteed and can change in the future, so you should not base your implementation on that assumption.
>
> For non-blocking behavior to be guaranteed, we'd need to provide something like an MPI_Bsend.

Hello @sjeaugey,

I deeply appreciate your generous response even though I tagged you without permission.

You mean that NCCL can behave in a non-blocking manner if NCCL_BUFFSIZE is large, but this is not guaranteed, right?

Can I ask a few more questions?

I have a distributed scenario where multiple collective communications can collide in a simple dumbbell topology (RoCEv2, every link is 40 Gbps).

[image: dumbbell topology]

AllReduce and (AllGather, ReduceScatter) might collide, and I confirmed two overlapping NCCL kernels with Nsight.

[image: Nsight trace showing overlapping NCCL kernels]

I thought this would cause congestion because DCQCN, the RDMA NIC's rate control algorithm, always starts sending data at the line rate.

At first, I couldn't observe any congestion signals (e.g., PFC PAUSE or CNPs (congestion notification packets)).

However, when I changed some NCCL environment variables, congestion could be observed:

| Try | NCCL_BUFFSIZE | NCCL_PROTO | PFC PAUSE or CNP appears |
| --- | --- | --- | --- |
| 1 | 4 MiB | not set | X |
| 2 | 8 MiB | not set | O |
| 3 | 16 MiB | not set | O |
| 4 | 8 MiB | LL | X |
| 5 | 8 MiB | LL128 | X |
| 6 | 8 MiB | Simple | O |
| 7 | 4 MiB | Simple | X |
| 8 | 16 MiB | LL128 | X |
| 9 | 64 MiB | LL128 | X |

(O = congestion signal observed, X = not observed)
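
For reproducibility, the trials above only change environment variables, which can be exported in the job script or, as far as I understand, set from the process itself before the first NCCL call, since NCCL reads them during communicator initialization. A minimal sketch, with values taken from one row of the table:

```c
/* Illustration only: the trials above were driven purely by environment
 * variables. They are normally exported in the job script, but can also be
 * set from the process as long as this happens before the first NCCL call.
 * The values below correspond to the 8 MiB / Simple row of the table. */
#include <stdlib.h>

void configure_nccl_before_init(void)
{
    setenv("NCCL_BUFFSIZE", "8388608", 1);  /* 8 MiB (default is 4194304)   */
    setenv("NCCL_PROTO",    "Simple",  1);  /* LL / LL128 / Simple          */
    /* ... then call ncclCommInitRank() / ncclCommInitAll() as usual ...    */
}
```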


Referring to some information about the protocols, I can understand why the LL protocol doesn't cause congestion.
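
For anyone reading along, here is a rough sketch of the line formats usually described for the LL and LL128 protocols (illustrative structs, not NCCL's headers): LL interleaves a 4-byte flag with every 4 bytes of payload, so only about half of the traffic is payload, while LL128 amortizes an 8-byte flag over a 128-byte line, giving roughly 95% efficiency.

```c
/* Illustrative structs only, not NCCL's actual headers.
 * LL:    every 4 bytes of payload travel with a 4-byte flag,
 *        so payload efficiency is capped around 50%.
 * LL128: one 8-byte flag per 128-byte line (120 bytes payload),
 *        so efficiency is roughly 95%.
 * The receiver polls the flag to detect that the line has arrived,
 * avoiding a separate completion/fence. */
#include <stdint.h>

struct ll_line {          /* LL: 8-byte line                    */
    uint32_t data;        /* 4 bytes of payload                 */
    uint32_t flag;        /* 4-byte flag polled by the receiver */
};

struct ll128_line {       /* LL128: 128-byte line               */
    uint64_t data[15];    /* 15 * 8 = 120 bytes of payload      */
    uint64_t flag;        /* single 8-byte flag per line        */
};
```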

Here are my questions:

1a) Does NCCL change the algorithm (e.g., Ring, Tree) and protocol at runtime, or does it stay static with the optimal ALGO/PROTO determined before running?

2a) A larger NCCL_BUFFSIZE tends to increase the amount of congestion. For example, a 4 MiB buffer size with Simple doesn't cause congestion, but 8 MiB with Simple does. How can I understand this? Can NCCL_BUFFSIZE affect bandwidth the way protocols do?

2b) Does NCCL engage the NIC's rate control algorithm (i.e., DCQCN)? If not, how can LL128 with 64 MiB avoid congestion at 95% bandwidth (close to Simple's 100%)? If more collectives exist on the same bottleneck link, won't LL128 also suffer from congestion?

2c) In #157 (comment), you mentioned that NCCL_BUFFSIZE determines the chunk size. Is that value different from NCCL_P2P_NET_CHUNKSIZE?

Sorry to bother you with so many questions.

Any knowledge you can share would be a great help to me.

Best Regards,
Taekyoung

@taekyounghan (Author)

> Like for any MPI implementation, send operations may block. If NCCL has enough buffering and has sent everything to the destination GPU, it may exit once the data is in flight, but this is not guaranteed and can change in the future, so you should not base your implementation on that assumption.
>
> For non-blocking behavior to be guaranteed, we'd need to provide something like an MPI_Bsend.

Hi @sjeaugey, any updates?
