Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: do not keep retrying accept if handshake fails for "proxy" type sockets #1064

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

0x804d8000
Copy link

In socketFinalizeAccept, if the handshake (magic validation) fails, we should return failure for "proxy" sockets.

When ncclProxyService tries accepting new connection, with the current behavior, a spurious connection can cause ncclSocketAccept endlessly burns CPU cycles. As ncclProxyService marks the listening socket non-blocking, once a spurious connection cause socketFinalizeAccept to fail, we immediately retry ncclTryAccept, and get EAGAIN, and retry ncclTryAccept...

Unfortunately we can't just change the default behavior though. Several other callsites rely on ncclSocketAccept to ignore spurious connections for them, e.g., callsites in bootstrap.c. IMHO, those callsites should be refactored and either 1) do retries themselves, or 2) use something new such as ncclSocketAcceptBlockUntilHandshakeSucceeds.

@0x804d8000
Copy link
Author

Hi, @sjeaugey would you mind taking a look at it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant