fix: do not keep retrying accept if handshake fails for "proxy" type sockets #1064
In `socketFinalizeAccept`, if the handshake (magic validation) fails, we should return failure for "proxy" sockets.

When `ncclProxyService` tries to accept a new connection, with the current behavior a spurious connection can cause `ncclSocketAccept` to burn CPU cycles endlessly. Because `ncclProxyService` marks the listening socket non-blocking, once a spurious connection causes `socketFinalizeAccept` to fail, we immediately retry `ncclTryAccept`, get `EAGAIN`, retry `ncclTryAccept`, and so on.

Unfortunately we can't just change the default behavior though. Several other callsites rely on `ncclSocketAccept` to ignore spurious connections for them, e.g., callsites in `bootstrap.c`. IMHO, those callsites should be refactored and either 1) do retries themselves, or 2) use something new such as `ncclSocketAcceptBlockUntilHandshakeSucceeds`.
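To make the spin concrete, here is a minimal, self-contained simulation of the accept loop described above. It is only a sketch: every name in it (`fake_listener_t`, `fake_try_accept`, `fake_socket_accept`, the `fail_fast_for_proxy` flag, and the `max_calls` cap) is hypothetical and does not come from the NCCL sources, and a queue of booleans stands in for real sockets and magic validation. The cap exists only so that the simulated busy loop terminates.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are not NCCL types. */
typedef enum { SOCK_TYPE_BOOTSTRAP, SOCK_TYPE_PROXY } sock_type_t;
typedef enum { ACCEPT_OK, ACCEPT_FAILED, ACCEPT_GAVE_UP } accept_result_t;

/* A fake non-blocking listening socket with a queue of pending connections. */
typedef struct {
  const bool *pending_magic_ok; /* one entry per queued connection: true = valid magic */
  size_t pending_count;         /* number of queued connections */
  size_t next;                  /* index of the next connection to hand out */
  size_t try_accept_calls;      /* how many times the loop polled the listener */
} fake_listener_t;

/* Models a non-blocking accept(): returns 0 and reports whether the peer's
 * magic is valid when a connection is pending, otherwise returns EAGAIN. */
static int fake_try_accept(fake_listener_t *l, bool *magic_ok) {
  l->try_accept_calls++;
  if (l->next >= l->pending_count) return EAGAIN;
  *magic_ok = l->pending_magic_ok[l->next++];
  return 0;
}

/* Models the accept loop. With fail_fast_for_proxy == false, a failed
 * handshake is silently dropped and the loop retries (the behavior the
 * bootstrap-style callers rely on). With fail_fast_for_proxy == true, a
 * failed handshake on a proxy socket is reported to the caller instead.
 * max_calls caps the simulated spin so the example always terminates. */
static accept_result_t fake_socket_accept(fake_listener_t *l, sock_type_t type,
                                          bool fail_fast_for_proxy,
                                          size_t max_calls) {
  assert(l != NULL);
  while (l->try_accept_calls < max_calls) {
    bool magic_ok = false;
    int rc = fake_try_accept(l, &magic_ok);
    if (rc == EAGAIN) continue;        /* nothing pending: poll again right away */
    if (magic_ok) return ACCEPT_OK;    /* handshake succeeded */
    if (fail_fast_for_proxy && type == SOCK_TYPE_PROXY) {
      return ACCEPT_FAILED;            /* let the proxy service decide what to do */
    }
    /* Otherwise ignore the spurious connection and retry. */
  }
  return ACCEPT_GAVE_UP;               /* only reachable because of the cap */
}
```

With a single spurious connection queued, the retry-always variant polls the fake listener until it hits the cap, while the fail-fast variant returns after one poll. A bootstrap-type socket still skips the spurious connection and accepts the valid one behind it, which is the behavior the other callsites depend on.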