Handle network issues gracefully #11

Open
wsry opened this issue Dec 6, 2021 · 0 comments
wsry commented Dec 6, 2021

Motivation

Currently, network issues such as an unstable connection can cause task failover, which in turn can force data to be reproduced. We can avoid this by reconnecting to the remote ShuffleWorker and retransmitting the data.

Changes

On a network issue, neither the client nor the server should fail immediately. Instead, if the client can reconnect to the server within a timeout, no failover or data reproduction should be triggered. We may also need a switch to disable this feature.
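
As a rough illustration of the intended behavior (not the project's actual implementation), the sketch below retries a send until a reconnect window expires and honors a switch that turns the feature off. All names here (`send_with_reconnect`, `ReconnectTimeoutError`, `reconnect_enabled`, `reconnect_timeout_s`, `retry_interval_s`) are hypothetical, and the `connect` callable stands in for whatever opens a channel to the ShuffleWorker.

```python
import time
from typing import Callable, Optional


class ReconnectTimeoutError(ConnectionError):
    """Raised when the peer could not be reached within the reconnect window."""


def send_with_reconnect(
    connect: Callable[[], Callable[[bytes], None]],  # hypothetical: opens a channel, returns a send function
    payload: bytes,
    reconnect_enabled: bool = True,      # hypothetical switch to disable the feature
    reconnect_timeout_s: float = 30.0,   # hypothetical reconnect window
    retry_interval_s: float = 1.0,       # hypothetical pause between attempts
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send payload, reconnecting and retransmitting on network errors.

    Returns the number of attempts that were needed.
    """
    attempts = 0
    deadline: Optional[float] = None  # fixed when the first failure is observed
    while True:
        attempts += 1
        try:
            send = connect()   # (re)establish the connection
            send(payload)      # (re)transmit the data
            return attempts
        except ConnectionError as err:
            if not reconnect_enabled:
                raise  # feature switched off: fail immediately, as today
            now = clock()
            if deadline is None:
                deadline = now + reconnect_timeout_s
            if now >= deadline:
                # Window exhausted: give up so the caller can trigger failover.
                raise ReconnectTimeoutError(
                    f"could not reconnect within {reconnect_timeout_s}s"
                ) from err
            sleep(retry_interval_s)
```

Only when `ReconnectTimeoutError` escapes (or the switch is off) would the caller fall back to failover. A real implementation would also need the server side to keep its state alive during the window and to deduplicate retransmitted data, which this sketch does not cover.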

Test

  • Unit test.
  • Test manually on a cluster.
@wsry wsry self-assigned this Feb 16, 2022