Provide a configuration option to enable a "fail fast" development mode #1274

Open
kkersten opened this issue Jan 30, 2023 · 0 comments

Problem: the server can be configured in a way that causes an indefinitely hanging job
The current FLARE controller is designed to allow setting the minimum number of required clients along with a server timeout. When min_clients is set to the total number of available clients with server_timeout=0, a failed client will cause the server workflow to hang.
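
The hang condition described above can be sketched as a toy model. This is not FLARE code: `round_hangs` and its parameters are hypothetical names, and the model assumes that `server_timeout=0` means "wait with no deadline", as the behavior described in this issue implies.

```python
def round_hangs(min_clients: int, total_clients: int,
                failed_clients: int, server_timeout: float) -> bool:
    """Toy model: does the server wait forever for client results?

    Assumption (not taken from the FLARE source): server_timeout == 0
    means the server waits without any deadline.
    """
    responding = total_clients - failed_clients
    if responding >= min_clients:
        # Enough clients still respond, so the round can complete.
        return False
    # Too few clients respond; only a positive timeout ends the wait.
    return server_timeout == 0


# All 3 clients are required, one has failed, no deadline: the job hangs.
print(round_hangs(min_clients=3, total_clients=3, failed_clients=1, server_timeout=0))
```

Lowering `min_clients` below the total, or setting a positive timeout, avoids the hang in this model, but both are controller-level settings, which is the coupling this issue wants to remove.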

This behavior is useful for production use cases, where the server workflow should be resilient to temporary interruptions in client communication and clients should be able to fail briefly and reconnect.

But in cases where a client has failed and is unrecoverable, the server workflow should time out, independently of the controller workflow configuration. This would also allow a "development mode" in which any client failure causes the server workflow to terminate.

Potential solution
A separate server timeout configuration could be implemented independent of the controller configuration (for example in the server communication layer). This could be configured as a server job timeout, where

  • a timeout of 0 could trigger immediate failure (development mode)
  • a timeout of -1 (inf) would result in current behavior (production mode)
  • a positive timeout would fail the job after that interval, depending on your level of patience
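
A minimal sketch of how these three cases could be evaluated, assuming a single numeric setting. The names `should_abort_job`, `job_timeout`, and `seconds_since_failure` are hypothetical and do not exist in FLARE; the unit (seconds) is also an assumption.

```python
def should_abort_job(job_timeout: float, seconds_since_failure: float) -> bool:
    """Decide whether the server should terminate the job.

    job_timeout < 0   -> never abort (current behavior, production mode)
    job_timeout == 0  -> abort as soon as a client failure is seen (development mode)
    job_timeout > 0   -> abort once the client has been down that long
    """
    if job_timeout < 0:
        return False
    return seconds_since_failure >= job_timeout


# Development mode: a failure detected just now aborts the job immediately.
print(should_abort_job(job_timeout=0, seconds_since_failure=0.0))
# Production mode: the job keeps waiting however long the client is down.
print(should_abort_job(job_timeout=-1, seconds_since_failure=3600.0))
```

Keeping the check in one place, outside the controller, would let the same workflow configuration run in either mode by changing only this value.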