Graceful impromptu shutdown #268

Open
davidselassie opened this issue Jul 27, 2023 · 1 comment
Labels
type:feature New feature request

Comments

@davidselassie
Contributor

Inspired by thinking about the answer to this question in Slack. I don't know that this is a good idea, but I wanted to put it somewhere for posterity.

When a dataflow's input reaches EOF, all data is flushed out of the dataflow and it stops; that's a graceful shutdown. Currently SIGINT does an abrupt shutdown: it stops running the main worker step loop, so any in-flight functions finish and no exceptions or panics are thrown. But it does not purposefully flush any data out of the dataflow, and it does not give any stronger resume guarantees than an abort shutdown caused by throwing an exception. I guess the process return code is different, but I think that's it?

It might be interesting to either change the behavior of SIGINT (or add a response to SIGHUP) so that it instead does "increment epoch immediately and induce EOF in all inputs, then wait for graceful shutdown". This would mean SIGINT induces the coordinated checkpoint process, so you could actually ensure you resume from that point instead of from the last epoch snapshot.
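To make the "induce EOF, then wait for graceful shutdown" idea concrete, here is a minimal pure-Python sketch. It does not use the Bytewax API at all: `shutdown_requested`, `interruptible`, and `run_toy_dataflow` are hypothetical names made up for illustration, and the final flush only stands in for whatever the coordinated checkpoint would really do.

```python
import signal
import threading

# Hypothetical flag, not part of Bytewax: set when a graceful stop is requested.
shutdown_requested = threading.Event()


def request_graceful_shutdown(signum, frame):
    """Signal handler: only record the request; the loop reacts to it later."""
    shutdown_requested.set()


def install_handler():
    """Route SIGINT to the flag instead of raising KeyboardInterrupt.

    Must be called from the main thread.
    """
    signal.signal(signal.SIGINT, request_graceful_shutdown)


def interruptible(source):
    """Yield from `source` until it is exhausted or a shutdown is requested.

    The flag is checked before pulling the next item, so nothing is read
    from the source and then dropped.
    """
    it = iter(source)
    while not shutdown_requested.is_set():
        try:
            item = next(it)
        except StopIteration:
            return
        yield item


def run_toy_dataflow(source, batch_size=3):
    """Toy stand-in for a worker loop that buffers items into batches."""
    buffer, flushed = [], []
    for item in interruptible(source):
        buffer.append(item)
        if len(buffer) == batch_size:
            flushed.append(buffer)
            buffer = []
    # Induced EOF and real EOF take the same path: flush what is left.
    if buffer:
        flushed.append(buffer)
    return flushed
```

In this sketch, a SIGINT arriving mid-run looks the same to the loop as the input ending, so the partial batch still gets flushed. An abrupt stop would instead leave the loop without running the final flush.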

There's sort of the question of "does this enable anything meaningful?"

  • If for some reason you have to have very long epoch intervals and you want to reduce resume time, you can ensure you have a very recent snapshot before killing the cluster.

  • This would futz with system time windows because you would force window closure. But system time windows are non-deterministic anyway.

  • You still have no way of coordinating the offsets of the input sources when you SIGINT, so you can't really guarantee any coherent property of the output when you do this.

  • It might be more useful to add a way to trigger this behavior from an output operator? Since output is ostensibly where your answers are finally known, that's the only point in the dataflow where you have coherent information about whether you've processed enough data and could initiate early shutdown.
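The output-triggered variant in the last bullet could look roughly like the following. Again this is a toy, single-threaded sketch under assumed names (`run_until_output_satisfied`, `transform`, `is_done` are all hypothetical), not a proposal for the actual Bytewax API.

```python
import threading


def run_until_output_satisfied(source, transform, is_done):
    """Run a toy one-step dataflow until the sink decides it has seen enough.

    `is_done` receives the list of outputs written so far and returns True
    once enough data has been processed.
    """
    stop = threading.Event()
    outputs = []

    def sink(item):
        outputs.append(item)
        # Only the output side knows whether the answer is complete.
        if is_done(outputs):
            stop.set()

    it = iter(source)
    # Check the stop flag before reading, so no input is read and then dropped.
    while not stop.is_set():
        try:
            item = next(it)
        except StopIteration:
            break
        sink(transform(item))
    return outputs


# Stop once the running sum of squares reaches 30.
result = run_until_output_satisfied(
    range(100), lambda x: x * x, lambda outs: sum(outs) >= 30
)
```

The point of the sketch is only that the stop decision lives in the sink while the input loop just observes a flag. In a multi-worker dataflow that flag would have to be coordinated across workers somehow, which is not shown here.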

I'm still not totally sure about the use cases for "stop the dataflow early" and would like to learn more about them before making suggestions here.

@davidselassie
Contributor Author

A single-process, testing-only version of this is implemented in #317.
