Inspired by thinking about the answer to this question in slack. I don't know that this is a good idea, but wanted to put it somewhere for posterity.
When a dataflow's input reaches EOF, all data is flushed out of the dataflow, and it stops; that's a graceful shutdown. Currently SIGINT does an abrupt shutdown: it stops running the main worker step loop, so any in-flight functions finish and nothing throws an exception or panics, but it does not purposefully flush any data out of the dataflow or strengthen any resume guarantees beyond an abort shutdown caused by throwing an exception. I guess the process return code is different, but I think that's it?
It might be interesting to either change the behavior of SIGINT (or add a response to SIGHUP) so that it instead increments the epoch immediately, induces EOF in all inputs, and then waits for graceful shutdown. This would mean SIGINT triggers the coordinated checkpoint process, so you could actually ensure you resume from that point instead of from the last epoch snapshot.
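As a minimal sketch of what "induce EOF in all inputs on SIGINT" could look like: a signal handler flips a shared drain flag, and input sources report EOF once it's set. All names here (`DrainFlag`, `read_batch`, `next_batch`) are illustrative, not from any real API.

```python
import signal
import threading

class DrainFlag:
    """Hypothetical shared flag that input sources consult each poll."""

    def __init__(self):
        self._event = threading.Event()

    def set(self):
        self._event.set()

    def draining(self):
        return self._event.is_set()

DRAIN = DrainFlag()

def handle_sigint(signum, frame):
    # Instead of aborting the worker step loop, ask every input to
    # report EOF so the dataflow flushes and snapshots before exiting.
    DRAIN.set()

signal.signal(signal.SIGINT, handle_sigint)

def read_batch(source):
    """Illustrative input wrapper: pretend EOF once draining starts."""
    if DRAIN.draining():
        return None  # None stands in for EOF in this imagined loop
    return source.next_batch()
```

The key point is that EOF is faked at the sources, so the normal graceful-shutdown path (flush, final epoch, snapshot) runs unchanged downstream.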
There's sort of the question of "does this enable anything meaningful?"
If for some reason you must have very long epoch intervals and you want to reduce resume time, this lets you ensure you have a very recent snapshot before killing the cluster.
This would futz with system time windows because you would force window closure. But system time windows are non-deterministic anyway.
You still have no way of coordinating the offsets of the input sources when you SIGINT, so you can't really guarantee any coherent property of the output when you do this.
It might be more useful to add a way to trigger this behavior from an output operator? Since output is ostensibly where your answers are finally known, that's the only point in the dataflow where you have coherent information about whether you've processed enough data and could initiate early shutdown.
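To illustrate the output-side idea: a sink that counts what it has written and requests shutdown once it has seen "enough". The `request_shutdown` callback and `CountingSink` class are imagined for this sketch; no such API exists in the source.

```python
class CountingSink:
    """Hypothetical output sink that initiates early shutdown.

    Because the sink sits at the end of the dataflow, it is the one
    place with a coherent view of how much output has been produced.
    """

    def __init__(self, limit, request_shutdown):
        self.limit = limit
        self.seen = 0
        self.request_shutdown = request_shutdown

    def write(self, item):
        self.seen += 1
        if self.seen >= self.limit:
            # We know, from the output side, that enough data has
            # been processed; ask the runtime to drain and stop.
            self.request_shutdown()
```

In this model `request_shutdown` would kick off the same induce-EOF-and-drain sequence discussed above, just from inside the dataflow rather than from a signal.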
I'm still not totally sure on the use cases for "stop the dataflow early" and would like to learn more about them before making suggestions here.