Inspired by thinking about the answer to this question in slack. I don't know that this is a good idea, but wanted to put it somewhere for posterity.
When a dataflow's input reaches EOF, all data is flushed out of the dataflow, and it stops; that's a graceful shutdown. Currently SIGINT does an abrupt shutdown: it stops running the main worker step loop, so any in-flight functions finish and nothing throws an exception or panics, but it does not purposefully flush any data out of the dataflow or strengthen any resume guarantees beyond an abort shutdown caused by throwing an exception. I guess the process return code is different, but I think that's it?
It might be interesting to either change the behavior of SIGINT (or add a response to SIGHUP) so that it instead increments the epoch immediately, induces EOF in all inputs, and then waits for graceful shutdown. This would mean SIGINT triggers the coordinated checkpoint process, so you could actually ensure you resume from that point instead of from the last epoch snapshot.
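As a minimal sketch of what "induce EOF in all inputs on SIGINT" could look like: a signal handler flips a shared drain flag, and input sources report EOF once it's set. All names here (`DrainFlag`, `read_batch`, `next_batch`) are illustrative, not from any real API.

```python
import signal
import threading

class DrainFlag:
    """Hypothetical shared flag that input sources consult each poll."""

    def __init__(self):
        self._event = threading.Event()

    def set(self):
        self._event.set()

    def draining(self):
        return self._event.is_set()

DRAIN = DrainFlag()

def handle_sigint(signum, frame):
    # Instead of aborting the worker step loop, ask every input to
    # report EOF so the dataflow flushes and snapshots before exiting.
    DRAIN.set()

signal.signal(signal.SIGINT, handle_sigint)

def read_batch(source):
    """Illustrative input wrapper: pretend EOF once draining starts."""
    if DRAIN.draining():
        return None  # None stands in for EOF in this imagined loop
    return source.next_batch()
```

The key point is that EOF is faked at the sources, so the normal graceful-shutdown path (flush, final epoch, snapshot) runs unchanged downstream.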
There's sort of the question of "does this enable anything meaningful?"
If for some reason you must have very long epoch intervals and you want to reduce resume time, this lets you ensure you have a very recent snapshot before killing the cluster.
This would futz with system time windows because you would force window closure. But system time windows are non-deterministic anyway.
You still have no way of coordinating the offsets of the input sources when you SIGINT, so you can't really guarantee any coherent property of the output when you do this.
It might be more useful to add a way to trigger this behavior from an output operator? Since output is ostensibly where your answers are finally known, that's the only point in the dataflow where you have coherent information about whether you've processed enough data and could initiate early shutdown.
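To illustrate the output-side idea: a sink that counts what it has written and requests shutdown once it has seen "enough". The `request_shutdown` callback and `CountingSink` class are imagined for this sketch; no such API exists in the source.

```python
class CountingSink:
    """Hypothetical output sink that initiates early shutdown.

    Because the sink sits at the end of the dataflow, it is the one
    place with a coherent view of how much output has been produced.
    """

    def __init__(self, limit, request_shutdown):
        self.limit = limit
        self.seen = 0
        self.request_shutdown = request_shutdown

    def write(self, item):
        self.seen += 1
        if self.seen >= self.limit:
            # We know, from the output side, that enough data has
            # been processed; ask the runtime to drain and stop.
            self.request_shutdown()
```

In this model `request_shutdown` would kick off the same induce-EOF-and-drain sequence discussed above, just from inside the dataflow rather than from a signal.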
I'm still not totally sure on the use cases for "stop the dataflow early" and would like to learn more about them before making suggestions here.