Currently, when user logic throws an exception into the Bytewax Rust layer, we panic from within the Timely operator, attaching the Python exception to the panic, and then catch_unwind in the main functions that spawn workers. This works, but it has some undesirable rough edges:
Timely "double panics" when in cluster mode. Something about the communications code uses queues that get poisoned on panic.
There are no per-thread panic handlers. This means we don't actually have programmatic access to all exceptions when multiple threads throw them.
Timely's multiple-worker API flattens all panic info into a String so we don't have programmatic access to the exception anyway. (Single worker does allow this, though.) We end up printing out the exceptions to the terminal, then raising a generic exception that says "look at the scrollback".
My desire to not panic.
Alternate idea: Have an Arc<Mutex<Vec<PyErr>>> in each worker that is checked when we call worker.step, and give each operator a handle to push any exceptions they get into it. This is sort of "write to global state, then check it in our run loop." Since we're managing the exception object the whole time, we have access to it at full fidelity and we gracefully stop the worker.
Things that are slightly tricky with this idea:
Closures in operator code. Some Timely objects that take closures, like Notificators, don't return values, so we might have to fill this vec from slightly more places. I don't think this will be too big a deal, though.
Cluster mode: I think we need a broadcast dataflow stream so that a worker that stops can signal to all the other workers that they should stop too. We might also need probes to confirm that the crashing worker's signal has actually made it through Timely's queues before stopping the run loop. I'm not sure whether that's a deal-breaker for this idea.
As part of totally getting rid of the unwrap_any macro, we might also have to allow reraise_with to take a closure that returns a PyResult, so that exceptions raised while building exception messages can bubble up.
Another benefit here: if we can completely get rid of using panic! as part of normal Python exception handling, it should be possible to compile Bytewax in panic=abort mode. This is said to improve compilation times, binary size, and performance.
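Concretely, that would mean being able to set this in Cargo.toml, which is only viable once nothing relies on catch_unwind:

```toml
# catch_unwind cannot catch anything under panic = "abort", so this is
# only possible after panics stop being used for Python error handling.
[profile.release]
panic = "abort"
```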