Abort on exception without panicking #318

davidselassie · 2023-11-01T16:31:39Z

Currently, when user logic throws an exception into the Bytewax Rust layer, we panic from within the Timely operator, attaching the Python exception to the panic, then we catch_unwind in the main functions that spawn workers. This works, but has some tricky things that are undesirable:

Timely "double panics" when in cluster mode. Something about the communications code uses queues that get poisoned on panic.
Not having per-thread panic handlers. This means we don't actually have programmatic access to all exceptions if multiple threads throw them.
Timely's multiple-worker API flattens all panic info into a String so we don't have programmatic access to the exception anyway. (Single worker does allow this, though.) We end up printing out the exceptions to the terminal, then raising a generic exception that says "look at the scrollback".
My desire to not panic.

Alternate idea: Have a Arc<Mutex<Vec<PyErr>>> in each worker that is checked when we do worker.step and have each operator have a handle to throw any exceptions they get in there. This is sort of "write to global state, then check it in our run loop." Since we're managing the exception object the whole time, we have access to it at full fidelity and we gracefully stop the worker.

Things that are slightly tricky with this idea:

Closures in operator code. Some Timely objects that use closures like Notifictators don't return values, so we might have to fill this vec from slightly more places. I don't think this will be too big a deal though.
Cluster mode: I think we need a broadcast dataflow stream that one worker that stops can signal to all other workers that they should stop. It's possible that we'll need to do something with probes to know that the crashing worker has sent this signal over Timely's queues before stopping the run loop. Not sure if that's a deal-breaker for this idea.

The text was updated successfully, but these errors were encountered:

davidselassie · 2023-11-06T20:49:00Z

As part of totally getting rid of the unwrap_any macro, might also have to allow reraise_with to take a closure that returns a PyResult so you can bubble up exceptions building exception messages.

davidselassie · 2024-05-22T22:10:22Z

Another benefit here: if we can completely get rid of using panic! as part of normal Python exception handling, it should be possible to compile Bytewax in panic=abort mode. This is said to improve compilation times, binary size, and performance.

github-actions bot added the needs triage New issue, needs triage label Nov 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Abort on exception without panicking #318

Abort on exception without panicking #318

davidselassie commented Nov 1, 2023

davidselassie commented Nov 6, 2023

davidselassie commented May 22, 2024

Abort on exception without panicking #318

Abort on exception without panicking #318

Comments

davidselassie commented Nov 1, 2023

davidselassie commented Nov 6, 2023

davidselassie commented May 22, 2024