Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abort on exception without panicking #318

Open
davidselassie opened this issue Nov 1, 2023 · 2 comments
Open

Abort on exception without panicking #318

davidselassie opened this issue Nov 1, 2023 · 2 comments
Labels
needs triage New issue, needs triage

Comments

@davidselassie
Copy link
Contributor

Currently, when user logic throws an exception into the Bytewax Rust layer, we panic from within the Timely operator, attaching the Python exception to the panic, then we catch_unwind in the main functions that spawn workers. This works, but has some tricky things that are undesirable:

  • Timely "double panics" when in cluster mode. Something about the communications code uses queues that get poisoned on panic.
  • Not having per-thread panic handlers. This means we don't actually have programmatic access to all exceptions if multiple threads throw them.
  • Timely's multiple-worker API flattens all panic info into a String so we don't have programmatic access to the exception anyway. (Single worker does allow this, though.) We end up printing out the exceptions to the terminal, then raising a generic exception that says "look at the scrollback".
  • My desire to not panic.

Alternate idea: Have a Arc<Mutex<Vec<PyErr>>> in each worker that is checked when we do worker.step and have each operator have a handle to throw any exceptions they get in there. This is sort of "write to global state, then check it in our run loop." Since we're managing the exception object the whole time, we have access to it at full fidelity and we gracefully stop the worker.

Things that are slightly tricky with this idea:

  • Closures in operator code. Some Timely objects that use closures like Notifictators don't return values, so we might have to fill this vec from slightly more places. I don't think this will be too big a deal though.
  • Cluster mode: I think we need a broadcast dataflow stream that one worker that stops can signal to all other workers that they should stop. It's possible that we'll need to do something with probes to know that the crashing worker has sent this signal over Timely's queues before stopping the run loop. Not sure if that's a deal-breaker for this idea.
@github-actions github-actions bot added the needs triage New issue, needs triage label Nov 1, 2023
@davidselassie
Copy link
Contributor Author

As part of totally getting rid of the unwrap_any macro, might also have to allow reraise_with to take a closure that returns a PyResult so you can bubble up exceptions building exception messages.

@davidselassie
Copy link
Contributor Author

Another benefit here: if we can completely get rid of using panic! as part of normal Python exception handling, it should be possible to compile Bytewax in panic=abort mode. This is said to improve compilation times, binary size, and performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs triage New issue, needs triage
Projects
None yet
Development

No branches or pull requests

1 participant