
Support skipping + retry recipes for failure recovery (aka "skipsies") #670

Open
abarciauskas-bgse opened this issue Jan 18, 2024 · 2 comments

@abarciauskas-bgse (Contributor)

I understand a common problem is having failures on some, but not all, source files. It is nearly impossible to run a massively parallel job and not face some sort of connection issue or other unexpected error from opening a file.

It would be great if there were a way to skip over failures, perhaps by writing NaNs for the expected dimensions, logging the failure, and then running a retry version of the same recipe which tries to fill in those gaps.

cc @ranchodeluxe @norlandrhagen @sharkinsspatial (who came up with the name "skipsies")

@norlandrhagen (Contributor)

Julius Buseke has been running a bunch of the CMIP6 archive through pangeo-forge-recipes (on dataflow). I can ask him if he has found any good ways to re-run failed jobs and keep track of them.

@ranchodeluxe (Contributor)

Ha, I had a similar ticket I closed yesterday 😄

I like the NaN route as a last resort.

Later today I plan to crosswalk what Flink/Beam have for checkpointing (which is another way to solve this), though that depends on the runner. Running with LocalDirectBakery on a decent-sized machine still produces network issues against an auth-fronted S3 bucket. I'll also compare against a public bucket.
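For the transient network errors mentioned above, a complement to runner-level checkpointing is to absorb them at the call site with exponential backoff. A generic sketch (the `retry` wrapper and `flaky_fetch` are illustrative, not a pangeo-forge or Beam API):

```python
import random
import time

def retry(fn, attempts=5, base_delay=0.5,
          exceptions=(ConnectionError, TimeoutError)):
    """Call fn(), retrying transient errors with exponential backoff + jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except exceptions:
            if attempt == attempts - 1:
                raise                       # out of retries: surface the error
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.1)
            time.sleep(delay)

# Demo: a fetch that fails twice with a connection error, then succeeds.
state = {"calls": 0}
def flaky_fetch():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("s3 auth hiccup")
    return "data"

result = retry(flaky_fetch, base_delay=0.01)
print(result)    # succeeded on the third attempt
```

Note that fsspec/s3fs already expose their own retry knobs, so in practice this kind of wrapper is only needed for errors those layers don't cover.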
