
Support skipping + retry recipes for failure recovery (aka "skipsies") #670

Open
abarciauskas-bgse opened this issue Jan 18, 2024 · 2 comments

@abarciauskas-bgse (Contributor)

I understand a common problem is having failures on some, but not all, source files. It is nearly impossible to run a massively parallel job and not face some sort of connection issue or other unexpected error from opening a file.

It would be great if there were a way to skip over failures, perhaps by writing NaNs for the expected dimensions, logging the failure, and then running a retry version of the same recipe which tries to fill in those gaps.

cc @ranchodeluxe @norlandrhagen @sharkinsspatial (who came up with the name "skipsies")

@norlandrhagen (Contributor)

Julius Buseke has been running a bunch of the CMIP6 archive through pangeo-forge-recipes (on dataflow). I can ask him if he has found any good ways to re-run failed jobs and keep track of them.

@ranchodeluxe (Contributor)

Ha, I had a similar ticket I closed yesterday 😄

I like the NaN route as a last resort.

Later today I plan to crosswalk what Flink/Beam have for checkpointing (which is another way to solve this), though that depends on the runner. Running with LocalDirectBakery on a decent-sized machine still produces network issues against an auth-fronted S3 bucket. I'll also compare against a public bucket.
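For the transient network errors mentioned above, a complement to runner-level checkpointing is to absorb them at the call site with exponential backoff. A generic sketch (the `retry` wrapper and `flaky_fetch` are illustrative, not a pangeo-forge or Beam API):

```python
import random
import time

def retry(fn, attempts=5, base_delay=0.5,
          exceptions=(ConnectionError, TimeoutError)):
    """Call fn(), retrying transient errors with exponential backoff + jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except exceptions:
            if attempt == attempts - 1:
                raise                       # out of retries: surface the error
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.1)
            time.sleep(delay)

# Demo: a fetch that fails twice with a connection error, then succeeds.
state = {"calls": 0}
def flaky_fetch():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("s3 auth hiccup")
    return "data"

result = retry(flaky_fetch, base_delay=0.01)
print(result)    # succeeded on the third attempt
```

Note that fsspec/s3fs already expose their own retry knobs, so in practice this kind of wrapper is only needed for errors those layers don't cover.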
