Skip postprocessing POST retries when the file does not exist on the pulsar machine #298

Open
cat-bro opened this issue Jun 2, 2022 · 5 comments · May be fixed by #329

Comments

@cat-bro
Contributor

cat-bro commented Jun 2, 2022

Setting max_retries to retry POSTs of output files to Galaxy is extremely useful, since Galaxy is sometimes restarting or too busy to receive the POST. However, the retries also occur when the file does not exist on the Pulsar machine, and there they are not useful: if the file does not exist when the job completes, it will not exist X retries later. Most often the output files are missing because the job has failed. Depending on the settings and the number of expected outputs, a user might have to wait over an hour to find out that their job has failed. Nonexistence of expected output files could be handled by a separate check prior to the retry loop, with retries skipped in that case.
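
A minimal sketch of the suggested pre-check (not Pulsar's actual code; `post_output_with_retries` and `post_file` are placeholders for whatever the real transfer action looks like):

```python
# Hypothetical sketch: check existence once before the retry loop, so a
# missing output fails fast instead of burning through max_retries.
import os
import time


def post_output_with_retries(path, post_file, max_retries=10, interval=60):
    """post_file(path) stands in for the actual POST of the file to Galaxy."""
    if not os.path.exists(path):
        # The file was never produced (most likely the job failed);
        # retrying will not make it appear, so report this immediately.
        raise FileNotFoundError("Expected output missing on the Pulsar machine: %s" % path)

    for attempt in range(1, max_retries + 1):
        try:
            post_file(path)
            return
        except Exception:
            # Galaxy may be restarting or too busy; this transient case is
            # what the retries were designed for.
            if attempt == max_retries:
                raise
            time.sleep(interval)
```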

@neoformit

Looking into this

@natefoo
Member

natefoo commented Sep 29, 2022

This was useful for me when I had filesystem problems on the Pulsar side where the filesystem did eventually come back, but I agree that it is far more nuisance than help in the overwhelming majority of cases. I'd typically prefer to just fail and rerun the job in the rare case of filesystem problems rather than sit through these retries for every legitimate job failure.

@neoformit

I've been working on a fix for this (for a while, in the background). The current behavior creates a nasty UX issue for many of our users: for any job run on Pulsar that fails, the user has to wait an hour to get the failure message back. With AlphaFold, this also means an hour of Azure GPU time wasted! Perhaps we could add an additional check for a failed job status before aborting the retries; either way, I would plan on making this configurable.

@natefoo
Member

natefoo commented Sep 30, 2022

The issue is (partly) that in most cases Pulsar is not really the arbiter of what has failed. It simply dutifully copies things back to Galaxy and then lets Galaxy decide. That said, failing to copy back outputs (after that long delay) is one of the things that does result in Pulsar informing Galaxy that the job failed.

As @cat-bro said, I think we're best off just not retrying when the file does not exist, or at least having a separate configurable - you could have NFS attribute caching issues that would make you want to retry a few times, but not extensively the way you might if posting to Galaxy fails.
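
One way that separate configurable could look, as a rough sketch (the `missing_file_retries` option name is made up here, not an existing Pulsar setting): a small retry budget just for a file that has not appeared yet, independent of the larger max_retries budget for failed POSTs.

```python
# Hypothetical helper: give a laggy filesystem (e.g. NFS attribute caching)
# a short window to catch up before declaring the output missing.
import os
import time


def wait_for_file(path, missing_file_retries=3, interval=5):
    for attempt in range(missing_file_retries + 1):
        if os.path.exists(path):
            return True
        if attempt < missing_file_retries:
            time.sleep(interval)
    return False
```

The POST retry loop would then only run when `wait_for_file` returns True; otherwise the job can be failed right away.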

@neoformit

Yep, we're in agreement on that. I was suggesting that we could also check for a job-failed status before aborting a missing-file retry loop; that would still allow NFS issues (etc.) to be resolved for a successful job. Do you think there's a way to do that consistently? Or is there no good way for Pulsar to determine that from the job working directory?
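
A rough sketch of that check (the return-code file name and location are assumptions for illustration, not Pulsar's actual layout): if an exit code has been recorded in the working directory and it is non-zero, skip the missing-file retries and fail immediately.

```python
# Hypothetical check for a recorded job failure in the working directory.
import os


def job_appears_failed(working_directory, return_code_file="return_code"):
    rc_path = os.path.join(working_directory, return_code_file)
    if not os.path.exists(rc_path):
        # No exit code recorded yet; we cannot conclude anything.
        return False
    with open(rc_path) as fh:
        contents = fh.read().strip()
    return contents not in ("", "0")
```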

@neoformit linked a pull request Jul 15, 2023 that will close this issue