Skip postprocessing POST retries when the file does not exist on the pulsar machine #298

Open
cat-bro opened this issue Jun 2, 2022 · 5 comments · May be fixed by #329

Comments

@cat-bro
Contributor

cat-bro commented Jun 2, 2022

Setting max_retries to retry POSTs of output files to Galaxy is extremely useful, since Galaxy is sometimes restarting or too busy to receive the POST. However, the retries also occur when the file does not exist on the Pulsar machine, and there they are not useful: if the file does not exist when the job completes, it will not exist X retries later. Most often the output files are missing because the job has failed. Depending on the settings and the number of expected outputs, a user might have to wait over an hour to find out that their job has failed. Nonexistence of expected output files could be handled by a separate check prior to the retry loop, with retries skipped in that case.
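
A minimal sketch of the suggested pre-check (not Pulsar's actual code; `post_output_with_retries` and `post_file` are placeholders for whatever the real transfer action looks like):

```python
# Hypothetical sketch: check existence once before the retry loop, so a
# missing output fails fast instead of burning through max_retries.
import os
import time


def post_output_with_retries(path, post_file, max_retries=10, interval=60):
    """post_file(path) stands in for the actual POST of the file to Galaxy."""
    if not os.path.exists(path):
        # The file was never produced (most likely the job failed);
        # retrying will not make it appear, so report this immediately.
        raise FileNotFoundError("Expected output missing on the Pulsar machine: %s" % path)

    for attempt in range(1, max_retries + 1):
        try:
            post_file(path)
            return
        except Exception:
            # Galaxy may be restarting or too busy; this transient case is
            # what the retries were designed for.
            if attempt == max_retries:
                raise
            time.sleep(interval)
```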

@neoformit

Looking into this

@natefoo
Member

natefoo commented Sep 29, 2022

This was useful for me when I had filesystem problems on the Pulsar side where the filesystem did eventually come back, but I agree that it is far more nuisance than help in the overwhelming majority of cases. I'd typically prefer to just fail and rerun the job in the rare case of filesystem problems rather than sit through these retries for every legitimate job failure.

@neoformit

I've been working on a fix for this (for a while, in the background). The current behavior creates a nasty UX issue for many of our users: for any job run on Pulsar that fails, the user has to wait an hour to get the failure message back. With AlphaFold, this also means an hour of Azure GPU time wasted! Perhaps we could add an additional check for a failed job status before aborting the retries; either way, I would plan on making this configurable.

@natefoo
Member

natefoo commented Sep 30, 2022

The issue is (partly) that in most cases Pulsar is not really the arbiter of what has failed. It simply dutifully copies things back to Galaxy and then lets Galaxy decide. That said, failing to copy back outputs (after that long delay) is one of the things that does result in Pulsar informing Galaxy that the job failed.

As @cat-bro said, I think we're best off just not retrying when the file does not exist, or at least having a separate configurable - you could have NFS attribute caching issues that would make you want to retry a few times, but not extensively the way you might if posting to Galaxy fails.
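
One way that separate configurable could look, as a rough sketch (the `missing_file_retries` option name is made up here, not an existing Pulsar setting): a small retry budget just for a file that has not appeared yet, independent of the larger max_retries budget for failed POSTs.

```python
# Hypothetical helper: give a laggy filesystem (e.g. NFS attribute caching)
# a short window to catch up before declaring the output missing.
import os
import time


def wait_for_file(path, missing_file_retries=3, interval=5):
    for attempt in range(missing_file_retries + 1):
        if os.path.exists(path):
            return True
        if attempt < missing_file_retries:
            time.sleep(interval)
    return False
```

The POST retry loop would then only run when `wait_for_file` returns True; otherwise the job can be failed right away.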

@neoformit

Yep, we're in agreement on that. I was suggesting that we could also check for a job-failed status before aborting a missing-file retry loop; that would still allow NFS issues (etc.) to be resolved for a successful job. Do you think there's a way to do that consistently? Or is there no good way for Pulsar to determine that from the job working directory?
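
A rough sketch of that check (the return-code file name and location are assumptions for illustration, not Pulsar's actual layout): if an exit code has been recorded in the working directory and it is non-zero, skip the missing-file retries and fail immediately.

```python
# Hypothetical check for a recorded job failure in the working directory.
import os


def job_appears_failed(working_directory, return_code_file="return_code"):
    rc_path = os.path.join(working_directory, return_code_file)
    if not os.path.exists(rc_path):
        # No exit code recorded yet; we cannot conclude anything.
        return False
    with open(rc_path) as fh:
        contents = fh.read().strip()
    return contents not in ("", "0")
```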

@neoformit linked a pull request Jul 15, 2023 that will close this issue