Lock file manipulation can stop the workflow if a transient generic "Input/output error" is ever encountered #4874

Closed
adamnovak opened this issue Apr 19, 2024 · 0 comments · Fixed by #4924
adamnovak commented Apr 19, 2024

Related to #4654: the Singularity cache mutex is likely to be on a networked filesystem, possibly Ceph.

When unlocking (or locking?) a file on such a filesystem, it's possible to get an "Input/output error" (errno 5, `EIO`). Usually this would indicate that the hard disk is broken or some other terrible thing is happening, but on a distributed filesystem it might just be packet loss or a dropped connection, and trying again might fix the problem.

We should retry lock file operations when they fail with a generic I/O error, instead of letting a single transient failure stop the workflow.
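
A minimal sketch of what such a retry could look like, assuming the lock is taken with `fcntl.flock`. The names `retry_on_eio` and `unlock_with_retry`, the attempt count, and the backoff delays are all hypothetical and chosen for illustration; this is not necessarily how the fix was implemented.

```python
import errno
import fcntl
import time


def retry_on_eio(operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """
    Call operation() and return its result.

    If it raises OSError with errno EIO, wait and try again, doubling the
    delay each time. Any other error, or EIO on the final attempt, is
    re-raised to the caller.

    (Hypothetical helper; the defaults are illustrative assumptions.)
    """
    for attempt in range(attempts):
        try:
            return operation()
        except OSError as e:
            # Only a generic "Input/output error" is treated as transient.
            if e.errno != errno.EIO or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def unlock_with_retry(fd):
    """Release a flock() lock on fd, retrying on transient EIO."""
    retry_on_eio(lambda: fcntl.flock(fd, fcntl.LOCK_UN))
```

Only `EIO` is retried here, so errors such as `EACCES` or `EBADF` still fail immediately; and because the attempts are bounded, a genuinely broken disk is still reported rather than retried forever.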

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1541
