Run most file locking through functions that handle Ceph EIO #4924

Merged
19 commits merged on May 24, 2024

Conversation

adamnovak
Member

This will fix #4874.

But it also makes us retry locks forever if fcntl keeps giving an EIO. Since officially the syscall isn't meant to give EIO at all, maybe this is fine?
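
The retry behaviour described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: `lock_retrying_on_eio`, its `lock_func` parameter, and the `delay` default are hypothetical names chosen for the example. It treats EIO as transient and tries again, which is also why it would wait forever if the filesystem never stopped returning EIO.

```python
import errno
import fcntl
import time
from typing import Callable


def lock_retrying_on_eio(
    fd: int,
    operation: int,
    lock_func: Callable[[int, int], object] = fcntl.lockf,
    delay: float = 0.1,
) -> None:
    """
    Apply a lock operation to fd, retrying for as long as it fails with EIO.

    Hypothetical helper for illustration; not Toil's actual function.
    Any other OSError (for example EAGAIN or EACCES from a non-blocking
    attempt on a lock that someone else holds) propagates to the caller.
    """
    while True:
        try:
            lock_func(fd, operation)
            return
        except OSError as e:
            if e.errno != errno.EIO:
                # Not the transient Ceph-style error; let the caller decide.
                raise
            # Assume EIO is transient and try again after a short pause.
            # There is no retry limit, so a permanent EIO loops forever.
            time.sleep(delay)
```

A caller would use it in place of a direct `fcntl.lockf(fd, fcntl.LOCK_EX)` call. Adding a maximum retry count would bound the wait, at the cost of bringing back the failure this change is meant to tolerate.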

Changelog Entry

To be copied to the draft changelog by merger:

  • Ceph input/output errors from file locking functions are now tolerated.

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the GitHub "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@adamnovak
Member Author

@DailyDreaming This should be ready for review.

Member

DailyDreaming left a comment

LGTM. Minor suggestions.

                    else:
                        # Something else went wrong
                        os.close(fd)
                        raise
Member

Should we close the file in either case? i.e.

                except OSError as e:
                    os.close(fd)
                    if e.errno in (errno.EACCES, errno.EAGAIN):
                        # File is still locked by someone else.
                        # Look at the next file instead
                        continue
                    else:
                        # Something else went wrong
                        raise

Member Author

Yeah, we do need a close before continuing.
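
As a self-contained sketch of the pattern agreed on here, assuming a loop that scans candidate lock files and takes the first one nobody else holds: the descriptor is closed on every failure path, whether the loop moves on or the error is re-raised. The function name `lock_first_available` and its signature are hypothetical, not Toil's.

```python
import errno
import fcntl
import os
from typing import Iterable, Optional


def lock_first_available(paths: Iterable[str]) -> Optional[int]:
    """
    Return an FD holding an exclusive lock on the first path that is not
    locked by another process, or None if every path is locked.

    Hypothetical helper for illustration only.
    """
    for path in paths:
        fd = os.open(path, os.O_RDWR)
        try:
            # Non-blocking attempt: fails immediately if the lock is held.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as e:
            # We did not get the lock, so this FD is of no further use,
            # whether we skip the file or give up entirely.
            os.close(fd)
            if e.errno in (errno.EACCES, errno.EAGAIN):
                # File is still locked by someone else; look at the next one.
                continue
            # Something else went wrong.
            raise
        return fd
    return None
```

Closing before the `continue` matters because otherwise each skipped file would leak one descriptor per pass through the loop.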

                    else:
                        # Something went wrong
                        os.close(dirFD)
                        raise
Member

Suggestion:

                except OSError as e:
                    os.close(dirFD)
                    if e.errno not in (errno.EACCES, errno.EAGAIN):
                        # Not a locked file error.  Something went wrong
                        raise

            pass
        else:
            raise
    os.close(fd)
Member

Suggestion:

    try:
        fcntl.lockf(fd, fcntl.LOCK_UN)
    except OSError as e:
        if e.errno != errno.EIO:
            # Sometimes Ceph produces errno.EIO. We don't need to retry because
            # we're going to close the FD and after that the file can't remain
            # locked by us.
            raise
    os.close(fd)

Member Author

This doesn't leave the comment really in the right place relative to the code.
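
For comparison, here is one way to keep the explanatory comment next to the branch it describes. It is a sketch assuming the `pass`/`else: raise` shape visible in the diff context above, wrapped in a hypothetical `unlock_and_close` function so that it runs on its own.

```python
import errno
import fcntl
import os


def unlock_and_close(fd: int) -> None:
    """
    Release the lock held through fd and close it, tolerating EIO on unlock.

    Hypothetical wrapper for illustration only.
    """
    try:
        fcntl.lockf(fd, fcntl.LOCK_UN)
    except OSError as e:
        if e.errno == errno.EIO:
            # Sometimes Ceph produces EIO here. We don't need to retry,
            # because we are about to close the FD, and after that the
            # file can't remain locked by us.
            pass
        else:
            raise
    os.close(fd)
```

With this layout the comment sits in the branch that swallows the error, so it reads as the reason for the `pass` rather than as a note on the `raise`.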

adamnovak merged commit 52f1469 into master on May 24, 2024
3 checks passed
@adamnovak
Member Author

I had to change some fcntl-based locks to flock-based locks. We were trying to use them on directories, and other changes I made revealed that they had never actually worked.

This probably opens new ways to hang a Toil workflow if you have read access to its temp directories.
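
A minimal sketch of a flock-based directory lock, assuming a Linux-like system. As far as I understand it, an exclusive fcntl-style lock needs a descriptor that is open for writing, which a directory cannot provide, while `flock()` only needs an open descriptor. That is also why read access is enough to take the lock. The function name `try_flock_directory` is hypothetical.

```python
import errno
import fcntl
import os
from typing import Optional


def try_flock_directory(path: str) -> Optional[int]:
    """
    Try to take an exclusive flock() on a directory without blocking.

    Returns the FD holding the lock, or None if the lock is held through
    another open descriptor. Hypothetical helper for illustration only.
    """
    # Directories can only be opened read-only, which flock() accepts.
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        os.close(fd)
        if e.errno in (errno.EACCES, errno.EAGAIN):
            # Someone else holds the lock.
            return None
        raise
    return fd
```

The lock is released when the returned descriptor is closed. Any process that can open the directory for reading can hold the same lock indefinitely, which matches the hang risk mentioned above.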

Development

Successfully merging this pull request may close these issues.

Lock file manipulation can stop the workflow if a transient generic "Input/output error" is ever encountered
2 participants