Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add better logic in finding a temp directory for the Toil coordination directory #4918

Merged
merged 20 commits into from
May 21, 2024

Conversation

stxue1
Copy link
Contributor

@stxue1 stxue1 commented May 6, 2024

Addresses #4914 and #4910

This currently lets the Toil bus tmpdir respect the --tmpdir-prefix from cwltool, if it exists. However, a lot of internal code depends on directory paths instead of prefixed paths, so the tmp coordination directory selection does not currently respect the full prefix path.

This does add another option to Toil's config object, without an associated add_argument (at least for wdl + default toil). If that is not ideal, maybe Toil should get a command line argument to specify where the tmp directory should be?

Changelog Entry

To be copied to the draft changelog by merger:

  • Toil coordination directory will use an available tmp directory if available (respects environmental variables)
  • Message bus's tmp directory will respect cwltool's --tmpdir-prefix

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

Copy link
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this will work right for relative-path --tmpdir-prefix values; do we know those aren't allowed?

Also, I think we could do better at actually respecting --tmpdir-prefix when it is in fact a prefix by creating a temporary directory with that prefix. Maybe take the basename and use it as prefix and the dirname and use it as dir to mkdtemp?

Also also, instead of manually carrying through the prefix to use in particular places where we make temp files on the leader, maybe we should be setting tempfile.tempdir on the leader and the worker if we have a tmpdir_prefix?

"""
# if tmpdir_prefix exists, return its parent or None depending on if it is accessible
if tmpdir_prefix is not None:
return tmpdir_prefix if os.path.isdir(tmpdir_prefix) else os.path.dirname(tmpdir_prefix)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want tmpdir_prefix if os.path.isdir(tmpdir_prefix). If tmpdir_prefix is /tmp/cwltmp- but the directory /tmp/cwltmp- happens to exist, we still want to make temproary files and directories like /tmp/cwltmp-XXXX and not change to making them in that directory. The CWL-style tmpdir prefix will end in / if it is itself a directory we are meant to make stuff in.

Comment on lines 1290 to 1291
# gettempdir returns the current working directory as a last resort
# but we don't want that, so return None if the cwd is reached
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if TMPDIR is /tmp and we happen to be standing in /tmp when running the workflow? I don't think we want to disqualify a directory just because it's the current directory. If it has to fall all the way back to the current directory as our temp directory, maybe we should in fact use it as our temp directory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we want to prioritize Toil's work dir over the current working directory? If I don't omit the current directory from gettempdir, then I think Toil's work dir will never be reached, as gettempdir() should always be returning something (unless toil is somehow installed in a nonwritable directory).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we could prioritize the workDir below actual temp directories like the ones from the environment variables and /tmp, but above the current working directory as a fallback if /tmp is unavailable.

But if the current working directory is itself /tmp, we still want to use /tmp.

Comment on lines 1280 to 1281
If tmpdir_prefix is given, assuming it is accessible, it will return its parent directory
if the path itself is not a directory path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might recommend creating a directory with that prefix. Then we could obey it.

Copy link
Contributor Author

@stxue1 stxue1 May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that. The only downside is there is no easy way without a refactor to just tack on the workflow_id onto the prefix path and use that as the coordination directory. The current solution tacks the workflow_id on but then lets get_local_workflow_coordination_dir create another subdirectory with the same workflow_id name. This is since the batchsystem cleanup depends on get_toil_coordination_dir to return the parent directory of the coordination files, even though all the coordination directory logic is there. So in the current setup, I think I have to return a full directory path. I think the only way to respect the prefix probably is to either carry arguments down the function call (as I did) or carry return values up (so get_local_workflow_coordination_dir would understand when get_toil_coordination_dir returns a prefix/is not a full directory path)

src/toil/bus.py Outdated
"""
Return a file path in tmp to store the message bus at.
Calling function is responsible for cleaning the generated file.
"""
fd, path = tempfile.mkstemp()
fd, path = tempfile.mkstemp(prefix=tmpdir_prefix)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like using an absolute path as a prefix actually does work. We might need a comment asserting this; I thought this was wrong when I looked at it, and the docs don't seem to promise that this works.

What if the --tmpdir-prefix option is a relative path? It looks like mkstemp interprets a non-absolute prefix as relative to its chosen temp directory, not relative to the current directory. Will something turn it into an absolute path before we get here? If so, we should cover that in the docstring.

Copy link
Contributor Author

@stxue1 stxue1 May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the prefix passed into mkstemp is a string it looks like it will try to prefix that into the filename itself, choosing a valid tmp directory option

>>> tmpdir_prefix = "test"
>>> fd, path = tempfile.mkstemp(prefix=tmpdir_prefix)
>>> path
'/tmp/testmo78tq3y'

Looks like it fails if it's a relative path and the parent directory that it points at doesn't exist

>>> fd, path = tempfile.mkstemp(prefix=tmpdir_prefix)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.12/tempfile.py", line 496, in mkstemp
    return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/tempfile.py", line 395, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/testfolder/testqevkrm39'

I'm not sure how cwltool handles this case but we currently don't turn it into an absolute path before here, as this carries the option directly from cwltool's argparse.parse_args.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cwltool handles relative paths by creating the tmp file relative to the current working directory. I can probably mimic that.

@mr-c
Copy link
Contributor

mr-c commented May 7, 2024

@stxue1 Thank you!

Copy link
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there must be a simpler way to accomplish:

  1. toil-cwl-runner --tmpdir-prefix whatever doesn't write to /tmp.
  2. workflow.py --workDir /shared/fs doesn't use /shared/fs for the coordination directory.

While preserving the somewhat important security guarantee of:
3. Only secure functions like tempfile.mkstemp and tempfile.mkdtemp are used to create files in world-writable temp directories.

I think that, rather than trying to solve the hard problem of getting all the workers to agree on startup about what the coordination directory is, given only a prefix for which any given name including it might be taken, we should instead expand what toil-cwl-runner already does for the workDir and, at workflow start, come up with paths obeying that prefix for the coordination directory and the Python process temp directory. (We could maybe make sure to clean them all up at CWL interpreter shutdown.)

Then we don't have to touch the message bus path generator at all (since it uses the Python process temp directory), and we don't have to let the somewhat fiddly concept of a temp directory prefix leak out of the CWL interpreter implementation to create hard problems in other parts of Toil, that then create complexity when we solve them.

If we do add tmpdir_prefix to the config, and we don't do something to teach the tempfile module to obey it, we are not going to be able to use the normal functions in tempfile to make internal Toil temp files, because they won't obey the prefix. That's a thing we'd have to remember at more or less all points in the Toil codebase, and we'd have to always pass around the config or prefix to anything that might need a temp file, whether CWL-related or not, so if we can avoid it we should.

Comment on lines 1333 to 1334
try_path('/run/lock') or
# Before trying the workdir, try the /tmp directory as tmp directories
# should be local to the node and not shared like a work directoy might be
try_path('/tmp/toil_coordination') or
# Finally, fall back on the work dir and hope it's a legit filesystem.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But now we're not actually avoiding using the workdir, for anything that isn't toil-cwl-runner.

Copy link
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think this will work. The workdir fallback is now I think dead code.

Copy link
Member

@DailyDreaming DailyDreaming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

@DailyDreaming DailyDreaming merged commit b98a20b into master May 21, 2024
1 check passed
@mr-c mr-c deleted the issues/4877-better-tmp-logic-coordination-bus branch May 21, 2024 11:02
stxue1 added a commit that referenced this pull request May 23, 2024
…n directory (#4918)

* Try /tmp before the workdir

* Add better logic for coordination dir tmpdir fallback; respect cwltool argument for message bus

* Fix prefix/dir mixup

* Carry through workflow_id to deal with prefixes in a return-directory-only function

* mypy, make workflow_id argument mandatory

* Use coordination_dir instead of a new config attribute

* Add back tempdir to coordination_dir fallback

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not write to /tmp when --tmpdir-prefix is set
4 participants