Coordination directory moves with the work directory when that's not advisable #4877

Closed
adamnovak opened this issue Apr 24, 2024 · 0 comments · Fixed by #4914
adamnovak commented Apr 24, 2024

For a few reasons, on our Slurm cluster, Toil is putting the coordination directory at the location of the work directory, since it can't use any of the available tmpfs locations.

But when a user has a workflow whose scratch can't fit in node-local scratch and moves the work directory to our Ceph share, the coordination directory goes with it. Then the LastProcessStandingArena hammers Ceph with O(n^2) file-locking attempts in rapid succession as workers check whether they are the last one alive on their node. This can hang the Ceph MDS, since the MDS apparently isn't quite correct in its internal locking setup, and in any case isn't designed for this workload.
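To illustrate the failure mode, here is a minimal sketch (not Toil's actual implementation; the function name and lock-file layout are assumptions) of the polling pattern described above. Each departing worker probes every other worker's lock file, so n workers shutting down generate O(n^2) lock attempts against whatever filesystem hosts the coordination directory:

```python
import fcntl
import glob
import os

def poll_for_last_process_standing(coordination_dir: str, my_lock: str) -> bool:
    """Sketch of the last-process-standing check: try to take a
    non-blocking exclusive lock on every other worker's lock file in the
    coordination directory. If any lock is still held, another worker is
    alive; if all are free, we are the last one standing."""
    for lock_path in glob.glob(os.path.join(coordination_dir, "*.lock")):
        if lock_path == my_lock:
            continue
        fd = os.open(lock_path, os.O_RDWR)
        try:
            # Fails immediately (instead of blocking) if the lock is held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # another worker still holds its lock
        finally:
            os.close(fd)
    return True  # every other lock was free
```

When the coordination directory sits on a networked filesystem like CephFS, every one of these lock attempts becomes a round trip to the MDS, which is why keeping the directory node-local matters.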

The workaround is to set TOIL_COORDINATION_DIR=/data/tmp whenever you set the work directory onto shared storage.
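For example, an invocation might look like this (the workflow name and the `/ceph/scratch` path are placeholders; adjust for your site):

```shell
# Keep coordination node-local while scratch lives on shared Ceph storage.
export TOIL_COORDINATION_DIR=/data/tmp
toil-cwl-runner --workDir /ceph/scratch workflow.cwl job.yml
```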

The fix is to not make the coordination directory follow the work directory, and to work harder to keep it local to the node. We should maybe just use /tmp if it's there, even if it's not technically the selected temp directory.
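A rough sketch of that proposed selection order (function name and fallback chain are assumptions for illustration, not Toil's API):

```python
import os
import tempfile

def choose_coordination_dir(work_dir: str) -> str:
    """Sketch of the proposed behavior: keep the coordination directory
    node-local instead of following the work directory, which may sit on
    shared storage."""
    # An explicit override always wins.
    env_override = os.environ.get("TOIL_COORDINATION_DIR")
    if env_override:
        return env_override
    # Prefer /tmp when it exists and is writable, even if it is not the
    # selected temp directory, so file locking stays node-local.
    if os.path.isdir("/tmp") and os.access("/tmp", os.W_OK):
        return "/tmp"
    # Fall back to the system temp dir, and only then to the work dir.
    return tempfile.gettempdir() or work_dir
```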

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1543
