Coordination directory moves with the work directory when that's not advisable #4877

Closed
adamnovak opened this issue Apr 24, 2024 · 0 comments · Fixed by #4914
adamnovak commented Apr 24, 2024

For a few reasons, on our Slurm cluster, Toil is putting the coordination directory at the location of the work directory, since it can't use any of the available tmpfs locations.

But when a user has a workflow whose scratch can't fit in node-local scratch and moves the work directory to our Ceph share, the coordination directory goes with it. Then the LastProcessStandingArena hammers Ceph with O(n^2) file-locking attempts in rapid succession as workers check whether they are the last one alive on their node. This can hang the Ceph MDS, since the MDS apparently isn't quite correct in its internal locking setup, and in any case isn't designed for this workload.
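To illustrate the failure mode, here is a minimal sketch (not Toil's actual implementation; the function name and lock-file layout are assumptions) of the polling pattern described above. Each departing worker probes every other worker's lock file, so n workers shutting down generate O(n^2) lock attempts against whatever filesystem hosts the coordination directory:

```python
import fcntl
import glob
import os

def poll_for_last_process_standing(coordination_dir: str, my_lock: str) -> bool:
    """Sketch of the last-process-standing check: try to take a
    non-blocking exclusive lock on every other worker's lock file in the
    coordination directory. If any lock is still held, another worker is
    alive; if all are free, we are the last one standing."""
    for lock_path in glob.glob(os.path.join(coordination_dir, "*.lock")):
        if lock_path == my_lock:
            continue
        fd = os.open(lock_path, os.O_RDWR)
        try:
            # Fails immediately (instead of blocking) if the lock is held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # another worker still holds its lock
        finally:
            os.close(fd)
    return True  # every other lock was free
```

When the coordination directory sits on a networked filesystem like CephFS, every one of these lock attempts becomes a round trip to the MDS, which is why keeping the directory node-local matters.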

The workaround is to set TOIL_COORDINATION_DIR=/data/tmp whenever you set the work directory onto shared storage.
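For example, an invocation might look like this (the workflow name and the `/ceph/scratch` path are placeholders; adjust for your site):

```shell
# Keep coordination node-local while scratch lives on shared Ceph storage.
export TOIL_COORDINATION_DIR=/data/tmp
toil-cwl-runner --workDir /ceph/scratch workflow.cwl job.yml
```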

The fix is to not make the coordination directory follow the work directory, and to work harder to keep it local to the node. We should maybe just use /tmp if it's there, even if it's not technically the selected temp directory.
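A rough sketch of that proposed selection order (function name and fallback chain are assumptions for illustration, not Toil's API):

```python
import os
import tempfile

def choose_coordination_dir(work_dir: str) -> str:
    """Sketch of the proposed behavior: keep the coordination directory
    node-local instead of following the work directory, which may sit on
    shared storage."""
    # An explicit override always wins.
    env_override = os.environ.get("TOIL_COORDINATION_DIR")
    if env_override:
        return env_override
    # Prefer /tmp when it exists and is writable, even if it is not the
    # selected temp directory, so file locking stays node-local.
    if os.path.isdir("/tmp") and os.access("/tmp", os.W_OK):
        return "/tmp"
    # Fall back to the system temp dir, and only then to the work dir.
    return tempfile.gettempdir() or work_dir
```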

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1543
