For a few reasons, on our Slurm cluster, Toil is putting the coordination directory at the location of the work directory, since it can't use any of the available tmpfs locations.
But then when a user has a workflow whose scratch can't fit in node-local scratch, and moves the work directory to our Ceph share, the coordination directory goes with it. Then the LastProcessStandingArena will hammer Ceph with O(n^2) file locking attempts in rapid succession as workers try to see if they are the last one alive on their node. This can hang the Ceph MDS, since the Ceph MDS apparently isn't quite correct in its internal locking setup, and in any case isn't designed for this workload.
The workaround is to set TOIL_COORDINATION_DIR=/data/tmp whenever you set the work directory onto shared storage.
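For anyone launching workflows from a wrapper script, the workaround could be applied automatically along these lines. This is only a sketch: the `toil_env` helper and the `/ceph` mount prefix are hypothetical, and `/data/tmp` is just the node-local path used on our cluster. Only the `TOIL_COORDINATION_DIR` variable itself comes from Toil.

```python
import os

# Hypothetical: mount points that are shared storage on this cluster.
SHARED_PREFIXES = ("/ceph",)


def toil_env(work_dir, base_env=None, shared_prefixes=SHARED_PREFIXES,
             coordination_dir="/data/tmp"):
    """Return an environment for launching Toil with the given work directory.

    If the work directory is under a shared-storage prefix, point
    TOIL_COORDINATION_DIR at a node-local path, unless it is already set.
    """
    env = dict(os.environ if base_env is None else base_env)
    work_dir = os.path.abspath(work_dir)
    on_shared = any(
        work_dir == prefix.rstrip("/")
        or work_dir.startswith(prefix.rstrip("/") + "/")
        for prefix in shared_prefixes
    )
    if on_shared:
        # setdefault keeps an explicit setting made by the user.
        env.setdefault("TOIL_COORDINATION_DIR", coordination_dir)
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when starting the workflow.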
The fix is to not make the coordination directory follow the work directory, and to work harder to keep it local to the node. We should maybe just use /tmp if it's there, even if it's not technically the selected temp directory.
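A rough sketch of the selection order this implies is below. It is not Toil's actual code: `pick_coordination_dir` and its parameters are hypothetical names, and it assumes that any candidate directory that exists and is writable is node-local, which a real fix would need to verify.

```python
import os


def pick_coordination_dir(work_dir, env=None, candidates=("/tmp",)):
    """Choose a coordination directory without following the work directory.

    Order: explicit TOIL_COORDINATION_DIR, then the first usable candidate
    (assumed node-local), then the work directory as a last resort.
    """
    env = os.environ if env is None else env
    override = env.get("TOIL_COORDINATION_DIR")
    if override:
        return override
    for candidate in candidates:
        # Usable here means: an existing directory we can create files in.
        if os.path.isdir(candidate) and os.access(candidate, os.W_OK | os.X_OK):
            return candidate
    # Nothing local is available, so fall back to the old behavior.
    return work_dir
```

The point of the ordering is that moving the work directory onto shared storage no longer moves the coordination directory, unless no local candidate is usable.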
┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1543