Implement *real* per-node cleanup for Slurm specifically #4882

Open
adamnovak opened this issue Apr 25, 2024 · 0 comments

adamnovak commented Apr 25, 2024

Building on #4775: Slurm currently implements worker cleanup using the base class's fallback, in which the last running job on a node does the cleanup.

If we use this with caching on our Slurm cluster, workflows that run one job at a time on a node will perform poorly. Each job will launch, download files into the cache, finish working, see that it is the last job on the node, and clean up. The next job scheduled on that node will then find an empty cache and have to fill it again.
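To make the cost concrete, here is a toy model of the "last job on the node cleans up" behavior. It is only an illustration: `ToyNode` and `count_downloads` are hypothetical names, not Toil code, and the model assumes jobs run strictly one at a time on a single node.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class ToyNode:
    """Hypothetical stand-in for the cache state of one Slurm node."""
    cache: Set[str] = field(default_factory=set)
    running: int = 0
    downloads: int = 0

    def run_job(self, inputs: Iterable[str], cleanup_when_last: bool) -> None:
        self.running += 1
        for name in inputs:
            if name not in self.cache:
                # Cache miss: the file has to be downloaded again.
                self.downloads += 1
                self.cache.add(name)
        self.running -= 1
        # Fallback behavior: the last running job on the node cleans up.
        if cleanup_when_last and self.running == 0:
            self.cache.clear()


def count_downloads(num_jobs: int, inputs: Iterable[str], cleanup_when_last: bool) -> int:
    """Run num_jobs jobs back to back on one node and count downloads."""
    node = ToyNode()
    inputs = list(inputs)
    for _ in range(num_jobs):
        node.run_job(inputs, cleanup_when_last)
    return node.downloads


if __name__ == "__main__":
    files = ["ref.fa", "index.bin"]
    print(count_downloads(3, files, cleanup_when_last=True))
    print(count_downloads(3, files, cleanup_when_last=False))
```

In this model, three sequential jobs sharing two input files download six times with per-job cleanup and only twice when cleanup is deferred.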

We should implement real cleanup for Slurm instead of relying on what it inherits from AbstractGridEngineBatchSystem. The Slurm batch system should keep a set of all the Slurm node names that workflow jobs have run on, and at shutdown it should issue special cleanup jobs, each pre-assigned to one of those nodes, to do the cleanup work.
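A rough sketch of that bookkeeping might look like the following. It is not the Toil implementation: `NodeCleanupTracker`, its methods, and the cleanup command are hypothetical placeholders, and how the batch system would learn each job's node name is left out. The only real interface assumed is `sbatch` with its `--nodelist`, `--nodes`, `--ntasks`, `--job-name`, and `--wrap` options.

```python
import shlex
import subprocess
from typing import Callable, List, Set


class NodeCleanupTracker:
    """Hypothetical helper: remember nodes used, then clean each one at shutdown."""

    def __init__(self, cleanup_command: List[str], run: Callable = subprocess.run):
        # cleanup_command is a placeholder for whatever does the real cleanup work.
        self.cleanup_command = cleanup_command
        self._run = run  # injectable so the sketch can be exercised without Slurm
        self.nodes_used: Set[str] = set()

    def record_node(self, node_name: str) -> None:
        """Call whenever a workflow job is found to have run on node_name."""
        if node_name:
            self.nodes_used.add(node_name)

    def cleanup_job_args(self, node_name: str) -> List[str]:
        """Build an sbatch command line pinned to a single node."""
        return [
            "sbatch",
            f"--nodelist={node_name}",  # pre-assign the job to this node
            "--nodes=1",
            "--ntasks=1",
            "--job-name=toil_cleanup",  # assumed job name
            f"--wrap={shlex.join(self.cleanup_command)}",
        ]

    def shutdown(self) -> List[List[str]]:
        """Issue one cleanup job per recorded node and return the commands used."""
        issued = []
        for node in sorted(self.nodes_used):
            args = self.cleanup_job_args(node)
            self._run(args, check=True)
            issued.append(args)
        self.nodes_used.clear()
        return issued
```

A real version would also have to decide what to do when a recorded node is down or drained at shutdown, since a job pinned to that node could then wait in the queue indefinitely.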

Since Slurm doesn't schedule based on disk usage, we don't have to worry, at the Slurm level, about there being no active Slurm job to own the cache.

We'll still have to deal with Slurm sometimes not sending the next job to the node that the previous job just cached files on. Eventually we might want data gravity. But that's going to need #3071 and will probably be a whole separate system.

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1546

@stxue1 stxue1 mentioned this issue Apr 26, 2024
19 tasks