snakemake main job hangs indefinitely; no new jobs submitted on slurm in newer versions #759

Closed
Kelzor opened this issue Nov 20, 2020 · 9 comments

@Kelzor

Kelzor commented Nov 20, 2020

Snakemake version
Version ≥ 5.26.0 (and possibly other newer versions)

Describe the bug

I am running a large Snakemake pipeline with ~90k steps/jobs on a SLURM cluster, using the following submission command:

snakemake --snakefile Snakefile -j 50 --use-conda --keep-target-files --keep-going --rerun-incomplete --latency-wait 30 --cluster "sbatch -A keblevin --mem=32768 -n 1 -c 8 -t 3:00:00 -e /scratch/keblevin/11_19_20_megapipe_MTBC_comp_data/slurm_outputs/slurm.%j.err -o /scratch/keblevin/11_19_20_megapipe_MTBC_comp_data/slurm_outputs/slurm.%j.out"

New jobs slowly stop being submitted, such that after 30 minutes to an hour only 30 jobs are maintained in the queue, then 20, then 10, etc. After about 1-2 hours or 800-1500 jobs, new jobs are no longer submitted at all and the main job hangs indefinitely. The slurm.err log has no errors and also supports the observation that the main job is just hanging without submitting new jobs. Here are the last 25 lines of a slurm.err, which is not updated again until the job is manually cancelled. In this log, the main job ran overnight for ~8 hours without submitting any new jobs; I cancelled it the following morning.

Logs

Removing temporary output file output/New_Zealand_BLENHEIM_2000_3990/New_Zealand_BLENHEIM_2000_3990.sam.
[Tue Nov 17 23:32:05 2020]
Finished job 42963.
1080 of 86656 steps (1%) done
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.1.sai.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.2.sai.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.1.trimmed.fq.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.2.trimmed.fq.
[Tue Nov 17 23:35:55 2020]
Finished job 30728.
1081 of 86656 steps (1%) done

[Tue Nov 17 23:36:03 2020]
rule sam_to_bam:
input: output/Germany_2012_2110/Germany_2012_2110.sam
output: output/Germany_2012_2110/Germany_2012_2110.bam
jobid: 30727
wildcards: sample=Germany_2012_2110

Submitted job 30727 with external jobid 'Submitted batch job 6018753'.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110.sam.
[Tue Nov 17 23:41:39 2020]
Finished job 30727.
1082 of 86656 steps (1%) done
slurmstepd: error: *** JOB 6014276 ON cg17-3 CANCELLED AT 2020-11-18T07:56:39 ***

Minimal example
A minimal example to reproduce this would be any workflow that submits thousands of jobs to a SLURM system, running over several hours.

I was initially using v5.28.0, then tried downgrading after reading #724. I downgraded incrementally to 5.26 and kept encountering a stalled main job. I then jumped down to 5.3.0, and with that version Snakemake maintained the expected number of jobs in the queue until completion.
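
For reference, pinning an older release in a fresh environment can be done along these lines (the exact command depends on how Snakemake was installed; these are illustrative, not the commands used above):

# conda/mamba-based install
conda create -n snakemake-5.3.0 -c conda-forge -c bioconda snakemake=5.3.0
# pip-based install
pip install snakemake==5.3.0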

I have no idea why this snakemake-slurm timeout/miscommunication would be happening. I couldn't find a similar issue out there, and I thought others should be aware.

Thanks!
Kelly

@Kelzor Kelzor added the bug Something isn't working label Nov 20, 2020
@lucblassel

I had a similar problem recently and it turned out to be the job-status script, which didn't always produce the correct output.

So now I use the following as slurm-status.py:

#!/usr/bin/env python3
import subprocess
import sys

# Snakemake passes the external job id as the last argument.
jobid = sys.argv[-1]

# Query SLURM accounting for the job state (first line, first field only).
output = subprocess.check_output(
    "sacct -j %s --format State --noheader | head -1 | awk '{print $1}'" % jobid,
    shell=True,
).decode().strip()

running_status = ["PENDING", "CONFIGURING", "COMPLETING", "RUNNING", "SUSPENDED", "PREEMPTED"]
if "COMPLETED" in output:
    print("success")
elif any(r in output for r in running_status):
    print("running")
else:
    print("failed")

and submit my jobs with the following command

snakemake \
    --configfile <path/to/config> \
    --cluster "sbatch -c {threads} --mem {params.mem} -J {params.name}" \
    --cluster-status <path/to/slurm-status.py>
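
Snakemake invokes the --cluster-status script with the external job id as its argument and expects it to print one of success, running, or failed, so the script can be sanity-checked by hand before wiring it in (the job id below is just the one from the log above; any id known to sacct works):

chmod +x slurm-status.py
./slurm-status.py 6018753
# prints e.g. "running"

Many setups also add --parsable to sbatch so that only the bare numeric job id is recorded as the external job id.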

@marrip

marrip commented Aug 20, 2021

Hey,

I had similar issues. I work with WGS and had to split some of my jobs into several hundred subtasks, ending up with about 15,000 jobs. I limited the job number to 100 and observed that over time it decreased to 30-40 jobs. I checked the head node resources and saw that snakemake was running at full capacity (100% CPU). Would it be possible to give snakemake more threads or something similar?

@johanneskoester
Contributor

Thanks a lot for reporting. So, this can basically be one of two things:

  1. If you have no job status script, Snakemake relies on status files being generated. With lots of jobs and an unstable network file system, these status files can appear very late. Snakemake will then at some point wait for them to appear and stop submitting new jobs. This should be visible as a lack of progress, i.e. Snakemake no longer reporting additional finished jobs. To get more insight, you can run Snakemake with --verbose, so that you see the amount of free resources, the ready jobs, and which ones are selected (see the example command after this list).
  2. If you have a status script, it can be that it is somehow killed by something else. So far, Snakemake silently ignored killed status scripts and just retried after some seconds. This can however lead to the described behavior if every invocation is immediately killed. I have now changed this behavior in fix: improved error handling for cluster status scripts and smarter job selector choice in case of cluster submission (use greedy for single jobs). #1142 such that errors of this kind are detected and reported after 10 failures.
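
As a concrete illustration of point 1 and the status-script setup (account, resources, and paths below are placeholders based on the original report, not a verified configuration):

snakemake --snakefile Snakefile -j 50 --verbose \
    --cluster "sbatch -A <account> --mem=32768 -c 8 -t 3:00:00 --parsable" \
    --cluster-status /path/to/slurm-status.py

With --verbose, each scheduling round reports the free resources and the ready/selected jobs, which shows whether Snakemake has stopped selecting jobs or is merely waiting for finished jobs to be registered.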

@marrip

marrip commented Aug 26, 2021

Hey Johannes,

I am using snakemake with DRMAA and, as far as I understand, there are no status scripts involved there, right? Rather, snakemake communicates directly with SLURM via DRMAA. I am not quite sure where the bottleneck is, to be honest, but over about 12 minutes my running/queued jobs on the cluster declined from 100 to 30 and snakemake seems to be extremely busy (100% CPU). Is there any way to configure snakemake/the cluster profile differently to handle over 10,000 jobs without experiencing this dip? Any advice would be highly appreciated 😃
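
For reference, a DRMAA-based invocation looks roughly like this (the native arguments are placeholders; Snakemake requires the string passed to --drmaa to be quoted and to start with a leading space):

snakemake --snakefile Snakefile -j 100 \
    --drmaa " -A <account> --mem=32768 -c 8 -t 3:00:00" \
    --latency-wait 60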

@johanneskoester
Contributor

Thanks a lot for letting us know, @marrip. I think I found the issue for the DRMAA case. It should be fixed with PR #1156. We currently do not have a test case for DRMAA; could you try this out on your system?
Also, if you feel like getting into this, I would be very interested in a PR that adds a DRMAA test case (this is complicated because one first needs to set up DRMAA in the CI, e.g. via some docker container).

@johanneskoester
Contributor

I think all issues mentioned here should be resolved now and will be part of the upcoming release. Feel free to reopen if I am wrong.

@iamh2o
Contributor

iamh2o commented Nov 25, 2021

@johanneskoester - I'm doing a lot of work on a reasonably large cluster and have been hitting similar issues, with jobs numbering in the 100,000+ range always stalling. I've tried several things; rolling back pretty far helped a lot.

But I just realized the cluster has drmaa and drmaa2 -- I'm sorting out getting drmaa working, and I would make an attempt at adding support for drmaa2. It's not backward compatible, but it seems not very different from v1. I'd love to get a lightning overview of the pieces involved if there is someone up to chat for a bit.

thanks-
jm

@joshsimcock

I've also experienced a very similar issue using a snakemake pipeline with an sge cluster.

snakemake version:
7.3.8

The problem seems to be a growing delay between jobs finishing on the cluster and snakemake realizing they have finished. Snakemake therefore thinks it is at the maximum set by the jobs parameter and does not submit new jobs until it finally registers that the older ones have finished.

This delay can be seen in the log for the main snakemake job. Jobs generated by snakemake in one part of the pipeline all finish in around 60 seconds. However, snakemake initially doesn't register them as finished for a few minutes, and this delay keeps growing until there is eventually over an hour between a job actually finishing and it appearing as finished in the snakemake log. The number of submitted jobs diminishes over time until they trickle through one at a time. For a large dataset with over 50k jobs this is a real killer.

I have no idea what is causing this behaviour and have tried several different job check scripts, including the one provided for SGE here, but with no improvement.
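
For anyone debugging the SGE side, a minimal --cluster-status sketch along the same lines as the SLURM script above might look like this. It assumes Grid Engine's qstat/qacct are on PATH and that accounting is enabled; it is untested and only a starting point:

#!/usr/bin/env python3
# Untested SGE status sketch for --cluster-status; adapt to your site.
import subprocess
import sys

jobid = sys.argv[-1]

# qstat -j exits with 0 while the job is still pending or running.
if subprocess.run(["qstat", "-j", jobid],
                  stdout=subprocess.DEVNULL,
                  stderr=subprocess.DEVNULL).returncode == 0:
    print("running")
    sys.exit(0)

# After the job leaves the queue, qacct has the accounting record.
try:
    acct = subprocess.check_output(["qacct", "-j", jobid],
                                   stderr=subprocess.DEVNULL).decode()
except subprocess.CalledProcessError:
    # Accounting record not written yet; let Snakemake poll again.
    print("running")
    sys.exit(0)

failed = any(
    line.split()[-1] != "0"
    for line in acct.splitlines()
    if line.strip().startswith("exit_status")
)
print("failed" if failed else "success")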

@MostafaYA

Similar issue here on a Linux server (no cluster), Snakemake version 7.15.1.
Switching to --scheduler greedy was helpful, as noted here.
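
For completeness, that just means adding the scheduler flag to the invocation, e.g. (core count is a placeholder):

snakemake --cores 32 --scheduler greedy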
