snakemake main job hangs indefinitely; no new jobs submitted on slurm in newer versions #759

Closed
Kelzor opened this issue Nov 20, 2020 · 9 comments

@Kelzor

Kelzor commented Nov 20, 2020

Snakemake version
Version ≥ 5.26.0 (and possibly other newer versions)

Describe the bug

I am running a large Snakemake pipeline with ~90k steps/jobs on a SLURM cluster, using the following submission command:

snakemake --snakefile Snakefile -j 50 --use-conda --keep-target-files --keep-going --rerun-incomplete --latency-wait 30 --cluster "sbatch -A keblevin --mem=32768 -n 1 -c 8 -t 3:00:00 -e /scratch/keblevin/11_19_20_megapipe_MTBC_comp_data/slurm_outputs/slurm.%j.err -o /scratch/keblevin/11_19_20_megapipe_MTBC_comp_data/slurm_outputs/slurm.%j.out"

New jobs slowly stop being submitted, such that after 30 minutes to an hour only 30 jobs are maintained in the queue, then 20, then 10, etc. After about 1-2 hours or 800-1500 jobs, new jobs are no longer submitted at all and the main job hangs indefinitely. The slurm.err log has no errors and also supports the observation that the main job is just hanging without submitting new jobs. Here are the last 25 lines of a slurm.err, which is not updated again until the job is manually cancelled. In this log, the main job ran overnight for ~8 hours without submitting any new jobs; I cancelled it the following morning.

Logs

Removing temporary output file output/New_Zealand_BLENHEIM_2000_3990/New_Zealand_BLENHEIM_2000_3990.sam.
[Tue Nov 17 23:32:05 2020]
Finished job 42963.
1080 of 86656 steps (1%) done
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.1.sai.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.2.sai.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.1.trimmed.fq.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.2.trimmed.fq.
[Tue Nov 17 23:35:55 2020]
Finished job 30728.
1081 of 86656 steps (1%) done

[Tue Nov 17 23:36:03 2020]
rule sam_to_bam:
input: output/Germany_2012_2110/Germany_2012_2110.sam
output: output/Germany_2012_2110/Germany_2012_2110.bam
jobid: 30727
wildcards: sample=Germany_2012_2110

Submitted job 30727 with external jobid 'Submitted batch job 6018753'.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110.sam.
[Tue Nov 17 23:41:39 2020]
Finished job 30727.
1082 of 86656 steps (1%) done
slurmstepd: error: *** JOB 6014276 ON cg17-3 CANCELLED AT 2020-11-18T07:56:39 ***

Minimal example
A minimal example to reproduce this would be any workflow that submits thousands of jobs to a SLURM system, running over several hours.

I was initially using v5.28.0, then tried downgrading after reading #724. I downgraded incrementally to 5.26 and kept encountering a stalled main job. I then jumped down to 5.3.0, and with that version Snakemake maintained the expected number of jobs in the queue until completion.
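
For reference, pinning an older release in a fresh environment can be done along these lines (the exact command depends on how Snakemake was installed; these are illustrative, not the commands used above):

# conda/mamba-based install
conda create -n snakemake-5.3.0 -c conda-forge -c bioconda snakemake=5.3.0
# pip-based install
pip install snakemake==5.3.0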

I have no idea why this snakemake-slurm timeout/miscommunication would be happening. I couldn't find a similar issue out there, and I thought others should be aware.

Thanks!
Kelly

@Kelzor Kelzor added the bug Something isn't working label Nov 20, 2020
@lucblassel

I had a similar problem recently and it turned out to be the job-status script, which didn't always produce the correct output.

So now I use the following as slurm-status.py:

#!/usr/bin/env python3
import subprocess
import sys

# Snakemake passes the external job id as the last argument.
jobid = sys.argv[-1]

# Query SLURM accounting for the job state (first line, first field only).
output = subprocess.check_output(
    "sacct -j %s --format State --noheader | head -1 | awk '{print $1}'" % jobid,
    shell=True,
).decode().strip()

running_status = ["PENDING", "CONFIGURING", "COMPLETING", "RUNNING", "SUSPENDED", "PREEMPTED"]
if "COMPLETED" in output:
    print("success")
elif any(r in output for r in running_status):
    print("running")
else:
    print("failed")

and submit my jobs with the following command

snakemake \
    --configfile <path/to/config> \
    --cluster "sbatch -c {threads} --mem {params.mem} -J {params.name}" \
    --cluster-status <path/to/slurm-status.py>
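
Snakemake invokes the --cluster-status script with the external job id as its argument and expects it to print one of success, running, or failed, so the script can be sanity-checked by hand before wiring it in (the job id below is just the one from the log above; any id known to sacct works):

chmod +x slurm-status.py
./slurm-status.py 6018753
# prints e.g. "running"

Many setups also add --parsable to sbatch so that only the bare numeric job id is recorded as the external job id.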

@marrip

marrip commented Aug 20, 2021

Hey,

I had similar issues. I work with WGS and had to split some of my jobs into several hundred subtasks, ending up with about 15,000 jobs. I limited the job number to 100 and observed that over time it decreased to 30-40 jobs. I checked the head node resources and saw that snakemake was running at full capacity (100% CPU). Would it be possible to give snakemake more threads or something similar?

@johanneskoester
Contributor

Thanks a lot for reporting. So, this can basically be one of two things:

  1. If you have no job status script, Snakemake relies on status files being generated. With lots of jobs and an unstable network file system, these status files can appear very late. Snakemake will then at some point wait for them to appear and stop submitting new jobs. This should be visible as a lack of progress, i.e. Snakemake no longer reporting additional finished jobs. To get more insight, you can run Snakemake with --verbose, so that you see the amount of free resources, the ready jobs, and which ones are selected (see the example command after this list).
  2. If you have a status script, it can be that it is somehow killed by something else. So far, Snakemake silently ignored killed status scripts and just retried after some seconds. This can however lead to the described behavior if every invocation is immediately killed. I have now changed this behavior in fix: improved error handling for cluster status scripts and smarter job selector choice in case of cluster submission (use greedy for single jobs). #1142 such that errors of this kind are detected and reported after 10 failures.
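
As a concrete illustration of point 1 and the status-script setup (account, resources, and paths below are placeholders based on the original report, not a verified configuration):

snakemake --snakefile Snakefile -j 50 --verbose \
    --cluster "sbatch -A <account> --mem=32768 -c 8 -t 3:00:00 --parsable" \
    --cluster-status /path/to/slurm-status.py

With --verbose, each scheduling round reports the free resources and the ready/selected jobs, which shows whether Snakemake has stopped selecting jobs or is merely waiting for finished jobs to be registered.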

@marrip

marrip commented Aug 26, 2021

Hey Johannes,

I am using snakemake with DRMAA and, as far as I understand, there are no status scripts involved there, right? Rather, snakemake communicates directly with SLURM via DRMAA. I am not quite sure where the bottleneck is, to be honest, but over about 12 minutes my running/queued jobs on the cluster declined from 100 to 30 and snakemake seems to be extremely busy (100% CPU). Is there any way to configure snakemake/the cluster profile differently to handle over 10,000 jobs without experiencing this dip? Any advice would be highly appreciated 😃
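
For reference, a DRMAA-based invocation looks roughly like this (the native arguments are placeholders; Snakemake requires the string passed to --drmaa to be quoted and to start with a leading space):

snakemake --snakefile Snakefile -j 100 \
    --drmaa " -A <account> --mem=32768 -c 8 -t 3:00:00" \
    --latency-wait 60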

@johanneskoester
Contributor

Thanks a lot for letting us know, @marrip. I think I found the issue for the DRMAA case. It should be fixed with PR #1156. We currently do not have a test case for DRMAA; could you try this out on your system?
Also, if you feel like getting into this, I would be very interested in a PR that adds a DRMAA test case (this is complicated because one first needs to set up DRMAA in the CI, e.g. via some docker container).

@johanneskoester
Contributor

I think all issues mentioned here should be resolved now and will be part of the upcoming release. Feel free to reopen if I am wrong.

@iamh2o
Contributor

iamh2o commented Nov 25, 2021

@johanneskoester - I'm doing a lot of work on a reasonably large cluster and have been hitting similar issues, with jobs numbering in the 100,000+ range always stalling. I've tried several things; rolling back pretty far helped a lot.

But I just realized the cluster has drmaa and drmaa2 -- I'm sorting out getting drmaa working, and I would make an attempt at adding support for drmaa2. It's not backward compatible, but it seems not very different from v1. I'd love to get a lightning overview of the pieces involved if there is someone up to chat for a bit.

thanks-
jm

@joshsimcock

I've also experienced a very similar issue using a snakemake pipeline with an sge cluster.

snakemake version:
7.3.8

The problem seems to be a growing delay between jobs finishing on the cluster and snakemake realizing they have finished. Snakemake therefore thinks it is at the maximum set by the jobs parameter and does not submit new jobs until it finally registers that the older ones have finished.

This delay can be seen in the log for the main snakemake job. Jobs generated by snakemake in one part of the pipeline all finish in around 60 seconds. However, snakemake initially doesn't register them as finished for a few minutes, and this delay keeps growing until there is eventually over an hour between a job actually finishing and it appearing as finished in the snakemake log. The number of submitted jobs diminishes over time until they trickle through one at a time. For a large dataset with over 50k jobs this is a real killer.

I have no idea what is causing this behaviour and have tried several different job check scripts, including the one provided for SGE here, but with no improvement.
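
For anyone debugging the SGE side, a minimal --cluster-status sketch along the same lines as the SLURM script above might look like this. It assumes Grid Engine's qstat/qacct are on PATH and that accounting is enabled; it is untested and only a starting point:

#!/usr/bin/env python3
# Untested SGE status sketch for --cluster-status; adapt to your site.
import subprocess
import sys

jobid = sys.argv[-1]

# qstat -j exits with 0 while the job is still pending or running.
if subprocess.run(["qstat", "-j", jobid],
                  stdout=subprocess.DEVNULL,
                  stderr=subprocess.DEVNULL).returncode == 0:
    print("running")
    sys.exit(0)

# After the job leaves the queue, qacct has the accounting record.
try:
    acct = subprocess.check_output(["qacct", "-j", jobid],
                                   stderr=subprocess.DEVNULL).decode()
except subprocess.CalledProcessError:
    # Accounting record not written yet; let Snakemake poll again.
    print("running")
    sys.exit(0)

failed = any(
    line.split()[-1] != "0"
    for line in acct.splitlines()
    if line.strip().startswith("exit_status")
)
print("failed" if failed else "success")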

@MostafaYA

Similar issue here on a Linux server (no cluster), Snakemake version 7.15.1.
Switching to --scheduler greedy was helpful, as noted here.
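
For completeness, that just means adding the scheduler flag to the invocation, e.g. (core count is a placeholder):

snakemake --cores 32 --scheduler greedy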
