snakemake main job hangs indefinitely; no new jobs submitted on slurm in newer versions #759
Comments
I had a similar problem recently, and it turned out that the job-status script didn't always emit the correct output. So now I use the following as my status script:

```python
#!/usr/bin/env python3
import subprocess
import sys

jobid = sys.argv[-1]

# Ask slurm's accounting database for the job's current state;
# `head -1` keeps only the parent job's line, `awk` keeps only the State column.
output = str(subprocess.check_output(
    "sacct -j %s --format State --noheader | head -1 | awk '{print $1}'" % jobid,
    shell=True).strip())

running_status = ["PENDING", "CONFIGURING", "COMPLETING",
                  "RUNNING", "SUSPENDED", "PREEMPTED"]

# Check COMPLETED first; it is not a substring of any state in running_status.
if "COMPLETED" in output:
    print("success")
elif any(r in output for r in running_status):
    print("running")
else:
    print("failed")
```

and submit my jobs with the following command
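The submission command itself did not survive in the comment above. As a minimal sketch of how a status script like this is typically wired in (the script path `status.py`, the job limit, and the sbatch options are illustrative assumptions, not the commenter's actual values):

```shell
# Hypothetical invocation: snakemake calls status.py with the external job id
# and expects one of "success" / "running" / "failed" on stdout.
snakemake -j 100 \
    --cluster "sbatch --mem=4G -c 1 -t 1:00:00" \
    --cluster-status ./status.py
```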
Hey, I had similar issues. I work with WGS data and had to split some of my jobs into several hundred subtasks, ending up with about 15,000 jobs. I limited the job number to 100 and observed that over time it decreased to 40, then 30 jobs. I checked the head node resources and saw that snakemake was running at resource capacity (100% CPU). Would it be possible to give snakemake more threads or something similar?
Thanks a lot for reporting. So, this can basically be two things:
Hey Johannes, I am using snakemake with DRMAA, and as far as I understand there are no status scripts involved there, right? Rather, snakemake communicates directly with slurm via DRMAA. I am not quite sure where the bottleneck is, to be honest, but over a 12-minute window my running/queued jobs on the cluster declined from 100 to 30, and snakemake seems to be extremely busy (100% CPU). Is there any way to configure snakemake/the cluster profile differently to handle over 10,000 jobs without experiencing this dip? Any advice would be highly appreciated 😃
Thanks a lot for letting us know @marrip. I think I found the issue for the DRMAA case. It should be fixed with PR #1156. We currently do not have a test case for DRMAA; could you try this out on your system?
I think all issues mentioned here should be resolved now and in the upcoming release. Feel free to reopen if I am wrong.
@johanneskoester - I'm doing a lot of work on a reasonably large cluster and have been hitting similar issues, with jobs numbering in the 100,000+ range always stalling. I've tried several things; rolling back pretty far helped a lot. But I just realized the cluster has drmaa and drmaa2. I'm sorting out getting drmaa working, and I would make an attempt at adding support for drmaa2; it's not backward compatible, but it seems not very different from v1. I'd love to get a lightning overview of the pieces involved if there is someone up to chat for a bit. Thanks!
I've also experienced a very similar issue using a snakemake pipeline with an SGE cluster. snakemake version:

The problem seems to be a growing delay between jobs finishing on the cluster and snakemake realizing they have finished. Snakemake therefore thinks it is at the maximum jobs parameter and does not submit new jobs until it finally registers that the older ones have finished. This delay can be seen in the log for the main snakemake job: jobs in one part of the pipeline all finish in around 60 seconds, but initially snakemake doesn't register them as finished for a few minutes, and the delay only grows and grows, eventually exceeding an hour between a job finishing in reality and appearing as finished in the snakemake log. The number of submitted jobs diminishes over time until they slow to a minimal drip, one at a time. For a large dataset with over 50k jobs this is a real killer. I have no idea what is causing this behaviour, and I have tried several different job-check scripts, including the one provided for SGE here, but with no improvement.
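One hedged way to see why such a backlog would grow: if the scheduler checks each active job's status sequentially, and each check shells out to `qstat`/`sacct` with a fixed cost, the time for one full polling round grows linearly with the number of active jobs. This toy model (not Snakemake's actual code; the function name and the numbers are illustrative assumptions) sketches the effect:

```python
def polling_round_seconds(active_jobs: int, seconds_per_check: float) -> float:
    """Time for one sequential status-polling round over all active jobs."""
    return active_jobs * seconds_per_check

# With 100 active jobs and a 2-second status call per job, one polling round
# takes 200 seconds, longer than a 60-second job runtime, so finished jobs
# sit unregistered and no replacements are submitted in the meantime.
print(polling_round_seconds(100, 2.0))  # → 200.0
```

Under this model, any per-check latency that exceeds (job runtime / active jobs) means snakemake falls further behind every round, matching the ever-growing delay described above.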
Similar issue here on a Linux server (no cluster). Snakemake version 7.15.1.
Snakemake version
Version ≥ 5.26.0 (and possibly other newer versions)
Describe the bug
I am running a large snakemake pipeline with ~90k steps/jobs through a slurm cluster with the following submission command:

```shell
snakemake --snakefile Snakefile -j 50 --use-conda --keep-target-files \
    --keep-going --rerun-incomplete --latency-wait 30 \
    --cluster "sbatch -A keblevin --mem=32768 -n 1 -c 8 -t 3:00:00 -e /scratch/keblevin/11_19_20_megapipe_MTBC_comp_data/slurm_outputs/slurm.%j.err -o /scratch/keblevin/11_19_20_megapipe_MTBC_comp_data/slurm_outputs/slurm.%j.out"
```
New jobs slowly stop being submitted, such that after 30 minutes to an hour only 30 jobs are maintained in the queue, then 20, then 10, etc. After about 1–2 hours, or 800–1500 jobs, new jobs are no longer submitted at all and the main job hangs indefinitely. The slurm.err log has no errors and supports the observation that the main job is simply hanging without submitting new jobs. Here are the last 25 lines of a slurm.err, which received no further updates until the job was manually cancelled. In this log, the main job ran overnight (~8 hours) without submitting any new jobs; I cancelled it the following morning.
Logs
```
Removing temporary output file output/New_Zealand_BLENHEIM_2000_3990/New_Zealand_BLENHEIM_2000_3990.sam.
[Tue Nov 17 23:32:05 2020]
Finished job 42963.
1080 of 86656 steps (1%) done
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.1.sai.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.2.sai.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.1.trimmed.fq.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110-modern.2.trimmed.fq.
[Tue Nov 17 23:35:55 2020]
Finished job 30728.
1081 of 86656 steps (1%) done
[Tue Nov 17 23:36:03 2020]
rule sam_to_bam:
    input: output/Germany_2012_2110/Germany_2012_2110.sam
    output: output/Germany_2012_2110/Germany_2012_2110.bam
    jobid: 30727
    wildcards: sample=Germany_2012_2110
Submitted job 30727 with external jobid 'Submitted batch job 6018753'.
Removing temporary output file output/Germany_2012_2110/Germany_2012_2110.sam.
[Tue Nov 17 23:41:39 2020]
Finished job 30727.
1082 of 86656 steps (1%) done
slurmstepd: error: *** JOB 6014276 ON cg17-3 CANCELLED AT 2020-11-18T07:56:39 ***
```
Minimal example
A minimal example to reproduce this would be any submission that submits thousands of jobs expected to run over several hours to a slurm system.
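As a sketch of such a reproducer, the following hypothetical Snakefile (all names and numbers are illustrative assumptions, not taken from the reporter's pipeline) generates thousands of short jobs that can be submitted with a `--cluster "sbatch ..."` command like the one above:

```
# Hypothetical minimal Snakefile: N trivial jobs, each running for about a
# minute, to stress the submission/status-check loop on a slurm cluster.
N = 5000

rule all:
    input:
        expand("out/{i}.txt", i=range(N))

rule touch_one:
    output:
        "out/{i}.txt"
    shell:
        "sleep 60 && touch {output}"
```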
I was initially using v5.28.0, then tried downgrading after reading #724. I downgraded incrementally until 5.26 and kept encountering a stalled main job. I then jumped down to 5.3. After downgrading to 5.3.0, snakemake maintained the expected number of jobs in the queue until completion.
I have no idea why this snakemake-slurm timeout/miscommunication would be happening. I couldn't find a similar issue out there, and I thought others should be aware.
Thanks!
Kelly