groupjob fails when downstream of checkpoint (re-evaluation of DAG bug?) #1331

Closed
Maarten-vd-Sande opened this issue Jan 13, 2022 · 1 comment
Labels
bug Something isn't working

Maarten-vd-Sande commented Jan 13, 2022

Snakemake version
6.12.3. The bug seems to have been introduced in version 6.5.1 by commit 4dbb7ad.

Describe the bug

I have a checkpoint, trimming, followed by an align and sort group job. When I run this with 2 cores, all is well. However, when I run it with 4 cores, what I think happens is: trimming 1 finishes -> DAG is re-evaluated -> align+sort starts -> trimming 2 finishes -> DAG is re-evaluated -> CRASH, because the align+sort output already exists but the rule hasn't finished yet.

Minimal example

Running this with 2 cores works, but running it with 4 cores causes a crash. Note that the crash isn't guaranteed to happen, so you might need to re-run this once or twice.

rule all:
    input:
        [
         "output/aligned_and_sort/1.txt",
         "output/aligned_and_sort/2.txt",
        ]


checkpoint trimming:
    output:
        "output/trimmed/{sample}.txt"
    shell:
        "touch {output}; sleep 1"


rule align:
    input:
        "output/trimmed/{sample}.txt"
    output:
        pipe("output/aligned/{sample}.txt")
    shell:
        "touch {output}; sleep 1"


rule sort:
    input:
        "output/aligned/{sample}.txt"
    output:
        "output/aligned_and_sort/{sample}.txt"
    shell:
        "touch {output}; sleep 1"
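
Because the crash is intermittent, it can help to re-run the workflow from a clean state until it fails. The following is a minimal sketch, not part of the original report: `run_until_failure`, `max_attempts`, and `output_dir` are hypothetical names, and it assumes that a crashing run exits with a non-zero status.

```python
import shutil
import subprocess


def run_until_failure(cmd, max_attempts=5, output_dir="output"):
    """Run cmd repeatedly from a clean state.

    Returns the number of the first failing attempt, or None if every
    attempt succeeded. All names here are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        # Remove earlier results so every attempt has to rebuild everything.
        shutil.rmtree(output_dir, ignore_errors=True)
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return attempt
    return None
```

For the example above, the call would be `run_until_failure(["snakemake", "--cores", "4"])`, run from the directory that contains the Snakefile.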

Output

snakemake --cores 4
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job stats:
job         count    min threads    max threads
--------  -------  -------------  -------------
align           2              1              1
all             1              1              1
sort            2              1              1
trimming        2              1              1
total           7              1              1

Select jobs to execute...

[Wed Jan 19 09:34:38 2022]
checkpoint trimming:
    output: output/trimmed/1.txt
    jobid: 3
    wildcards: sample=1
    resources: tmpdir=/tmp
Downstream jobs will be updated after completion.


[Wed Jan 19 09:34:38 2022]
checkpoint trimming:
    output: output/trimmed/2.txt
    jobid: 6
    wildcards: sample=2
    resources: tmpdir=/tmp
Downstream jobs will be updated after completion.

[Wed Jan 19 09:34:39 2022]
Finished job 3.
1 of 7 steps (14%) done
Updating job align.
Select jobs to execute...
[Wed Jan 19 09:34:39 2022]

group job bdb68102-7e52-4fa6-ac1d-6c3eb711d5fd (jobs in lexicogr. order):

    [Wed Jan 19 09:34:39 2022]
    rule align:
        input: output/trimmed/1.txt
        output: output/aligned/1.txt (pipe)
        jobid: 2
        wildcards: sample=1
        resources: tmpdir=/tmp


    [Wed Jan 19 09:34:39 2022]
    rule sort:
        input: output/aligned/1.txt
        output: output/aligned_and_sort/1.txt
        jobid: 1
        wildcards: sample=1
        resources: tmpdir=/tmp

[Wed Jan 19 09:34:39 2022]
Finished job 6.
2 of 7 steps (29%) done
Updating job align.
Select jobs to execute...
[Wed Jan 19 09:34:39 2022]

group job bdb68102-7e52-4fa6-ac1d-6c3eb711d5fd (jobs in lexicogr. order):

    [Wed Jan 19 09:34:39 2022]
    rule align:
        input: output/trimmed/1.txt
        output: output/aligned/1.txt (pipe)
        jobid: 2
        wildcards: sample=1
        resources: tmpdir=/tmp


    [Wed Jan 19 09:34:39 2022]
    rule sort:
        input: output/aligned/1.txt
        output: output/aligned_and_sort/1.txt
        jobid: 1
        wildcards: sample=1
        resources: tmpdir=/tmp

Warning: the following output files of rule align were not present when the DAG was created:
{'output/aligned/1.txt'}
Warning: the following output files of rule sort were not present when the DAG was created:
{'output/aligned_and_sort/1.txt'}
[Wed Jan 19 09:34:41 2022]
Finished job 2.
[Wed Jan 19 09:34:41 2022]
Finished job 1.
4 of 7 steps (57%) done
Select jobs to execute...
[Wed Jan 19 09:34:41 2022]

group job bdb68102-7e52-4fa6-ac1d-6c3eb711d5fd (jobs in lexicogr. order):

    [Wed Jan 19 09:34:41 2022]
    rule align:
        input: output/trimmed/2.txt
        output: output/aligned/2.txt (pipe)
        jobid: 5
        wildcards: sample=2
        resources: tmpdir=/tmp


    [Wed Jan 19 09:34:41 2022]
    rule sort:
        input: output/aligned/2.txt
        output: output/aligned_and_sort/2.txt
        jobid: 4
        wildcards: sample=2
        resources: tmpdir=/tmp

[Wed Jan 19 09:34:41 2022]
Error in group job bdb68102-7e52-4fa6-ac1d-6c3eb711d5fd:
    [Wed Jan 19 09:34:41 2022]
    Error in rule sort:
        jobid: 1
        output: output/aligned_and_sort/1.txt
        shell:
        touch output/aligned_and_sort/1.txt; sleep 1
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

    [Wed Jan 19 09:34:41 2022]
    Error in rule align:
        jobid: 2
        output: output/aligned/1.txt (pipe)
        shell:
        touch output/aligned/1.txt; sleep 1
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job sort since they might be corrupted:
output/aligned_and_sort/1.txt
Traceback (most recent call last):
  File "/home/sande/miniconda3/envs/seq2science/lib/python3.8/site-packages/snakemake/__init__.py", line 699, in snakemake
    success = workflow.execute(
  File "/home/sande/miniconda3/envs/seq2science/lib/python3.8/site-packages/snakemake/workflow.py", line 1073, in execute
    success = self.scheduler.schedule()
  File "/home/sande/miniconda3/envs/seq2science/lib/python3.8/site-packages/snakemake/scheduler.py", line 441, in schedule
    self._error_jobs()
  File "/home/sande/miniconda3/envs/seq2science/lib/python3.8/site-packages/snakemake/scheduler.py", line 557, in _error_jobs
    self._handle_error(job)
  File "/home/sande/miniconda3/envs/seq2science/lib/python3.8/site-packages/snakemake/scheduler.py", line 615, in _handle_error
    self.running.remove(job)
KeyError: JobGroup(bdb68102-7e52-4fa6-ac1d-6c3eb711d5fd,frozenset({sort, align}))
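
The `KeyError` that ends the traceback is what Python raises when `set.remove` is called for an element that is not in the set. The following toy model shows how handling the same group job twice could produce it. It is not Snakemake's scheduler code: `ToyScheduler`, `submit`, and `handle_error` are hypothetical names, and the only assumption taken from the traceback is that running jobs are tracked in a set.

```python
class ToyScheduler:
    """Hypothetical stand-in that tracks running jobs in a set."""

    def __init__(self):
        self.running = set()

    def submit(self, job):
        # Adding a job that is already in the set leaves the set unchanged.
        self.running.add(job)

    def handle_error(self, job):
        # set.remove raises KeyError if the job is not in the set.
        self.running.remove(job)


scheduler = ToyScheduler()
group = "align+sort"

# The same group is submitted twice, e.g. once after each DAG re-evaluation.
scheduler.submit(group)
scheduler.submit(group)

# Each submission fails and is handled separately.
scheduler.handle_error(group)  # first removal succeeds
try:
    scheduler.handle_error(group)  # second removal finds nothing to remove
    second_removal_failed = False
except KeyError:
    second_removal_failed = True

print(second_removal_failed)
```

This matches the log above, where the group job for sample 1 is announced twice before the error is reported.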
@Maarten-vd-Sande Maarten-vd-Sande added the bug Something isn't working label Jan 13, 2022
johanneskoester added a commit that referenced this issue Feb 18, 2022
…he same group id to different groups; bug that accidentally added already running groups of the list of ready jobs (issue #1331) (#1332)

* issue 1331

* Update Snakefile

* Update Snakefile

* fix: bug in pipe group handling that led to multiple assignments of the same group id to different groups; bug that accidentally added already running groups of the list of ready jobs

* fmt

* skip on win

Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
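
The second half of the commit message describes running groups being added to the list of ready jobs. A guard against that kind of mistake can be sketched as follows. This is an illustration of the idea only, not the code from the commit: `select_ready`, `candidates`, and `running` are hypothetical names.

```python
def select_ready(candidates, running):
    """Return the candidate jobs that are not already running, in order."""
    return [job for job in candidates if job not in running]


running = {"group-1"}
candidates = ["group-1", "group-2"]

# "group-1" is already running, so only "group-2" may be scheduled.
ready = select_ready(candidates, running)
print(ready)  # ['group-2']
```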
@johanneskoester
Contributor

I think this is solved now. Closing, but please reopen if I am wrong.
