
checkpoint: pipeline finishing early #823

Open
nick-youngblut opened this issue Jan 3, 2021 · 25 comments
Labels
bug Something isn't working

Comments

@nick-youngblut

I'm running snakemake 5.31.1 (bioconda)

When using checkpoint, the pipeline will often finish successfully without completing all steps or even generating all output files designated by the all rule.

For instance, here's the end of a recent run:

[Sun Jan  3 12:52:42 2021]
Finished job 4215.
2877 of 6702 steps (43%) done
Removing temporary output file /ebio/abt3_scratch/nyoungblut/LLPRIMER_101310211274/primers_raw/28/primers_126.txt.
[Sun Jan  3 12:52:43 2021]
Finished job 912.
2878 of 6702 steps (43%) done
Removing temporary output file /ebio/abt3_scratch/nyoungblut/LLPRIMER_101310211274/primers_raw/34/primers_191.txt.
[Sun Jan  3 12:52:43 2021]
Finished job 4515.
2879 of 6702 steps (43%) done
Removing temporary output file /ebio/abt3_scratch/nyoungblut/LLPRIMER_101310211274/primers_raw/28/primers_201.txt.
[Sun Jan  3 12:52:44 2021]
Finished job 1437.
2880 of 6702 steps (43%) done
Complete log: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/.snakemake/log/2021-01-03T115355.280837.snakemake.log

Pipeline complete! Creating report...

If I try to "finish" the run by re-running snakemake, snakemake will start from the beginning of the pipeline instead of picking up where it left off.

I don't have a minimal example yet. Maybe others who have run into this problem (or will soon) can post a minimal example from a simpler pipeline than mine.

The only checkpoint in my pipeline generates a large number of tsv files, and the output for that checkpoint rule is a directory. My aggregation function is:

def aggregate_primer_info(wildcards):
    chk_out = checkpoints.clusters_core_genes.get(**wildcards).output[0]
    F = expand(cgp_dir + 'primers_final/{plen}/primers_{x}.tsv',
               plen = config['params']['cgp']['primerprospector']['primer_lengths'],
               x = glob_wildcards(os.path.join(chk_out, 'clusters_{i}.fasta')).i)
    return F

I don't see why snakemake is ending the run early, especially since the final, aggregated output files designated in the all rule were not created upon the early-finish.
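
For reference, here is a minimal sketch of what such a checkpoint rule might look like. The rule name and the clusters_{i}.fasta pattern come from the aggregation function above; the directory path and the shell body are hypothetical placeholders, not the actual rule.

cgp_dir = 'results/cgp/'  # placeholder; the real value comes from the workflow config

checkpoint clusters_core_genes:
    output:
        # directory holding an unknown number of clusters_{i}.fasta files,
        # which the aggregation function globs after the checkpoint has run
        directory(cgp_dir + 'clusters')
    shell:
        # hypothetical body; the real rule writes many fasta files here
        "mkdir -p {output} && touch {output}/clusters_0.fasta {output}/clusters_1.fasta"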

nick-youngblut added the bug (Something isn't working) label Jan 3, 2021
@epruesse
Contributor

epruesse commented Jan 5, 2021

Could you create a minimal (well, small enough to post here) snakefile that reproduces this error? I'm seeing the same, but haven't gotten around to building that. Fixing becomes much more feasible if it is reproducible.

cc @jmeppley

@nick-youngblut
Author

I'm still having problems creating a reproducible minimal example, but I am definitely still running into this problem. I ran a pipeline for a couple of days and it finished early, nearly at the end:

[Thu Jan 14 21:44:51 2021]
Finished job 67.
72 of 87 steps (83%) done
Removing temporary output file /ebio/abt3_scratch/nyoungblut/LLPRIMER_101310211274/blastx/hits.txt.
[Thu Jan 14 22:09:43 2021]
Finished job 64.
73 of 87 steps (84%) done
Complete log: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/.snakemake/log/2021-01-14T195420.533178.snakemake.log

Pipeline complete! Creating report...

but then when I try to restart snakemake to finish those last steps, snakemake just starts from the beginning 😞

@jmeppley
Contributor

jmeppley commented Jan 15, 2021 via email

@epruesse
Contributor

I think there may be an issue with the .snakemake conversion from the <5.27 format to the >=5.27 format. I have seen similar problems, and after removing .snakemake, they haven't resurfaced. There was also an issue with the conversion where long hashes turned into directories, potentially further corrupting the .snakemake directory. Try to start fresh and see whether you can still reproduce.
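
Under that assumption, starting fresh amounts to removing the workflow's metadata directory and rerunning (paths and core count are illustrative; note that this also discards incomplete-run and provenance metadata):

rm -rf .snakemake      # drop the possibly corrupted metadata directory
snakemake --cores 8    # rerun; the metadata is recreated in the current format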

@AleSR13

AleSR13 commented Apr 10, 2021

Hello, I think I might have the same issue. I have a rule to demultiplex Illumina data, and I don't know which output to expect (or at least not the number of files). I use checkpoints to continue the analysis (fastqc, multiqc, etc.) after that rule. The problem is that it runs fine until the checkpoint. Then it properly (as far as I can tell) recalculates the DAG and then it stops (3% complete, but exit 0), so it didn't really run the downstream rules. If I restart the pipeline, it picks up where it left off and properly finishes the jobs. I guess I can put it in production like this, but it is not ideal. I am using snakemake v6.1.1 (in a conda environment).

@Maarten-vd-Sande
Contributor

This bug still exists. I don't have it with 5.18, but I do have it with higher versions (I didn't pinpoint the specific release). 6.8.0 still has it. 😭

@siebrenf

Here's a minimal reproducible example.

Tested in a clean conda environment with snakemake 6.8.0
Runs with rm -f a1.txt; rm -rf .snakemake; snakemake --cores 1

rule all:
    input:
        "b.txt"

checkpoint a:
    output:
        # to reproduce the bug:
        # >1 output required
        # one or more temp() outputs required
        # exception: will work if only the first output is temp()
        temp('a1.txt'),
        temp('a2.txt'),
    shell:
        "touch {output}"

def _checkpoint_output(wildcards):
    out = checkpoints.a.get(**wildcards).output
    return out

rule b:
    input:
        _checkpoint_output,
    output:
        temp('b.txt')
    shell:
        "touch {output}"

about the command

The command used to run this file removes the .snakemake folder. This is optional, but ensures it's not the same issue as described above. It also removes the file a1.txt, which remains after the incomplete run. Running the code a second time without removing a1.txt completes the run.

about the code

Most temp()s are added for fast rerunning only. The only one that matters is the second temp() in the checkpoint (as explained in the comments).
Using the input function _checkpoint_output() in rule b is required to reproduce the bug.

my 2 cents

This example is somewhat similar to our (@Maarten-vd-Sande's and my) use-case, in that it uses a similar input function downstream of the checkpoint (and only the rules using it are affected). However, our workflow's checkpoint has only one output. The issue is therefore probably broader, and my guess would be that it is related to input functions using checkpoints.

@siebrenf

siebrenf commented Sep 21, 2021

And here's a reproducible example matching our use-case!

Runs with rm -f a.txt; rm -f b.txt; rm -f c.txt; rm -rf .snakemake; snakemake --cores 2

rule all:
    input:
        "c.txt",

checkpoint a:
    output:
        "a.txt"
    shell:
        "touch {output}"

rule b1:
    output:
        pipe("b.pipe")
    shell:
        ""

rule b2:
    input:
        pipe("b.pipe")
    output:
        "b.txt"
    shell:
        "touch {output}"

rule c:
    input:
        "a.txt",
        "b.txt",
    output:
        "c.txt"
    shell:
        "touch {output}"

The input function from the previous example did not matter for this one. Instead it's piping (similar to #132)!

EDIT: extended the example with rule c to clearly distinguish this bug from the bug in the comment below.

@siebrenf

siebrenf commented Sep 21, 2021

I found one more edge case while making the above example (sorry!)

the following code works fine:

rule all:
    input:
        "b.txt"

checkpoint a:
    output:
        "a.txt"
    shell:
        "touch {output}"

rule b:
    input:
        "a.txt"
    output:
        "b.txt"
    shell:
        "touch {output}"

but this stops early:

rule all:
    input:
        "a.txt"

checkpoint a:
    output:
        "a.txt"
    shell:
        "touch {output}"

In this case, it seems the target rule cannot take files coming directly from a checkpoint as input.

EDIT: simplified.

@epruesse
Contributor

epruesse commented Oct 1, 2021

So this one:

rule all:
    input:
        "a.txt"

checkpoint a:
    output:
        "a.txt"
    shell:
        "touch {output}"

That sounds like it's just some overly aggressive pruning of rules that don't need to run. If you add a shell: statement to the all rule, it works. If you add an output:, it also works. The only "bug" is that the logging shows only 1/2 jobs completed, because the second job gets deleted from the DAG and then there is nothing left to do, so Snakemake exits. The only real issue would be if you wanted a message: to be printed, which won't happen.
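
To illustrate the workaround described above, here is a sketch of that last example with a no-op shell directive added to the all rule (the echo command is just a placeholder):

# giving "all" a shell directive makes it a real job, so it is not pruned from the DAG
rule all:
    input:
        "a.txt"
    shell:
        "echo all targets built"

checkpoint a:
    output:
        "a.txt"
    shell:
        "touch {output}"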

@epruesse
Contributor

epruesse commented Oct 1, 2021

Ok, so the problem with the first MRE is as follows: the job for checkpoint a is executed and a1.txt and a2.txt are created. The job finishes, and the temp files that are no longer needed are deleted. Whether a file is still needed is decided by checking, for each file, which jobs depend on it. Jobs for rules depending on the checkpoint have not been evaluated yet (their inputs are only known once the checkpoint completes), so they aren't in that list, and files needed only by those rules get deleted. The reason why the first temp() works is that that file is actually added to the required-files list as a stub.

MRE 1 works fine if you pass --no-temp and temp deletion is disabled.

epruesse added a commit to epruesse/snakemake that referenced this issue Oct 2, 2021
The output of checkpoint rules may be needed by other rules which, at this point, have not had a chance to add their file requirements to their dependency lists.
epruesse added a commit to epruesse/snakemake that referenced this issue Oct 2, 2021
- Can't have b.txt in the first test be temp() and expect it to be present; it
will correctly be deleted.
- Add a third MRE which targets more of a logging issue
epruesse added a commit to epruesse/snakemake that referenced this issue Oct 2, 2021
The main scheduler loop terminates if there are no more runnable
jobs. This tests if there still are jobs that should be run, but that
we cannot seem to get to. If so, print an error message, print the
files affected, and exit 1. The latter is important so Snakemake can,
as hotfix, be rerun in a loop until it's actually completed all
tasks.
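
A sketch of the rerun-in-a-loop hotfix that this non-zero exit status enables (flags and core count are illustrative, not prescribed by the fix):

until snakemake --cores 8 --keep-going; do
    echo "Not all targets built yet, rerunning snakemake..."
    sleep 10
done
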
@aryarm
Contributor

aryarm commented Oct 11, 2021

@epruesse and @siebrenf, could the second MRE in that list have anything to do with #1173, which was merged into v6.8.1?

I ran the second test with both Snakemake v6.8.0 and v6.8.1. It failed on v6.8.0 but had the expected output on v6.8.1.

@DonFreed
Contributor

DonFreed commented Oct 11, 2021

I was able to reproduce the second MRE with v6.9.1. The failure is not deterministic; it works about 33% of the time.

It could just be that you got (un)lucky in your test with v6.8.1?

@aryarm
Contributor

aryarm commented Oct 11, 2021

It could just be that you got (un)lucky in your test with v6.8.1?

Yup, that was it! I ran it 15 more times. It failed on 7 of those.

DonFreed added a commit to DonFreed/snakemake that referenced this issue Oct 12, 2021
- Incorporates @epruesse's fix for MRE #1
- Adds a fix for MRE #2 - properly marks group jobs as finished
- Some minor updates to tests
johanneskoester added a commit that referenced this issue Oct 25, 2021
* add failing tests 823

* fix mistakes

* black

* Fix the first two MREs from #823.

- Incorporates @epruesse's fix for MRE #1
- Adds a fix for MRE #2 - properly marks group jobs as finished
- Some minor updates to tests

* Fix tests on Windows

* Skip MRE 2 from 823 on Windows due to `pipe()` output

Co-authored-by: Maarten-vd-Sande <maartenvandersande@hotmail.com>
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
pvandyken pushed a commit to pvandyken/snakemake that referenced this issue Nov 15, 2021
johanneskoester added a commit that referenced this issue Feb 21, 2022
* add failing tests 823

* fix mistakes

* black

* Update tests from #823 MREs

- Can't have b.txt in the first test be temp() and expect it to be present; it
will correctly be deleted.
- Add a third MRE which targets more of a logging issue

* Add error message printed if #823 is encountered, also exit 1

The main scheduler loop terminates if there are no more runnable
jobs. This tests if there still are jobs that should be run, but that
we cannot seem to get to. If so, print an error message, print the
files affected, and exit 1. The latter is important so Snakemake can,
as hotfix, be rerun in a loop until it's actually completed all
tasks.

* Fix checkpoint+temp leads to incomplete run issue

The aggressive early deletion of temp files must ignore
checkpoints. Other rules depend on a checkpoint only via the first
output file of the checkpoint (flagged string) as a stub. All other
outputs marked as temp would be deleted (and have been before this
patch). Since those files are missing, the dependent rules will never
be runnable and the workflow ends before all targets have been built.

* Don't break untiljobs

* fix typo

Co-authored-by: Maarten-vd-Sande <maartenvandersande@hotmail.com>
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
@johanneskoester
Contributor

Should be fixed by merging PR #1203. Thanks to all of you for the detective work, the fixes and the examples!

@ASLeonard
Contributor

After upgrading to v7.0.0, I've come across this error when using snakemake on a cluster. I'm still testing it some more, but it appears that when a job fails on the cluster (for a known/valid reason), then once all remaining cluster jobs are done, snakemake exits with this error and lists the job that failed as a "remaining job".

BUG: Out of jobs ready to be started, but not all files built yet. Please check https://github.com/snakemake/snakemake/issues/823 for more information.
Remaining jobs:
 - <rule_name>: <file> 

@corneliusroemer
Contributor

Came across this bug (?) in v7.3.0. Not sure why.
[screenshot of the error message omitted]

Is this really a bug? The message is confusing. @johanneskoester

@nh13
Contributor

nh13 commented Apr 15, 2022

I also hit this, but in 7.3.8. There were upstream jobs that failed, and I had --keep-going set. I think it was a red herring.

@corneliusroemer
Contributor

Ah yes, I think I hit it in the same situation, and that sounds right: a red herring.

Probably a new issue should be opened to deal with it.

@FuduanPeng

It seems this problem happens randomly: most of my jobs finished successfully, but 4 of them were left unfinished.

@chrarnold
Contributor

I also had this in v7.8 last week; can someone comment on how to solve it? Is --keep-going responsible, and if so, how? I need --keep-going and usually can't remove it.

@epruesse
Contributor

epruesse commented Jul 5, 2022

@chrarnold You can use --keep-going safely. It's the same as with make: it just controls failing fast versus failing slowly. It's probably best to have it on most of the time, so that you come back Monday morning to everything done except for that one sample with 15 reads after QC (instead of to almost nothing done, because no further jobs were started after the first one failed).
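
For example (core count illustrative):

# independent branches keep running after a failure; failed targets can simply
# be retried on a later invocation
snakemake --cores 16 --keep-going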

@johanneskoester
Contributor

johanneskoester commented Nov 29, 2023

This is indeed a bug. A PR with a testcase that reproduces this would be very welcome. We think that at least one way to trigger this is by combining checkpoints with group jobs.
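
For reference, a minimal sketch of the combination described above (a checkpoint whose outputs feed group jobs). All rule names and paths are made up for illustration; this is not a confirmed reproducer.

rule all:
    input:
        "gathered.txt"

checkpoint split:
    output:
        directory("parts")
    shell:
        "mkdir -p {output} && touch {output}/part_1.txt {output}/part_2.txt"

rule process:
    input:
        "parts/part_{i}.txt"
    output:
        "processed/part_{i}.txt"
    group: "post_split"
    shell:
        "cp {input} {output}"

def gather_parts(wildcards):
    # resolve the checkpoint's directory output, then glob the pieces it produced
    parts_dir = checkpoints.split.get(**wildcards).output[0]
    ids = glob_wildcards(parts_dir + "/part_{i}.txt").i
    return expand("processed/part_{i}.txt", i=ids)

rule gather:
    input:
        gather_parts
    output:
        "gathered.txt"
    group: "post_split"
    shell:
        "cat {input} > {output}"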

@johanneskoester
Contributor

cc @christopher-schroeder

@Erythroxylum

The same bug occurred with an ecoli test run:
snakemake v7.32.4
Python v3.11.4

All jobs finished after restarting with --keep-going
Not sure if it would have finished without that flag.
