
checkpoint: pipeline finishing early #823

Open
nick-youngblut opened this issue Jan 3, 2021 · 25 comments
Labels
bug Something isn't working

Comments

@nick-youngblut

I'm running snakemake 5.31.1 (bioconda)

When using checkpoint, the pipeline will often finish successfully without completing all steps or even generating all output files designated by the all rule.

For instance, here's the end of a recent run:

[Sun Jan  3 12:52:42 2021]
Finished job 4215.
2877 of 6702 steps (43%) done
Removing temporary output file /ebio/abt3_scratch/nyoungblut/LLPRIMER_101310211274/primers_raw/28/primers_126.txt.
[Sun Jan  3 12:52:43 2021]
Finished job 912.
2878 of 6702 steps (43%) done
Removing temporary output file /ebio/abt3_scratch/nyoungblut/LLPRIMER_101310211274/primers_raw/34/primers_191.txt.
[Sun Jan  3 12:52:43 2021]
Finished job 4515.
2879 of 6702 steps (43%) done
Removing temporary output file /ebio/abt3_scratch/nyoungblut/LLPRIMER_101310211274/primers_raw/28/primers_201.txt.
[Sun Jan  3 12:52:44 2021]
Finished job 1437.
2880 of 6702 steps (43%) done
Complete log: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/.snakemake/log/2021-01-03T115355.280837.snakemake.log

Pipeline complete! Creating report...

If I try to "finish" the run by re-running snakemake, snakemake will start from the beginning of the pipeline instead of picking up where it left off.

I don't have a minimal example yet. Maybe others who have run into this problem (or will soon) can post a minimal example from a simpler pipeline than mine.

The only checkpoint in my pipeline generates a large number of tsv files, and the output for that checkpoint rule is a directory. My aggregation function is:

def aggregate_primer_info(wildcards):
    chk_out = checkpoints.clusters_core_genes.get(**wildcards).output[0]
    F = expand(cgp_dir + 'primers_final/{plen}/primers_{x}.tsv',
               plen = config['params']['cgp']['primerprospector']['primer_lengths'],
               x = glob_wildcards(os.path.join(chk_out, 'clusters_{i}.fasta')).i)
    return F

I don't see why snakemake is ending the run early, especially since the final, aggregated output files designated in the all rule were not created upon the early-finish.
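
For reference, here is a minimal sketch of what such a checkpoint rule might look like. The rule name and the clusters_{i}.fasta pattern come from the aggregation function above; the directory path and the shell body are hypothetical placeholders, not the actual rule.

cgp_dir = 'results/cgp/'  # placeholder; the real value comes from the workflow config

checkpoint clusters_core_genes:
    output:
        # directory holding an unknown number of clusters_{i}.fasta files,
        # which the aggregation function globs after the checkpoint has run
        directory(cgp_dir + 'clusters')
    shell:
        # hypothetical body; the real rule writes many fasta files here
        "mkdir -p {output} && touch {output}/clusters_0.fasta {output}/clusters_1.fasta"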

nick-youngblut added the bug (Something isn't working) label Jan 3, 2021
@epruesse
Contributor

epruesse commented Jan 5, 2021

Could you create a minimal (well, small enough to post here) snakefile that reproduces this error? I'm seeing the same, but haven't gotten around to building that. Fixing becomes much more feasible if it is reproducible.

cc @jmeppley

@nick-youngblut
Author

I'm still having problems creating a reproducible minimal example, but I am definitely still running into this problem. I ran a pipeline for a couple of days and it finished early, nearly at the end:

[Thu Jan 14 21:44:51 2021]
Finished job 67.
72 of 87 steps (83%) done
Removing temporary output file /ebio/abt3_scratch/nyoungblut/LLPRIMER_101310211274/blastx/hits.txt.
[Thu Jan 14 22:09:43 2021]
Finished job 64.
73 of 87 steps (84%) done
Complete log: /ebio/abt3_projects/software/dev/ll_pipelines/llprimer/.snakemake/log/2021-01-14T195420.533178.snakemake.log

Pipeline complete! Creating report...

but then when I try to restart snakemake to finish those last steps, snakemake just starts from the beginning 😞

@jmeppley
Contributor

jmeppley commented Jan 15, 2021 via email

@epruesse
Contributor

I think there may be an issue with the .snakemake conversion from the <5.27 format to the >=5.27 format. I have seen similar problems, and after removing .snakemake, they haven't resurfaced. There was also an issue with the conversion where long hashes turned into directories, potentially further corrupting the .snakemake directory. Try to start fresh and see whether you can still reproduce.
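
Under that assumption, starting fresh amounts to removing the workflow's metadata directory and rerunning (paths and core count are illustrative; note that this also discards incomplete-run and provenance metadata):

rm -rf .snakemake      # drop the possibly corrupted metadata directory
snakemake --cores 8    # rerun; the metadata is recreated in the current format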

@AleSR13

AleSR13 commented Apr 10, 2021

Hello, I think I might have the same issue. I have a rule to demultiplex Illumina data, and I don't know which output to expect (or at least not the number of files). I use checkpoints to continue the analysis (fastqc, multiqc, etc.) after that rule. The problem is that it runs fine until the checkpoint. Then it properly (as far as I can tell) recalculates the DAG and then it stops (3% complete, but exit 0), so it didn't really run the downstream rules. If I restart the pipeline, it picks up where it left off and properly finishes the jobs. I guess I can put it in production like this, but it is not ideal. I am using snakemake v6.1.1 (in a conda environment).

@Maarten-vd-Sande
Contributor

This bug still exists. I don't have it with 5.18, but I do have it with higher versions (I didn't pinpoint the specific release). 6.8.0 still has it. 😭

@siebrenf

Here's a minimal reproducible example.

Tested in a clean conda environment with snakemake 6.8.0
Runs with rm -f a1.txt; rm -rf .snakemake; snakemake --cores 1

rule all:
    input:
        "b.txt"

checkpoint a:
    output:
        # to reproduce the bug:
        # >1 output required
        # one or more temp() outputs required
        # exception: will work if only the first output is temp()
        temp('a1.txt'),
        temp('a2.txt'),
    shell:
        "touch {output}"

def _checkpoint_output(wildcards):
    out = checkpoints.a.get(**wildcards).output
    return out

rule b:
    input:
        _checkpoint_output,
    output:
        temp('b.txt')
    shell:
        "touch {output}"

about the command

The command used to run this file removes the .snakemake folder. This is optional, but ensures it's not the same issue as described above. It also removes the file a1.txt, which remains after the incomplete run. Running the code a second time without removing a1.txt completes the run.

about the code

Most temp()s are added for fast rerunning only. The only one that matters is the second temp() in the checkpoint (as explained in the comments).
Using the input function _checkpoint_output() in rule b is required to reproduce the bug.

my 2 cents

This example is somewhat similar to our (@Maarten-vd-Sande's and my) use-case, in that it uses a similar input function downstream of the checkpoint (and only the rules using it are affected). However, our workflow's checkpoint has only one output. The issue is therefore probably broader, and my guess would be that it is related to input functions using checkpoints.

@siebrenf

siebrenf commented Sep 21, 2021

And here's a reproducible example matching our use-case!

Runs with rm -f a.txt; rm -f b.txt; rm -f c.txt; rm -rf .snakemake; snakemake --cores 2

rule all:
    input:
        "c.txt",

checkpoint a:
    output:
        "a.txt"
    shell:
        "touch {output}"

rule b1:
    output:
        pipe("b.pipe")
    shell:
        ""

rule b2:
    input:
        pipe("b.pipe")
    output:
        "b.txt"
    shell:
        "touch {output}"

rule c:
    input:
        "a.txt",
        "b.txt",
    output:
        "c.txt"
    shell:
        "touch {output}"

The input function from the previous example did not matter for this one. Instead it's piping (similar to #132)!

EDIT: extended the example with rule c to clearly distinguish this bug from the bug in the comment below.

@siebrenf

siebrenf commented Sep 21, 2021

I found one more edge case while making the above example (sorry!)

the following code works fine:

rule all:
    input:
        "b.txt"

checkpoint a:
    output:
        "a.txt"
    shell:
        "touch {output}"

rule b:
    input:
        "a.txt"
    output:
        "b.txt"
    shell:
        "touch {output}"

but this stops early:

rule all:
    input:
        "a.txt"

checkpoint a:
    output:
        "a.txt"
    shell:
        "touch {output}"

In this case, it seems the target rule cannot take files coming directly from a checkpoint as input.

EDIT: simplified.

@epruesse
Contributor

epruesse commented Oct 1, 2021

So this one:

rule all:
    input:
        "a.txt"

checkpoint a:
    output:
        "a.txt"
    shell:
        "touch {output}"

That sounds like it's just some overly aggressive pruning of rules that don't need to run. If you add a shell: statement to the all rule, it works. If you add an output:, it also works. The only "bug" is that the logging shows only 1/2 jobs completed, because the second job gets deleted from the DAG and then there is nothing left to do, so Snakemake exits. The only real issue would be if you wanted a message: to be printed, which won't happen.
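
To illustrate the workaround described above, here is a sketch of that last example with a no-op shell directive added to the all rule (the echo command is just a placeholder):

# giving "all" a shell directive makes it a real job, so it is not pruned from the DAG
rule all:
    input:
        "a.txt"
    shell:
        "echo all targets built"

checkpoint a:
    output:
        "a.txt"
    shell:
        "touch {output}"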

@epruesse
Contributor

epruesse commented Oct 1, 2021

Ok, so the problem with the first MRE is as follows: the job for checkpoint a is executed and a1.txt and a2.txt are created. The job finishes, and the temp files that are no longer needed are deleted. Whether a file is still needed is decided by checking, for each file, which jobs depend on it. Jobs for rules depending on the checkpoint have not been evaluated yet (their inputs are only known once the checkpoint completes), so they aren't in that list, and files needed only by those rules get deleted. The reason why the first temp() works is that that file is actually added to the required-files list as a stub.

MRE 1 works fine if you pass --no-temp and temp deletion is disabled.

epruesse added a commit to epruesse/snakemake that referenced this issue Oct 2, 2021
The output of checkpoint rules may be needed by other rules which, at this point, have not had a chance to add their file requirements to their dependency lists.
epruesse added a commit to epruesse/snakemake that referenced this issue Oct 2, 2021
- Can't have b.txt in the first test be temp() and expect it to be present; it
will correctly be deleted.
- Add a third MRE which targets more of a logging issue
epruesse added a commit to epruesse/snakemake that referenced this issue Oct 2, 2021
The main scheduler loop terminates if there are no more runnable
jobs. This tests if there still are jobs that should be run, but that
we cannot seem to get to. If so, print an error message, print the
files affected, and exit 1. The latter is important so Snakemake can,
as hotfix, be rerun in a loop until it's actually completed all
tasks.
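
A sketch of the rerun-in-a-loop hotfix that this non-zero exit status enables (flags and core count are illustrative, not prescribed by the fix):

until snakemake --cores 8 --keep-going; do
    echo "Not all targets built yet, rerunning snakemake..."
    sleep 10
done
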
@aryarm
Contributor

aryarm commented Oct 11, 2021

@epruesse and @siebrenf, could the second MRE in that list have anything to do with #1173, which was merged into v6.8.1?

I ran the second test with both Snakemake v6.8.0 and v6.8.1. It failed on v6.8.0 but had the expected output on v6.8.1.

@DonFreed
Contributor

DonFreed commented Oct 11, 2021

I was able to reproduce the second MRE with v6.9.1. The failure is not deterministic; it works about 33% of the time.

It could just be that you got (un)lucky in your test with v6.8.1?

@aryarm
Contributor

aryarm commented Oct 11, 2021

It could just be that you got (un)lucky in your test with v6.8.1?

Yup, that was it! I ran it 15 more times. It failed on 7 of those.

DonFreed added a commit to DonFreed/snakemake that referenced this issue Oct 12, 2021
- Incorporates @epruesse's fix for MRE #1
- Adds a fix for MRE #2 - properly marks group jobs as finished
- Some minor updates to tests
johanneskoester added a commit that referenced this issue Oct 25, 2021
* add failing tests 823

* fix mistakes

* black

* Fix the first two MREs from #823.

- Incorporates @epruesse's fix for MRE #1
- Adds a fix for MRE #2 - properly marks group jobs as finished
- Some minor updates to tests

* Fix tests on Windows

* Skip MRE 2 from 823 on Windows due to `pipe()` output

Co-authored-by: Maarten-vd-Sande <maartenvandersande@hotmail.com>
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
pvandyken pushed a commit to pvandyken/snakemake that referenced this issue Nov 15, 2021
johanneskoester added a commit that referenced this issue Feb 21, 2022
* add failing tests 823

* fix mistakes

* black

* Update tests from #823 MREs

- Can't have b.txt in the first test be temp() and expect it to be present; it
will correctly be deleted.
- Add a third MRE which targets more of a logging issue

* Add error message printed if #823 is encountered, also exit 1

The main scheduler loop terminates if there are no more runnable
jobs. This tests if there still are jobs that should be run, but that
we cannot seem to get to. If so, print an error message, print the
files affected, and exit 1. The latter is important so Snakemake can,
as hotfix, be rerun in a loop until it's actually completed all
tasks.

* Fix checkpoint+temp leads to incomplete run issue

The aggressive early deletion of temp files must ignore
checkpoints. Other rules depend on a checkpoint only via the first
output file of the checkpoint (flagged string) as a stub. All other
outputs marked as temp would be deleted (and have been before this
patch). Since those files are missing, the dependent rules will never
be runnable and the workflow ends before all targets have been built.

* Don't break untiljobs

* fix typo

Co-authored-by: Maarten-vd-Sande <maartenvandersande@hotmail.com>
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
@johanneskoester
Contributor

Should be fixed by merging PR #1203. Thanks to all of you for the detective work, the fixes and the examples!

@ASLeonard
Contributor

After upgrading to v7.0.0, I've come across this error when using snakemake on a cluster. I'm still testing it some more, but it appears that when a job fails on the cluster (for a known/valid reason), then once all remaining cluster jobs are done, snakemake exits with this error and lists the job that failed as a "remaining job".

BUG: Out of jobs ready to be started, but not all files built yet. Please check https://github.com/snakemake/snakemake/issues/823 for more information.
Remaining jobs:
 - <rule_name>: <file> 

@corneliusroemer
Contributor

Came across this bug (?) in v7.3.0. Not sure why.
[screenshot of the error message omitted]

Is this really a bug? The message is confusing. @johanneskoester

@nh13
Contributor

nh13 commented Apr 15, 2022

I also hit this, but in 7.3.8. There were upstream jobs that failed, and I had --keep-going set. I think it was a red herring.

@corneliusroemer
Contributor

Ah yes, I think I hit it in the same situation, and that sounds right: a red herring.

Probably a new issue should be opened to deal with it.

@FuduanPeng

It seems this problem happens randomly: most of my jobs finished successfully, but 4 of them were left unfinished.

@chrarnold
Contributor

I also had this in v7.8 last week; can someone comment on how to solve it? Is --keep-going responsible, and if so, how? I need --keep-going and usually can't remove it.

@epruesse
Contributor

epruesse commented Jul 5, 2022

@chrarnold You can use --keep-going safely. It's the same as with make: it just controls failing fast versus failing slowly. It's probably best to have it on most of the time, so that you come back Monday morning to everything done except for that one sample with 15 reads after QC (instead of to almost nothing done, because no further jobs were started after the first one failed).
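
For example (core count illustrative):

# independent branches keep running after a failure; failed targets can simply
# be retried on a later invocation
snakemake --cores 16 --keep-going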

@johanneskoester
Contributor

johanneskoester commented Nov 29, 2023

This is indeed a bug. A PR with a testcase that reproduces this would be very welcome. We think that at least one way to trigger this is by combining checkpoints with group jobs.
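
For reference, a minimal sketch of the combination described above (a checkpoint whose outputs feed group jobs). All rule names and paths are made up for illustration; this is not a confirmed reproducer.

rule all:
    input:
        "gathered.txt"

checkpoint split:
    output:
        directory("parts")
    shell:
        "mkdir -p {output} && touch {output}/part_1.txt {output}/part_2.txt"

rule process:
    input:
        "parts/part_{i}.txt"
    output:
        "processed/part_{i}.txt"
    group: "post_split"
    shell:
        "cp {input} {output}"

def gather_parts(wildcards):
    # resolve the checkpoint's directory output, then glob the pieces it produced
    parts_dir = checkpoints.split.get(**wildcards).output[0]
    ids = glob_wildcards(parts_dir + "/part_{i}.txt").i
    return expand("processed/part_{i}.txt", i=ids)

rule gather:
    input:
        gather_parts
    output:
        "gathered.txt"
    group: "post_split"
    shell:
        "cat {input} > {output}"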

@johanneskoester
Contributor

cc @christopher-schroeder

@Erythroxylum

The same bug occurred with an ecoli test run:
snakemake v7.32.4
Python v3.11.4

All jobs finished after restarting with --keep-going
Not sure if it would have finished without that flag.
