
Regression after 7.7.0 until 7.15.X: rules downstream of a checkpoint aggregation rule keep being re-executed even when all output files exist and nothing upstream has changed #1818

Closed
hermidalc opened this issue Aug 18, 2022 · 32 comments
Labels
bug Something isn't working

Comments

@hermidalc

hermidalc commented Aug 18, 2022

Snakemake version

7.15.1

Describe the bug

Rules downstream of a checkpoint aggregation rule are always re-executed when both aggregation and downstream rules have already been run and output files exist.

Minimal example

Using the docs example, with two rules added downstream of aggregate: Snakemake keeps re-running the process and process2 rules even though their output files exist and nothing upstream has changed. It doesn't matter whether there is one rule after aggregate or several; all of them are re-run. If you remove the downstream rules and go back to the exact docs code, things work as expected.

# a target rule to define the desired final output
rule all:
    input:
        "processed2.txt",


# the checkpoint that shall trigger re-evaluation of the DAG
# a number of files are created in a defined directory
checkpoint somestep:
    output:
        directory("my_directory/"),
    shell:
        """
        mkdir my_directory/
        cd my_directory
        for i in 1 2 3; do touch $i.txt; done
        """


# input function for rule aggregate, return paths to all files produced by the checkpoint 'somestep'
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.somestep.get(**wildcards).output[0]
    return expand(
        "my_directory/{i}.txt",
        i=glob_wildcards(os.path.join(checkpoint_output, "{i}.txt")).i,
    )


rule aggregate:
    input:
        aggregate_input,
    output:
        "aggregated.txt",
    shell:
        "echo AGGREGATED > {output}"


rule process:
    input:
        "aggregated.txt",
    output:
        "processed.txt",
    shell:
        "echo PROCESSED > {output}"


rule process2:
    input:
        "processed.txt",
    output:
        "processed2.txt",
    shell:
        "echo PROCESSED2 > {output}"
@hermidalc hermidalc added the bug Something isn't working label Aug 18, 2022
@hermidalc hermidalc changed the title Rules downstream of a checkpoint aggregation rule keep being re-run even when all output files exist and nothing upstream has changed Rules downstream of a checkpoint aggregation rule keep being re-executed even when all output files exist and nothing upstream has changed Aug 18, 2022
@hermidalc
Author

Hi @johanneskoester and Snakemake team - sorry to bug you, do you have any word on this? It's a pretty high-impact bug: Snakemake always re-executes (sometimes time-consuming) rules downstream of a checkpoint even though all files exist and mtimes are consistent. My workaround for now is to comment out rules to avoid this. There is a clear and reproducible example above.

@mkiyer

mkiyer commented Aug 24, 2022

I believe this is connected to the comment I just made on #1694.

I suspect the new rerun behavior is being inappropriately triggered, perhaps by rules with input/output/params that are functions. I am not sure how the behavior is implemented, but I wonder: if one wrote a rule that used the current time of day (or a random number, etc.) as an input, would reruns always occur?
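For illustration, a hypothetical rule along those lines might look like the sketch below (invented names, not from any real workflow; a params function is used for simplicity). If the params rerun trigger compares the recorded value against a freshly evaluated one, such a rule would register as changed on every run:

import time

rule always_changing:
    output:
        "stamp.txt",
    params:
        # Evaluates to a new value every time the DAG is built.
        now=lambda wildcards: time.time(),
    shell:
        "echo {params.now} > {output}"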

@mkiyer

mkiyer commented Aug 24, 2022

I was able to dig into this a bit more:

When I run "--list-params-changes" I always get the same result:
--list-params-changes
Building DAG of jobs...
runs/test_nras_q61/gatk/bqsr_recal.table
runs/test_nras_q61/gatk/aln.markdup.splitncigar.bam
runs/test_nras_q61/gatk/aln.markdup.bam
runs/test_nras_q61/gatk/aln.markdup.bam.bai
runs/test_nras_q61/gatk/markdup_metrics.txt
runs/test_nras_q61/gatk/aln.bqsr.bam
runs/test_nras_q61/gatk/variants.vcf.gz
runs/test_nras_q61/gatk/variants.filtered.vcf.gz

Snakemake always detects params changes in all of these rules, every time.

I think it has to do with setting the Java options, which I am doing dynamically:

def get_java_resource_options(wildcards, resources):
    return f"-Djava.io.tmpdir={resources.tmpdir} -Xms{resources.mem_mb}m -Xmx{resources.mem_mb}m"

And then rules that call this function are always "changing":

rule gatk_split_n_cigar_reads:
    conda:
        "envs/gatk4.yaml"
    input:
        bam=rules.gatk_mark_duplicates_spark.output.bam,
        genome_fasta=get_ref_path('genome_fasta'),
    output:
        bam="runs/{run_id}/gatk/aln.markdup.splitncigar.bam",
    params:
        java_options=get_java_resource_options,
    threads: 3
    resources:
        mem_mb=20000,
    shell:
        "gatk SplitNCigarReads "
        '--java-options "{params.java_options}" '
        "-R {input.genome_fasta} "
        "-I {input.bam} "
        "-O {output.bam}"

@hermidalc
Author

hermidalc commented Aug 24, 2022

Looking at the simpler, complete example in the OP might help with troubleshooting. I'm not sure why it keeps saying Input files updated by another job: <file> for every <file> downstream of the aggregate rule, when it knows that the checkpoint somestep and rule aggregate jobs didn't need re-execution because all of their output files are present and unchanged:

$ snakemake --cores 1
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job         count    min threads    max threads
--------  -------  -------------  -------------
all             1              1              1
process         1              1              1
process2        1              1              1
total           3              1              1

Select jobs to execute...

[Tue Aug 23 21:05:11 2022]
rule process:
    input: aggregated.txt
    output: processed.txt
    jobid: 2
    reason: Input files updated by another job: aggregated.txt
    resources: tmpdir=/tmp

[Tue Aug 23 21:05:11 2022]
Finished job 2.
1 of 3 steps (33%) done
Select jobs to execute...

[Tue Aug 23 21:05:11 2022]
rule process2:
    input: processed.txt
    output: processed2.txt
    jobid: 1
    reason: Input files updated by another job: processed.txt
    resources: tmpdir=/tmp

[Tue Aug 23 21:05:11 2022]
Finished job 1.
2 of 3 steps (67%) done
Select jobs to execute...

[Tue Aug 23 21:05:11 2022]
localrule all:
    input: processed2.txt
    jobid: 0
    reason: Input files updated by another job: processed2.txt
    resources: tmpdir=/tmp

[Tue Aug 23 21:05:11 2022]
Finished job 0.
3 of 3 steps (100%) done
Complete log: .snakemake/log/2022-08-23T210511.230811.snakemake.log

@jdblischak
Contributor

jdblischak commented Aug 24, 2022

Using the provided reprex, I confirmed that this was introduced in 7.8.0:

  • the bug exists for Snakemake 7.12.1, 7.12.0, 7.8.2, 7.8.1, 7.8.0
  • the bug does not exist for Snakemake 7.7.0

I purposely tested on both sides of the 7.8.1/7.8.2 boundary because, according to the changelog, the fix c634b78 from PR #1704 was intended to address this issue of re-running rules downstream of a checkpoint.

@cpauvert
Contributor

cpauvert commented Sep 12, 2022

Hi,
thanks for the reprex; I was running into the same problem in production. I can confirm this bug with 7.8.2 as well as the latest, 7.14.0.

On top of that, I found that Snakemake does not correctly list the files that should be updated, despite indicating that the reason for the rerun is a change in input files.

snakemake -c 1 -n --list-input-changes
snakemake -c 1 -n --list-code-changes
snakemake -c 1 -n --list-params-changes
# All yield only the following preparation message
# Building DAG of jobs...

EDIT: Using --rerun-triggers input alone is enough to trigger the re-execution behavior, while --rerun-triggers mtime and the others do not (see the note after the log below):

$ snakemake -c 1 -n --rerun-triggers input
Building DAG of jobs...
Job stats:
job         count    min threads    max threads
--------  -------  -------------  -------------
all             1              1              1
process         1              1              1
process2        1              1              1
total           3              1              1


[Wed Sep 14 09:47:12 2022]
rule process:
    input: aggregated.txt
    output: processed.txt
    jobid: 2
    reason: Input files updated by another job: aggregated.txt
    resources: tmpdir=/tmp


[Wed Sep 14 09:47:12 2022]
rule process2:
    input: processed.txt
    output: processed2.txt
    jobid: 1
    reason: Input files updated by another job: processed.txt
    resources: tmpdir=/tmp


[Wed Sep 14 09:47:12 2022]
localrule all:
    input: processed2.txt
    jobid: 0
    reason: Input files updated by another job: processed2.txt
    resources: tmpdir=/tmp

Job stats:
job         count    min threads    max threads
--------  -------  -------------  -------------
all             1              1              1
process         1              1              1
process2        1              1              1
total           3              1              1


This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
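That isolation suggests a possible interim workaround, assuming you can live without input-change detection until the bug is fixed: enable every rerun trigger except input, for example:

# Hypothetical stopgap: keep all rerun triggers except the misbehaving
# input trigger; genuine input changes then no longer force a rerun.
snakemake -c 1 --rerun-triggers mtime params software-env code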

@hermidalc
Author

hermidalc commented Sep 21, 2022

Sorry to complain... I hope this gets fixed soon, because it's a pretty major regression. It's currently causing me problems: I have a very large Snakemake workflow in which the rules downstream of a checkpoint aggregation are computationally expensive, and they keep getting re-executed even though they shouldn't be.

@hermidalc
Author

I ran snakemake --list-input-changes --list-code-changes --list-params-changes --list-untracked --list-version-changes and there's nothing, yet Snakemake wants to re-run so many rules. This really is a serious bug in a fundamental part of how Snakemake works.

@twillis209

twillis209 commented Sep 25, 2022

I've had a nasty surprise from this bug, too, unfortunately. I have a very large workflow which I've been running piecemeal with --batch to get around the long DAG generation times. Any ideas for a workaround? At the moment I'm just using os.path.exists to establish which files don't exist.
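For what it's worth, a guess at what that stopgap looks like as a Snakefile fragment (placeholder names throughout; only missing targets are requested, so outputs already on disk are never reconsidered):

import os

SAMPLES = ["a", "b", "c"]  # placeholder sample list

rule all:
    input:
        # Request only the targets that do not exist yet.
        [
            f"results/{s}.txt"
            for s in SAMPLES
            if not os.path.exists(f"results/{s}.txt")
        ],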

@hermidalc
Author

I don't have a workaround for this yet other than commenting things out, which of course doesn't work in every case when you are developing a workflow, and I have a large workflow with three checkpoints! So this bug is really causing me trouble.

@ning-y
Contributor

ning-y commented Sep 27, 2022

@hermidalc The workaround is to avoid using checkpoints for now. For example, checkpoints can be converted into rules that have an additional output containing a list of the files they produced, and downstream rules can use that list of files as input. It is hacky, and not the best solution, but the alternative is to fix the bug ourselves (I've tried and failed; the source code is a little too complex for the amount of free time I have).
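A rough sketch of that workaround, adapted from the OP's example (the manifest file name is invented, and this is deliberately the hacky version):

# The former checkpoint becomes a plain rule that additionally writes
# a manifest listing the files it produced.
rule somestep:
    output:
        dir=directory("my_directory"),
        manifest="somestep_manifest.txt",
    shell:
        """
        mkdir -p {output.dir}
        for i in 1 2 3; do touch {output.dir}/$i.txt; done
        ls {output.dir}/*.txt > {output.manifest}
        """


# Downstream rules depend on the manifest and read the file list at
# run time, so no re-evaluation of the DAG is needed.
rule aggregate:
    input:
        manifest="somestep_manifest.txt",
    output:
        "aggregated.txt",
    shell:
        "xargs cat < {input.manifest} > {output}"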

@jdblischak
Contributor

@hermidalc What about pinning Snakemake to version 7.7.0? Is there some important feature in a more recent release that you need/want to use? If you've got 3 checkpoint rules in a pipeline, updating doesn't seem like it would be worth it to me (I've only got 1 checkpoint rule, but I'm not planning on updating any time soon).

@hermidalc
Author

hermidalc commented Sep 27, 2022

> @hermidalc What about pinning Snakemake to version 7.7.0? Is there some important feature in a more recent release that you need/want to use? If you've got 3 checkpoint rules in a pipeline, updating doesn't seem like it would be worth it to me (I've only got 1 checkpoint rule, but I'm not planning on updating any time soon).

Thanks @jdblischak for identifying the last working version; I was wondering which one it was and hadn't browsed the CHANGELOG yet. Yeah, I will downgrade instead of following @ning-y's recommendation, since downgrading requires no code changes, though @ning-y it is a good workaround.

@jdblischak
Contributor

Actually I realized this isn't such an easy decision. Version 7.8.0 also introduced another feature that I love: Freezing environments to exactly pinned packages. So far I've been lucky in that the pipelines in which I've pinned the conda envs haven't relied on checkpoint rules. It's unfortunate that this new feature was introduced in the same release as this bug. I don't want to have to choose!

@hermidalc
Author

> Actually I realized this isn't such an easy decision. Version 7.8.0 also introduced another feature that I love: Freezing environments to exactly pinned packages. So far I've been lucky in that the pipelines in which I've pinned the conda envs haven't relied on checkpoint rules. It's unfortunate that this new feature was introduced in the same release as this bug. I don't want to have to choose!

I just read through this; I didn't know of that feature... though I don't initially understand: is this feature providing architecture-specific pinning? Can't one already specify package version/build requirements and pins in the regular conda env YAML files that rules use? I am already doing that.

I can understand adding the feature as a way to provide arch-specific envs that are seamless to the user at deployment. For me this isn't an immediate problem, as I'm not generally writing workflows for automatic deployment yet, only publishing easy-to-install software. For every project I've done with conda, I test on both Linux and Mac to make sure everything works with a single env YAML spec, or, if that isn't possible, I provide two YAMLs, one for Linux and one for Mac, and tell people in the README to install the right one for their arch.

@jdblischak
Contributor

It saves the environment using conda's explicit specification file format. This has 2 big implications:

  1. The solver doesn't need to be run, so installation is much faster. It just downloads and installs the exact URLs listed in the file.
  2. It pins the exact versions of everything installed: not just your top-level dependencies, but also all the recursive dependencies. This avoids random failures due to a dependency you didn't even know about breaking your code. As a concrete example that recently happened to me, I had pinned numpy to 1.20, but xarray, a package in my environment that depends on numpy, broke its interface with older numpy versions and totally borked my code. By using conda list --explicit, dependencies like xarray are also pinned, so they can't surprise you months later.
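For reference, a minimal sketch of the underlying conda mechanics (environment and file names are placeholders):

# Export the exact package URLs of the active environment.
conda list --explicit > linux-64.pinned.txt

# Recreate the environment later without running the solver.
conda create --name myenv --file linux-64.pinned.txt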

The explicit specification is architecture specific. Snakemake only uses these pinned envs if they match the current architecture, otherwise it falls back to the standard conda environment files. In other words, if you pin the conda envs for linux-64, and someone executes the pipeline on macOS, there won't be a problem.

Here's a minimal example I created that you can explore:

https://github.com/jdblischak/smk-simple-slurm/tree/main/examples/pinned-conda-envs

@hermidalc
Author

hermidalc commented Oct 8, 2022

@jdblischak @ning-y I've been using 7.7.0 and thank you for the suggestion.

But it still has issues re-running rules downstream of a checkpoint aggregation rule that just ran successfully and whose files are all present, even though 7.7.0 handles the OP example correctly and doesn't re-run things there. I've checked my checkpoint and downstream rule code, and to the best of my knowledge it's designed correctly, so I'm at a loss. The DAG it produces and the files it generates are correct, so I can't figure out why it's trying to re-run these rules when the files are there and were just created.

In the large workflow I'm currently working on, it still wants to re-run rules downstream of a checkpoint. It also gives me zero explanation as to why: it doesn't even show a message that there are any input/params/code changes, and again the --list-* options turn up nothing. So I'm back to having to comment things out to avoid re-running expensive downstream rules.

I'm really against getting rid of my checkpoints and switching to the output-file-list rule approach. I know this sounds harsh, but Snakemake is at major version 7, so it really shouldn't have such an important and fundamental issue with how it works; it's very frustrating. I am a Python developer, but honestly it would take me a long time right now to get familiar with the codebase and fix it myself. Something this important should be fixed by the core devs, and it should be made a priority.

@hermidalc
Author

hermidalc commented Oct 8, 2022

@cclienti and @timtroendle, since you developed the post-7.7.0 functionality: I'm sorry to say there is a major bug in it, as it doesn't seem to work properly with checkpoints (or maybe a later commit broke this). Do you have any idea what's broken?

@hermidalc
Author

> @jdblischak @ning-y I've been using 7.7.0 and thank you for the suggestion.
>
> But it still has issues re-running rules downstream of a checkpoint aggregation rule that just ran successfully and whose files are all present, even though 7.7.0 handles the OP example correctly and doesn't re-run things there. I've checked my checkpoint and downstream rule code, and to the best of my knowledge it's designed correctly, so I'm at a loss. The DAG it produces and the files it generates are correct, so I can't figure out why it's trying to re-run these rules when the files are there and were just created.
>
> In the large workflow I'm currently working on, it still wants to re-run rules downstream of a checkpoint. It also gives me zero explanation as to why: it doesn't even show a message that there are any input/params/code changes, and again the --list-* options turn up nothing. So I'm back to having to comment things out to avoid re-running expensive downstream rules.
>
> I'm really against getting rid of my checkpoints and switching to the output-file-list rule approach. I know this sounds harsh, but Snakemake is at major version 7, so it really shouldn't have such an important and fundamental issue with how it works; it's very frustrating. I am a Python developer, but honestly it would take me a long time right now to get familiar with the codebase and fix it myself. Something this important should be fixed by the core devs, and it should be made a priority.

Sorry, my bad @jdblischak @ning-y: with 7.7.0 I figured out what was going on. Timestamps... timestamps... timestamps! Since I've been developing this large workflow and only re-running parts of it in the process, the timestamps for steps in the DAG weren't in the order they would be if I had run the entire workflow from scratch.

But with post-7.7.0 versions the OP issue still exists, regardless of timestamps or anything else, so it's a major issue. At least when I publish my software and codebase for others to use with 7.7.0, things will be OK.

@hermidalc
Author

I just now see the very useful --touch option. Sorry, Snakemake is very feature-rich and has so many options that my eyes miss things when running --help.
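For anyone else hitting the timestamp problem, a minimal sketch of that option (previewing with -n first is a sensible precaution):

# Mark existing outputs as up to date by touching them in DAG order,
# without re-running any jobs.
snakemake --touch --cores 1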

@ning-y
Contributor

ning-y commented Oct 8, 2022

@hermidalc I'm glad you were able to find at least a partial solution.

While I agree that this bug is particularly frustrating, it may not be reasonable to expect a fix so soon. As far as I know, snakemake is an open-source project maintained by other scientists, researchers, and teachers; they have many formal responsibilities to be busy with, very few of which (if any) depend on maintaining this software. As much as you or I lack time to learn the codebase, I suspect they too lack the time to tackle this bug. So it goes, I guess.

@hermidalc
Author

hermidalc commented Oct 8, 2022

> @hermidalc I'm glad you were able to find at least a partial solution.
>
> While I agree that this bug is particularly frustrating, it may not be reasonable to expect a fix so soon. As far as I know, snakemake is an open-source project maintained by other scientists, researchers, and teachers; they have many formal responsibilities to be busy with, very few of which (if any) depend on maintaining this software. As much as you or I lack time to learn the codebase, I suspect they too lack the time to tackle this bug. So it goes, I guess.

All open-source projects have a group of core devs who maintain the core and control which features get developed and included and which pull requests get merged... i.e., they basically drive things. It doesn't truly work the way you are stating, with the entire project completely maintained and developed only by the wider community; otherwise the design, vision, and features would become a hot mess. And this is an issue with core Snakemake, not with a specific, more or less self-contained additional feature.

@hermidalc
Author

hermidalc commented Oct 8, 2022

And @ning-y, sorry for putting you in the firing line, but look at https://snakemake.readthedocs.io/en/stable/project_info/history.html: there are version updates quite often, and I would argue the core devs (not random people in the wider community with pull requests) are spending cycles and making commits on things that aren't regressions as fundamentally broken as this one. Regressions in core functionality are a big deal in software development. The need for checkpoints is actually quite common in our work (I'm a computational biologist/bioinformatician too), and I would argue they are as much a part of core Snakemake as rules are.

@ning-y
Contributor

ning-y commented Oct 8, 2022

@hermidalc My view is that most scientific software operates in a system that does not reward maintenance, especially for purely academic projects like Snakemake (cf. RStudio). Furthermore, the developers of such software are often swamped with larger responsibilities, such as research or teaching, which carry formal and harsher penalties if ignored. Faced with these pressures, I think it is reasonable that many bugs are left unfixed, or that easier issues are repeatedly addressed over complex ones.

Anyways, I think we are getting off-topic. We can agree to disagree, but again I feel your frustration. All my projects depend on Snakemake.

To steer us back, have you tried using 7.7.0 but specifying all the available rerun triggers? I think you wouldn't have to give up checkpoints then, and you would still get the benefit of updated code rerun triggers, etc.

@hermidalc
Author

hermidalc commented Oct 8, 2022

> @hermidalc My view is that most scientific software operates in a system that does not reward maintenance, especially for purely academic projects like Snakemake (cf. RStudio). Furthermore, the developers of such software are often swamped with larger responsibilities, such as research or teaching, which carry formal and harsher penalties if ignored. Faced with these pressures, I think it is reasonable that many bugs are left unfixed, or that easier issues are repeatedly addressed over complex ones.
>
> Anyways, I think we are getting off-topic. We can agree to disagree, but again I feel your frustration. All my projects depend on Snakemake.
>
> To steer us back, have you tried using 7.7.0 but specifying all the available rerun triggers? I think you wouldn't have to give up checkpoints then, and you would still get the benefit of updated code rerun triggers, etc.

Yes, thank you. 7.7.0 does work properly with checkpoints; it was the touch/timestamp issue, and my earlier comment on it can be disregarded. It was my bad: I've been developing a larger workflow, so when working on and testing parts of it, timestamps easily get messed up.

@hermidalc hermidalc changed the title Rules downstream of a checkpoint aggregation rule keep being re-executed even when all output files exist and nothing upstream has changed Regression after 7.7.0 until 7.15.x: rules downstream of a checkpoint aggregation rule keep being re-executed even when all output files exist and nothing upstream has changed Oct 11, 2022
@hermidalc hermidalc changed the title Regression after 7.7.0 until 7.15.x: rules downstream of a checkpoint aggregation rule keep being re-executed even when all output files exist and nothing upstream has changed Regression after 7.7.0 until 7.15.X: rules downstream of a checkpoint aggregation rule keep being re-executed even when all output files exist and nothing upstream has changed Oct 11, 2022
@johanneskoester
Contributor

Hey folks. Sorry for not seeing this issue before, and thanks a lot for your understanding. I get so many GitHub notifications that I cannot possibly see through all of them. What you can always do instead is ping me via Discord if an issue is particularly pressing. I'll have a look at the minimal example ASAP. In general, if you want to speed things up or get more attention, one option is always to create a PR with the minimal example; that way I already have a skeleton for debugging.

@johanneskoester
Contributor

But in this case, this is not needed as I am looking into this already.

@johanneskoester
Contributor

Ok, problem identified. I have to catch a train now, but I'll likely be able to fix it tomorrow.

@hermidalc
Author

hermidalc commented Oct 13, 2022

Thank you @johanneskoester! I wasn't in a big rush because I could downgrade to 7.7.0. With the latest Snakemake I noticed that checkpoint handling and DAG resolution are much faster compared to 7.7.0, likely due to some performance-improvement commits, and in my current project I'm processing almost 100k files from a checkpoint, so I will be happy to upgrade once the new version is out on Bioconda.

@johanneskoester
Contributor

Thanks for the minimal example and the exploration work all of you already did. This is now fixed in PR #1907!

johanneskoester added a commit that referenced this issue Oct 14, 2022
fixes issue #1818

@Aiswarya-prasad

This seems to be an issue on 7.8.5 as well, and it remained when I tried 6.* (which I'd prefer not to use, given other options that 7.* offers).
What is the current status of this bug / the suggested workaround?
In summary, the output of my aggregate rule (called ) triggers the re-run of other downstream rules even though the outputs of those rules are newer than it, which, as I understand it, is what this issue describes. Am I right that 7.7.0 and earlier versions are not supposed to suffer from this bug?

@jdblischak
Contributor

@Aiswarya-prasad This issue describes a bug that was introduced in Snakemake 7.8.0 and fixed in 7.16.0, so it is expected that the bug exists in 7.8.5. Your options are to either upgrade to 7.16.0+ (recommended) or downgrade to 7.7.0.
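For example, with a conda-based install (adjust to your package manager of choice):

# Upgrade to a release containing the fix...
conda install -c conda-forge -c bioconda 'snakemake>=7.16.0'

# ...or pin the last release before the regression.
conda install -c conda-forge -c bioconda 'snakemake=7.7.0'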
