
Attempt to create files a second time in a single snakemake run #884

Closed
a-ludi opened this issue Feb 19, 2021 · 1 comment
Labels
bug Something isn't working

Comments


a-ludi commented Feb 19, 2021

Snakemake version
Snakemake 5.32.1 powered by Python 3.9.1 on two different machines:

  1. Laptop: Arch-based Linux
  2. Cluster: CentOS Linux release 7.4.1708 (Core)

Describe the bug
The bug appears only on my laptop and not on our cluster, even though the environments are pretty similar.

The problem is that after running the local rule make_merge_config, Snakemake suddenly wants to re-run the already finished checkpoint rule collect. As expected, Snakemake then crashes with a ProtectedOutputException because the output of collect is protected. But it should not run collect at that point because the output workdir/pile-ups.db is already up-to-date.

After lifting the protection from all outputs (chmod -R +w workdir) and restarting the workflow, it resumes at collect and finishes as expected.

Logs

[Fri Feb 19 17:00:17 2021]
checkpoint collect:
    input: workdir/.assembly-test.dentist-self-H.anno, workdir/.assembly-test.dentist-self-H.data, workdir/.assembly-test.tan-H.anno, workdir/.assembly-test.tan-H.data, workdir/.assembly-test.dentist-reads-H.anno, workdir/.assembly-test.dentist-reads-H.data, dentist.json, workdir/assembly-test.dam, workdir/.assembly-test.bps, workdir/.assembly-test.hdr, workdir/.assembly-test.idx, workdir/reads.db, workdir/.reads.bps, workdir/.reads.idx, workdir/assembly-test.reads.las
    output: workdir/pile-ups.db
    log: logs/collect.log
    jobid: 5
    reason: Missing output files: workdir/pile-ups.db; Input files updated by another job: workdir/.assembly-test.dentist-self-H.data, workdir/.assembly-test.dentist-reads-H.data, workdir/.assembly-test.tan-H.anno, workdir/.assembly-test.dentist-self-H.anno, workdir/assembly-test.reads.las, workdir/.assembly-test.dentist-reads-H.anno, workdir/.assembly-test.tan-H.data
    threads: 4
Downstream jobs will be updated after completion.

Activating singularity image /home/alu/projects/pb-gaps/src/dentist/dentist_dirty.sif
Write-protecting output file workdir/pile-ups.db.
Updating job make_merge_config.
Updating job merge_insertions.
Updating job process.
[Fri Feb 19 17:00:43 2021]
Finished job 5.
53 of 106 steps (50%) done

[Fri Feb 19 17:00:43 2021]
rule process:
    input: workdir/.assembly-test.dentist-self-H.anno, workdir/.assembly-test.dentist-self-H.data, workdir/.assembly-test.tan-H.anno, workdir/.assembly-test.tan-H.data, workdir/.assembly-test.dentist-reads-H.anno, workdir/.assembly-test.dentist-reads-H.data, dentist.json, workdir/assembly-test.dam, workdir/.assembly-test.bps, workdir/.assembly-test.hdr, workdir/.assembly-test.idx, workdir/reads.db, workdir/.reads.bps, workdir/.reads.idx, workdir/pile-ups.db
    output: workdir/insertions/batch.1.db
    log: logs/process.1.log
    jobid: 151
    reason: Missing output files: workdir/insertions/batch.1.db
    wildcards: batch_id=1
    threads: 4

Activating singularity image /home/alu/projects/pb-gaps/src/dentist/dentist_dirty.sif
[Fri Feb 19 17:01:01 2021]
Finished job 151.
54 of 106 steps (51%) done

[Fri Feb 19 17:01:01 2021]
rule process:
    input: workdir/.assembly-test.dentist-self-H.anno, workdir/.assembly-test.dentist-self-H.data, workdir/.assembly-test.tan-H.anno, workdir/.assembly-test.tan-H.data, workdir/.assembly-test.dentist-reads-H.anno, workdir/.assembly-test.dentist-reads-H.data, dentist.json, workdir/assembly-test.dam, workdir/.assembly-test.bps, workdir/.assembly-test.hdr, workdir/.assembly-test.idx, workdir/reads.db, workdir/.reads.bps, workdir/.reads.idx, workdir/pile-ups.db
    output: workdir/insertions/batch.2.db
    log: logs/process.2.log
    jobid: 152
    reason: Missing output files: workdir/insertions/batch.2.db
    wildcards: batch_id=2
    threads: 4

Activating singularity image /home/alu/projects/pb-gaps/src/dentist/dentist_dirty.sif
[Fri Feb 19 17:01:21 2021]
Finished job 152.
55 of 106 steps (52%) done

[Fri Feb 19 17:01:21 2021]
rule process:
    input: workdir/.assembly-test.dentist-self-H.anno, workdir/.assembly-test.dentist-self-H.data, workdir/.assembly-test.tan-H.anno, workdir/.assembly-test.tan-H.data, workdir/.assembly-test.dentist-reads-H.anno, workdir/.assembly-test.dentist-reads-H.data, dentist.json, workdir/assembly-test.dam, workdir/.assembly-test.bps, workdir/.assembly-test.hdr, workdir/.assembly-test.idx, workdir/reads.db, workdir/.reads.bps, workdir/.reads.idx, workdir/pile-ups.db
    output: workdir/insertions/batch.0.db
    log: logs/process.0.log
    jobid: 30
    reason: Missing output files: workdir/insertions/batch.0.db
    wildcards: batch_id=0
    threads: 4

Activating singularity image /home/alu/projects/pb-gaps/src/dentist/dentist_dirty.sif
[Fri Feb 19 17:01:43 2021]
Finished job 30.
56 of 106 steps (53%) done

[Fri Feb 19 17:01:43 2021]
localrule make_merge_config:
    input: workdir/pile-ups.db, workdir/insertions/batch.0.db, workdir/insertions/batch.1.db, workdir/insertions/batch.2.db
    output: workdir/dentist.merge.json
    jobid: 4
    reason: Missing output files: workdir/dentist.merge.json; Input files updated by another job: workdir/insertions/batch.1.db, workdir/insertions/batch.2.db, workdir/insertions/batch.0.db

Job counts:
    count   jobs
    1   collect
    1   make_merge_config
    3   process
    2   propagate_mask_back_to_reference
    56  propagate_mask_back_to_reference_block
    4   propagate_mask_batch
    54  propagate_mask_to_reads_block
    121
ProtectedOutputException in line 1240 of /home/alu/projects/pb-gaps/src/dentist-example/Snakefile:
Write-protected output files for rule collect:
workdir/pile-ups.db
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 136, in run_jobs
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 441, in run
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 230, in _run
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 155, in _run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/alu/projects/pb-gaps/src/dentist-example/.snakemake/log/2021-02-19T165642.255438.snakemake.log

Minimal example
It is very hard to reduce the 1500-line Snakefile to a minimal example, but I can share a small example of my complete workflow (431M). Please follow the instructions in the included README.md. Reaching the error takes about 5 minutes on my laptop with an Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz and requires a few GB of memory. Sorry.

Additional context
This might be related to #823.

a-ludi added the bug label Feb 19, 2021

a-ludi commented Jun 18, 2021

I finally tracked down the bug! In short, I forgot to correctly specify input files coming from a checkpoint, which left the DAG in an undefined state. Here is a minimal example to illustrate the issue:

# Snakefile
outdir = "out"
count_file = f"{outdir}/count"
part_file = f"{outdir}/part-{{n}}.txt"
merged_file = f"{outdir}/all.txt"


def get_count():
    try:
        with open(count_file) as cf:
            return int(cf.read())
    except FileNotFoundError:
        # this makes sure the DAG is connected before
        # `make_count` has been executed
        return 1


rule __default__:
    input: merged_file, count_file


checkpoint make_count:
    output: count_file
    run:
        from random import randint

        with open(str(output), "w") as cf:
            print(randint(2, 10), file=cf)


rule make_part:
    output: touch(part_file)


rule merge_parts:
    input:
        # NOTE missing `count_file` here
        lambda _: [part_file.format(n=n) for n in range(get_count())]
    output: protected(merged_file)
    shell: "cat {input} > {output}"

Produce the error with these commands:

snakemake -j1
# only required in this example; the exact trigger in the original issue is unclear
touch out/part-0.txt
snakemake -j1

This is the closest I could get. Here is the explanation of what happens:

  1. The first snakemake run creates the protected output merged_file from incomplete input, because the DAG, and consequently get_count(), is not updated before the rule executes. You will only see a single part file, out/part-0.txt.
  2. Touching the part file triggers a re-evaluation of merge_parts. This is only required in this example; the exact trigger in the original issue is unclear to me.
  3. Snakemake executes merge_parts once more, this time with all the parts, and we get a ProtectedOutputException.

Since I never called touch or anything similar in the original workflow, Snakemake must have executed the second invocation of the offending rule (merge_parts in the example) with the corrupted DAG from the first invocation, and the corruption somehow made it believe the rule was already done. That is my hypothesis, anyway.
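
For completeness, the fix is to request the checkpoint output through the checkpoints object inside an input function, so Snakemake re-evaluates the DAG only after make_count has finished. This is a sketch of how merge_parts could be corrected, using the names from the example above rather than my actual Snakefile:

# corrected merge_parts (sketch): the input function asks the checkpoints
# object for the output of make_count, which makes the dependency explicit
# and delays DAG evaluation until the checkpoint has actually run
def merge_parts_input(wildcards):
    count_output = checkpoints.make_count.get().output[0]
    with open(count_output) as cf:
        count = int(cf.read())
    return [part_file.format(n=n) for n in range(count)]


rule merge_parts:
    input: merge_parts_input
    output: protected(merged_file)
    shell: "cat {input} > {output}"

With this version, get_count() and its FileNotFoundError fallback are no longer needed, because the input function is only evaluated once the count file exists.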

a-ludi closed this as completed Jun 18, 2021
a-ludi added a commit to a-ludi/dentist that referenced this issue Jun 22, 2021
- `propagate_mask_back_to_reference` did not declare its dependency on
  the reads DB which caused the Snakemake DAG to get mangled
- the issue was reported and resolved at
  snakemake/snakemake#884