Tibanna Group Jobs Upload Files in Apparently Arbitrary Order #300

Open
nhartwic opened this issue Sep 20, 2020 · 3 comments
Comments

nhartwic commented Sep 20, 2020

I have a hybrid assembly workflow that mostly works fine. However, I've noticed that when I try to continue the workflow after a successful partial run, Tibanna wants to rerun the last job it already finished. For reference, these are the contents of the default remote prefix directory:

2020-08-13 17:38:18          0
2020-09-20 05:50:08 3683255812 Pennycress_1326_BWA_002_S2_R1_001.fastq.gz
2020-09-20 05:51:24 3720172921 Pennycress_1326_BWA_002_S2_R2_001.fastq.gz
2020-08-18 04:18:40        553 busco_summary_penny_1326.flye.racon3.txt
2020-08-18 04:18:49     139349 busco_table_penny_1326.flye.racon3.tsv
2020-08-17 23:06:17  430122007 penny_1326.flye.fasta
2020-08-17 23:06:19  809289877 penny_1326.flye.gfa
2020-08-18 04:18:37  420515794 penny_1326.flye.racon1.fasta
2020-08-18 04:18:44  419861456 penny_1326.flye.racon2.fasta
2020-08-18 04:18:40  419636936 penny_1326.flye.racon3.fasta
2020-08-16 23:54:05 7354043622 pennycress_1326.S003B2.S004B2.all.dedup.fastq.gz
2020-08-13 18:24:14 9676546743 pennycress_1326.S003B2.S004B2.all.fastq.gz

Note that "penny_1326.flye.racon3.fasta" exists. It was produced at the same time both the other racon runs were completed. We recently completed short read sequencing and are now polishing the assembly using the short reads (and a tool called Pilon). This operation accepts "penny_1326.flye.racon3.fasta" as input and improves the sequence quality by aligning short reads and editing the assembly as needed. This all seems normal but when I go to run the workflow, snakemake/tibanna wants to remake "penny_1326.flye.racon3.fasta". See the dry run here...

Building DAG of jobs...
Job counts:
        count   jobs
        1       all
        1       busco
        1       pilon
        1       racon
        4
[Sun Sep 20 16:40:08 2020]

group job polish (jobs in lexicogr. order):

    [Sun Sep 20 16:40:08 2020]
    rule busco:
        input: salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.pilon1.fasta
        output: salk-tm-dev/pennycress/1326/busco_summary_penny_1326.flye.racon3.pilon1.txt, salk-tm-dev/pennycress/1326/busco_table_penny_1326.flye.racon3.pilon1.tsv
        jobid: 2
        wildcards: base=penny_1326.flye.racon3.pilon1
        threads: 16
        resources: disk_mb=1000000, mem_mb=60000


        busco  -i salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.pilon1.fasta -c 16 -o penny_1326.flye.racon3.pilon1 -l eudicots_odb10  --mode geno
        mv penny_1326.flye.racon3.pilon1/run_eudicots_odb10/short_summary.txt salk-tm-dev/pennycress/1326/busco_summary_penny_1326.flye.racon3.pilon1.txt
        mv penny_1326.flye.racon3.pilon1/run_eudicots_odb10/full_table.tsv salk-tm-dev/pennycress/1326/busco_table_penny_1326.flye.racon3.pilon1.tsv


    [Sun Sep 20 16:40:08 2020]
    rule pilon:
        input: salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.fasta, salk-tm-dev/pennycress/1326/Pennycress_1326_BWA_002_S2_R1_001.fastq.gz, salk-tm-dev/pennycress/1326/Pennycress_1326_BWA_002_S2_R2_001.fastq.gz
        output: salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.pilon1.fasta
        jobid: 1
        wildcards: base=penny_1326.flye.racon3, n=1
        threads: 16
        resources: disk_mb=1000000, mem_mb=60000


        minimap2  -ax sr -t 16 salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.fasta salk-tm-dev/pennycress/1326/Pennycress_1326_BWA_002_S2_R1_001.fastq.gz salk-tm-dev/pennycress/1326/Pennycress_1326_BWA_002_S2_R2_001.fastq.gz | samtools sort > salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.pilon1.fasta.mm2.sr.bam
        samtools index salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.pilon1.fasta.mm2.sr.bam
        pilon  -Xmx54000M --genome salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.fasta --bam salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.pilon1.fasta.mm2.sr.bam --output penny_1326.flye.racon3.pilon1 --threads 16


    [Sun Sep 20 16:40:08 2020]
    rule racon:
        input: salk-tm-dev/pennycress/1326/penny_1326.flye.racon2.fasta, salk-tm-dev/pennycress/1326/pennycress_1326.S003B2.S004B2.all.dedup.fastq.gz
        output: salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.fasta
        jobid: 3
        wildcards: base=penny_1326.flye, n=3
        threads: 16
        resources: disk_mb=1000000, mem_mb=60000


        minimap2  -x map-ont -t 16 salk-tm-dev/pennycress/1326/penny_1326.flye.racon2.fasta salk-tm-dev/pennycress/1326/pennycress_1326.S003B2.S004B2.all.dedup.fastq.gz > salk-tm-dev/pennycress/1326/penny_1326.flye.racon2.fasta.mm2.paf
        racon  -t 16 salk-tm-dev/pennycress/1326/pennycress_1326.S003B2.S004B2.all.dedup.fastq.gz salk-tm-dev/pennycress/1326/penny_1326.flye.racon2.fasta.mm2.paf salk-tm-dev/pennycress/1326/penny_1326.flye.racon2.fasta > salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.fasta


[Sun Sep 20 16:40:08 2020]
localrule all:
    input: salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.pilon1.fasta, salk-tm-dev/pennycress/1326/busco_summary_penny_1326.flye.racon3.pilon1.txt, salk-tm-dev/pennycress/1326/busco_table_penny_1326.flye.racon3.pilon1.tsv
    jobid: 0
    resources: disk_mb=1000000

Job counts:
        count   jobs
        1       all
        1       busco
        1       pilon
        1       racon
        4
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

Note that the last rule in the dry run is trying to produce "penny_1326.flye.racon3.fasta" even though it already exists. Any idea why this is happening? I've never seen this behavior when not using Tibanna as the backend.

nhartwic (Author) commented:

Upon further investigation, this seems to be a result of grouped jobs not being timestamped correctly when uploading to AWS. Tibanna wants to rerun the racon rule that produced "penny_1326.flye.racon3.fasta" because "penny_1326.flye.racon3.fasta" happened to get uploaded to AWS before "penny_1326.flye.racon2.fasta", and "penny_1326.flye.racon2.fasta" is an input to the rule that produces "penny_1326.flye.racon3.fasta". This was only possible because I executed all racon jobs as a single group in my earlier run, which causes all three racon outputs to be uploaded in an apparently arbitrary order.
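
To illustrate why upload order matters, here is a minimal sketch (not Snakemake's actual code): an output is treated as outdated when any of its inputs is newer, and on S3 the only available "mtime" is the object's LastModified, i.e. its upload time.

    from datetime import datetime, timezone

    def is_stale(output_mtime, input_mtimes):
        """Illustrative helper: an output is outdated if any input is newer."""
        return any(m > output_mtime for m in input_mtimes)

    # From the S3 listing above: racon2.fasta (an input) was uploaded at
    # 04:18:44, racon3.fasta (the output it feeds) at 04:18:40.
    racon2_uploaded = datetime(2020, 8, 18, 4, 18, 44, tzinfo=timezone.utc)
    racon3_uploaded = datetime(2020, 8, 18, 4, 18, 40, tzinfo=timezone.utc)
    print(is_stale(racon3_uploaded, [racon2_uploaded]))  # True -> rule racon gets rescheduled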

The user-level fix is to edit the timestamps in AWS and clear out the ".snakemake" cache. The long-term solution is to make Tibanna upload output files in the order defined by the DAG representing the group job being executed. Not being very familiar with the source code of Snakemake or Tibanna, I'm not certain this is an easy or even possible change to make.
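
For anyone hitting the same problem, a rough sketch of the user-level fix with boto3 (assumed available; the bucket and keys are the ones from this issue and should be adapted). Copying each object onto itself with replaced metadata refreshes its LastModified timestamp; doing this in dependency order, upstream first, makes every output newer than its inputs again.

    import boto3

    s3 = boto3.client("s3")
    bucket = "salk-tm-dev"
    # Re-touch the group's outputs in dependency order so each output ends
    # up with a LastModified newer than its inputs.
    keys_in_dag_order = [
        "pennycress/1326/penny_1326.flye.racon1.fasta",
        "pennycress/1326/penny_1326.flye.racon2.fasta",
        "pennycress/1326/penny_1326.flye.racon3.fasta",
    ]
    for key in keys_in_dag_order:
        # An in-place copy with MetadataDirective="REPLACE" updates
        # LastModified (note: copy_object only handles objects up to 5 GB).
        s3.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            MetadataDirective="REPLACE",
            Metadata={"retouched": "true"},  # a metadata change makes the self-copy valid
        )

After that, clear the local ".snakemake" cache as mentioned above so the stale records are dropped.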

@nhartwic nhartwic changed the title Tibanna sometimes thinks files need to be remade Tibanna Group Jobs Upload Files Apparently Arbitrarily Sep 20, 2020
@nhartwic nhartwic changed the title Tibanna Group Jobs Upload Files Apparently Arbitrarily Tibanna Group Jobs Upload Files in Apparently Arbitrary Order Sep 20, 2020
SooLee (Member) commented Sep 21, 2020

@nhartwic Thank you for reporting this. The best fix would be to preserve the timestamps of the output files, but as far as I know AWS S3 does not provide that option. The output files from a given instance ('group') can be sorted before being uploaded to S3, but that still would not guarantee all the output files are uploaded in the correct order if there are multiple instances running concurrently (parallel independent group jobs). I'll see if I can at least get the files ordered within a group.
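
For illustration, a rough sketch of what ordering the outputs within a group might look like (not Tibanna's actual code; the function and variable names are made up, and graphlib requires Python 3.9+):

    from graphlib import TopologicalSorter

    def upload_group_outputs(outputs_by_job, job_dependencies, upload):
        """Upload a group job's outputs so that upstream outputs go first.

        outputs_by_job:   {job_name: [S3 keys produced by that job]}
        job_dependencies: {job_name: set of jobs it depends on within the group}
        upload:           callable that uploads a single key
        """
        # A topological order guarantees each job's outputs are uploaded after
        # the outputs of the jobs it depends on, so LastModified timestamps
        # end up in DAG order.
        for job in TopologicalSorter(job_dependencies).static_order():
            for key in outputs_by_job.get(job, []):
                upload(key)

    # The racon chain from this issue, with upload stubbed out as print():
    deps = {"racon1": set(), "racon2": {"racon1"}, "racon3": {"racon2"}}
    outs = {
        "racon1": ["penny_1326.flye.racon1.fasta"],
        "racon2": ["penny_1326.flye.racon2.fasta"],
        "racon3": ["penny_1326.flye.racon3.fasta"],
    }
    upload_group_outputs(outs, deps, upload=print)  # racon1, then racon2, then racon3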

nhartwic (Author) commented:

Sounds good. As long as the output files for each group are ordered correctly, that is probably sufficient: any dependencies of the group must have been uploaded prior to the group's execution, and any downstream products must be uploaded after, just due to the way groups get spawned. The only potential errors would arise if multiple partial runs were performed in which the DAG topology meaningfully changes, but I'd argue that in such cases the rules themselves are the problem and Snakemake in general can't resolve the issue. As an example, imagine run 1 has structure "rule A -> rule B" and run 2 has structure "rule B -> rule A -> rule C". This should probably never happen and ought to be avoided by workflow writers.
