Tibanna Group Jobs Upload Files in Apparently Arbitrary Order #300

Open
nhartwic opened this issue Sep 20, 2020 · 3 comments
Comments

nhartwic commented Sep 20, 2020

I have a hybrid assembly workflow that mostly works fine. However, I've noticed that when I try to continue the workflow after a successful partial run, Tibanna wants to rerun the last job it already finished. For reference, these are the contents of the default remote prefix directory:

2020-08-13 17:38:18          0
2020-09-20 05:50:08 3683255812 Pennycress_1326_BWA_002_S2_R1_001.fastq.gz
2020-09-20 05:51:24 3720172921 Pennycress_1326_BWA_002_S2_R2_001.fastq.gz
2020-08-18 04:18:40        553 busco_summary_penny_1326.flye.racon3.txt
2020-08-18 04:18:49     139349 busco_table_penny_1326.flye.racon3.tsv
2020-08-17 23:06:17  430122007 penny_1326.flye.fasta
2020-08-17 23:06:19  809289877 penny_1326.flye.gfa
2020-08-18 04:18:37  420515794 penny_1326.flye.racon1.fasta
2020-08-18 04:18:44  419861456 penny_1326.flye.racon2.fasta
2020-08-18 04:18:40  419636936 penny_1326.flye.racon3.fasta
2020-08-16 23:54:05 7354043622 pennycress_1326.S003B2.S004B2.all.dedup.fastq.gz
2020-08-13 18:24:14 9676546743 pennycress_1326.S003B2.S004B2.all.fastq.gz

Note that "penny_1326.flye.racon3.fasta" exists. It was produced at the same time both the other racon runs were completed. We recently completed short read sequencing and are now polishing the assembly using the short reads (and a tool called Pilon). This operation accepts "penny_1326.flye.racon3.fasta" as input and improves the sequence quality by aligning short reads and editing the assembly as needed. This all seems normal but when I go to run the workflow, snakemake/tibanna wants to remake "penny_1326.flye.racon3.fasta". See the dry run here...

Building DAG of jobs...
Job counts:
        count   jobs
        1       all
        1       busco
        1       pilon
        1       racon
        4
[Sun Sep 20 16:40:08 2020]

group job polish (jobs in lexicogr. order):

    [Sun Sep 20 16:40:08 2020]
    rule busco:
        input: salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.pilon1.fasta
        output: salk-tm-dev/pennycress/1326/busco_summary_penny_1326.flye.racon3.pilon1.txt, salk-tm-dev/pennycress/1326/busco_table_penny_1326.flye.racon3.pilon1.tsv
        jobid: 2
        wildcards: base=penny_1326.flye.racon3.pilon1
        threads: 16
        resources: disk_mb=1000000, mem_mb=60000


        busco  -i salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.pilon1.fasta -c 16 -o penny_1326.flye.racon3.pilon1 -l eudicots_odb10  --mode geno
        mv penny_1326.flye.racon3.pilon1/run_eudicots_odb10/short_summary.txt salk-tm-dev/pennycress/1326/busco_summary_penny_1326.flye.racon3.pilon1.txt
        mv penny_1326.flye.racon3.pilon1/run_eudicots_odb10/full_table.tsv salk-tm-dev/pennycress/1326/busco_table_penny_1326.flye.racon3.pilon1.tsv


    [Sun Sep 20 16:40:08 2020]
    rule pilon:
        input: salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.fasta, salk-tm-dev/pennycress/1326/Pennycress_1326_BWA_002_S2_R1_001.fastq.gz, salk-tm-dev/pennycress/1326/Pennycress_1326_BWA_002_S2_R2_001.fastq.gz
        output: salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.pilon1.fasta
        jobid: 1
        wildcards: base=penny_1326.flye.racon3, n=1
        threads: 16
        resources: disk_mb=1000000, mem_mb=60000


        minimap2  -ax sr -t 16 salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.fasta salk-tm-dev/pennycress/1326/Pennycress_1326_BWA_002_S2_R1_001.fastq.gz salk-tm-dev/pennycress/1326/Pennycress_1326_BWA_002_S2_R2_001.fastq.gz | samtools sort > salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.pilon1.fasta.mm2.sr.bam
        samtools index salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.pilon1.fasta.mm2.sr.bam
        pilon  -Xmx54000M --genome salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.fasta --bam salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.pilon1.fasta.mm2.sr.bam --output penny_1326.flye.racon3.pilon1 --threads 16


    [Sun Sep 20 16:40:08 2020]
    rule racon:
        input: salk-tm-dev/pennycress/1326/penny_1326.flye.racon2.fasta, salk-tm-dev/pennycress/1326/pennycress_1326.S003B2.S004B2.all.dedup.fastq.gz
        output: salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.fasta
        jobid: 3
        wildcards: base=penny_1326.flye, n=3
        threads: 16
        resources: disk_mb=1000000, mem_mb=60000


        minimap2  -x map-ont -t 16 salk-tm-dev/pennycress/1326/penny_1326.flye.racon2.fasta salk-tm-dev/pennycress/1326/pennycress_1326.S003B2.S004B2.all.dedup.fastq.gz > salk-tm-dev/pennycress/1326/penny_1326.flye.racon2.fasta.mm2.paf
        racon  -t 16 salk-tm-dev/pennycress/1326/pennycress_1326.S003B2.S004B2.all.dedup.fastq.gz salk-tm-dev/pennycress/1326/penny_1326.flye.racon2.fasta.mm2.paf salk-tm-dev/pennycress/1326/penny_1326.flye.racon2.fasta > salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.fasta


[Sun Sep 20 16:40:08 2020]
localrule all:
    input: salk-tm-dev/pennycress/1326/penny_1326.flye.racon3.pilon1.fasta, salk-tm-dev/pennycress/1326/busco_summary_penny_1326.flye.racon3.pilon1.txt, salk-tm-dev/pennycress/1326/busco_table_penny_1326.flye.racon3.pilon1.tsv
    jobid: 0
    resources: disk_mb=1000000

Job counts:
        count   jobs
        1       all
        1       busco
        1       pilon
        1       racon
        4
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

Note that the last rule in the dry run is trying to produce "penny_1326.flye.racon3.fasta" even though it already exists. Any idea why this is happening? I've never seen this behavior when not using Tibanna as the backend.

nhartwic (Author) commented:

Upon further investigation, this seems to be a result of grouped jobs not being timestamped correctly when uploading to AWS. Tibanna wants to rerun the racon rule that produced "penny_1326.flye.racon3.fasta" because "penny_1326.flye.racon3.fasta" happened to get uploaded to AWS before "penny_1326.flye.racon2.fasta", and "penny_1326.flye.racon2.fasta" is an input to the rule that produces "penny_1326.flye.racon3.fasta". This was only possible because I executed all racon jobs as a single group in my earlier run, which causes all three racon outputs to be uploaded in an apparently arbitrary order.
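
To illustrate why upload order matters, here is a minimal sketch (not Snakemake's actual code): an output is treated as outdated when any of its inputs is newer, and on S3 the only available "mtime" is the object's LastModified, i.e. its upload time.

    from datetime import datetime, timezone

    def is_stale(output_mtime, input_mtimes):
        """Illustrative helper: an output is outdated if any input is newer."""
        return any(m > output_mtime for m in input_mtimes)

    # From the S3 listing above: racon2.fasta (an input) was uploaded at
    # 04:18:44, racon3.fasta (the output it feeds) at 04:18:40.
    racon2_uploaded = datetime(2020, 8, 18, 4, 18, 44, tzinfo=timezone.utc)
    racon3_uploaded = datetime(2020, 8, 18, 4, 18, 40, tzinfo=timezone.utc)
    print(is_stale(racon3_uploaded, [racon2_uploaded]))  # True -> rule racon gets rescheduled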

The user-level fix is to edit the timestamps in AWS and clear out the ".snakemake" cache. The long-term solution is to make Tibanna upload output files in the order defined by the DAG representing the group job being executed. Not being very familiar with the source code of Snakemake or Tibanna, I'm not certain this is an easy or even possible change to make.
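
For anyone hitting the same problem, a rough sketch of the user-level fix with boto3 (assumed available; the bucket and keys are the ones from this issue and should be adapted). Copying each object onto itself with replaced metadata refreshes its LastModified timestamp; doing this in dependency order, upstream first, makes every output newer than its inputs again.

    import boto3

    s3 = boto3.client("s3")
    bucket = "salk-tm-dev"
    # Re-touch the group's outputs in dependency order so each output ends
    # up with a LastModified newer than its inputs.
    keys_in_dag_order = [
        "pennycress/1326/penny_1326.flye.racon1.fasta",
        "pennycress/1326/penny_1326.flye.racon2.fasta",
        "pennycress/1326/penny_1326.flye.racon3.fasta",
    ]
    for key in keys_in_dag_order:
        # An in-place copy with MetadataDirective="REPLACE" updates
        # LastModified (note: copy_object only handles objects up to 5 GB).
        s3.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            MetadataDirective="REPLACE",
            Metadata={"retouched": "true"},  # a metadata change makes the self-copy valid
        )

After that, clear the local ".snakemake" cache as mentioned above so the stale records are dropped.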

@nhartwic nhartwic changed the title Tibanna sometimes thinks files need to be remade Tibanna Group Jobs Upload Files Apparently Arbitrarily Sep 20, 2020
@nhartwic nhartwic changed the title Tibanna Group Jobs Upload Files Apparently Arbitrarily Tibanna Group Jobs Upload Files in Apparently Arbitrary Order Sep 20, 2020
SooLee (Member) commented Sep 21, 2020

@nhartwic Thank you for reporting this. The best fix would be to preserve the timestamps of the output files, but as far as I know AWS S3 does not provide that option. The output files from a given instance ('group') can be sorted before being uploaded to S3, but that still would not guarantee all the output files are uploaded in the correct order if there are multiple instances running concurrently (parallel independent group jobs). I'll see if I can at least get the files ordered within a group.
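
For illustration, a rough sketch of what ordering the outputs within a group might look like (not Tibanna's actual code; the function and variable names are made up, and graphlib requires Python 3.9+):

    from graphlib import TopologicalSorter

    def upload_group_outputs(outputs_by_job, job_dependencies, upload):
        """Upload a group job's outputs so that upstream outputs go first.

        outputs_by_job:   {job_name: [S3 keys produced by that job]}
        job_dependencies: {job_name: set of jobs it depends on within the group}
        upload:           callable that uploads a single key
        """
        # A topological order guarantees each job's outputs are uploaded after
        # the outputs of the jobs it depends on, so LastModified timestamps
        # end up in DAG order.
        for job in TopologicalSorter(job_dependencies).static_order():
            for key in outputs_by_job.get(job, []):
                upload(key)

    # The racon chain from this issue, with upload stubbed out as print():
    deps = {"racon1": set(), "racon2": {"racon1"}, "racon3": {"racon2"}}
    outs = {
        "racon1": ["penny_1326.flye.racon1.fasta"],
        "racon2": ["penny_1326.flye.racon2.fasta"],
        "racon3": ["penny_1326.flye.racon3.fasta"],
    }
    upload_group_outputs(outs, deps, upload=print)  # racon1, then racon2, then racon3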

nhartwic (Author) commented:

Sounds good. As long as the output files for each group are ordered correctly, that is probably sufficient: any dependencies of the group must have been uploaded prior to the group's execution, and any downstream products must be uploaded after, just due to the way groups get spawned. The only potential errors would arise if multiple partial runs were performed in which the DAG topology meaningfully changes, but I'd argue that in such cases the rules themselves are the problem and Snakemake in general can't resolve the issue. As an example, imagine run 1 has structure "rule A -> rule B" and run 2 has structure "rule B -> rule A -> rule C". This should probably never happen and ought to be avoided by workflow writers.
