Change krona wrapper to read from summary report instead of read report #777

dpark01 · 2018-01-31T21:30:56Z

In our current invocations in the WDL workflow, we run kraken, and then krona, on a bunch of samples to save on DB staging time and such. Currently, the krona portion is run in a for-loop after kraken completes.

I'm observing that the krona portion sometimes takes longer than the kraken portion (e.g. multiple hours for krona vs. 1 hr for kraken on a lane of a 2500 high output run). This is both dumb and very costly: krona is only utilizing a single core on an instance that is sized very large for kraken's sake.

Edit: this issue will now focus on changing the krona entry point to run off of the kraken summary report file exclusively (with no other inputs, see comments below for details).

yesimon · 2018-01-31T21:36:07Z

Actually, we can just change krona to interpret the kraken report instead, which would run almost instantaneously. The only slight drawback is that it will report each taxon as having 1 read with weight = number of reads. (Krona has a feature to multiply the read count by an additional weight variable for each taxon.)

dpark01 · 2018-01-31T21:37:36Z

Wait, I've always wondered why krona couldn't run directly off the summary txt file, but you're saying it can. I don't understand the drawback then -- what would the difference be from the current behavior?

yesimon · 2018-01-31T21:39:14Z

The only difference would be krona reporting n_i reads with weight 1 versus 1 read with n_i weight.
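To make the "1 read with n_i weight" idea concrete, here is a minimal sketch of pulling one (taxid, weight) pair per taxon out of a kraken summary report. It assumes the tab-separated six-column layout described in the Kraken manual; the function name `report_to_magnitudes` is hypothetical, and how the krona wrapper would actually consume these pairs is not decided here.

```python
def report_to_magnitudes(report_lines):
    """Yield (taxid, direct_read_count) pairs from kraken summary report lines.

    Assumes tab-separated columns: percent, clade reads, direct reads,
    rank code, taxid, indented name.
    """
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        direct_reads = int(fields[2])  # column 3: reads assigned directly to this taxon
        taxid = int(fields[4])         # column 5: NCBI taxonomy ID
        if direct_reads > 0:
            # One entry per taxon, weighted by n_i, instead of n_i entries of weight 1.
            yield taxid, direct_reads
```

The sketch uses the direct count (column 3) rather than the clade count (column 2), since clade counts already include every descendant and would be counted twice if a downstream tool sums children into parents.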

dpark01 · 2018-01-31T21:44:12Z

Visually, the pies would look the same... that seems worthwhile... does it even need a taxonomy db anymore at that point?

yesimon · 2018-01-31T21:45:00Z

It still does, but krona only needs its taxonomy.tab file, which is <100 MB.

dpark01 · 2018-01-31T22:16:22Z

What, informationally, is in the tab file that isn’t already in the summary report file?

dpark01 · 2018-02-01T16:03:37Z

Hm. I kind of like the idea of omitting the taxonomy.tab file. Partly so that the user doesn't have to provide one more file (even if we default it), but mostly because it always bothers me that all these tools are potentially using different taxonomy databases that can get out of sync with each other.

According to this, it looks like the taxonomy.tab file is just:

1. taxid
2. tree distance from root
3. parent taxid
4. taxonomic rank in lowercase English ("superkingdom", "genus", "species")
5. name

According to the Kraken manual, the report file is:

1. Percentage of reads covered by the clade rooted at this taxon
2. Number of reads covered by the clade rooted at this taxon
3. Number of reads assigned directly to this taxon
4. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply '-'.
5. NCBI taxonomy ID
6. Indented scientific name

So to create krona's taxonomy.tab file on the fly from kraken's summary report:

1. taxid: column 5
2. tree distance from root: count the number of leading spaces in column 6
3. parent taxid: this will require a little stateful parsing, but can be done
4. taxonomic rank: this is just mapping the single letter from column 4 into a full lowercase word
5. name: strip leading spaces from column 6
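The mapping above could be sketched roughly as follows. This is an illustration under stated assumptions, not the eventual implementation: the function name `report_to_taxonomy_rows` is hypothetical, the report is assumed to be tab-separated with one indent step per tree level (`indent_width=2` is a guess that should be checked against real reports), the (D)omain code is assumed to correspond to "superkingdom", ranks reported as '-' fall back to "no rank", and the root is given itself as parent.

```python
RANK_NAMES = {
    "D": "superkingdom",  # assumption: kraken's (D)omain maps to "superkingdom"
    "K": "kingdom",
    "P": "phylum",
    "C": "class",
    "O": "order",
    "F": "family",
    "G": "genus",
    "S": "species",
}


def report_to_taxonomy_rows(report_lines, indent_width=2):
    """Yield (taxid, depth, parent_taxid, rank, name) from a kraken summary report."""
    ancestors = []  # ancestors[d] is the taxid most recently seen at depth d
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or fields[3].strip() == "U":
            continue  # skip malformed lines and the unclassified row
        taxid = int(fields[4])
        raw_name = fields[5]
        # Depth comes from the indentation of the name column.
        depth = (len(raw_name) - len(raw_name.lstrip(" "))) // indent_width
        # Stateful part: drop everything at this depth or deeper, so the
        # last remaining entry is this taxon's parent.
        del ancestors[depth:]
        parent = ancestors[-1] if ancestors else taxid  # assumption: root is its own parent
        ancestors.append(taxid)
        rank = RANK_NAMES.get(fields[3].strip(), "no rank")
        yield taxid, depth, parent, rank, raw_name.strip()
```

Each yielded tuple can be written out as one tab-separated line. Whether krona accepts a table built this way in place of its own taxonomy.tab would still need to be verified.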

dpark01 · 2018-02-02T14:19:59Z

I'm changing this particular Issue to focus on the new direction discussed here in the comments and will backlog it for some future time.

For reference, here was the original thought:

I propose splitting kraken from krona in the WDL tasks (they're already split on the Snakemake side) and invoking the krona task within a scatter (one sample per task) after kraken. The krona task would need no more than 4-5GB RAM and 1 CPU core. I'd set dx_instance_type to mem2_hdd2_x2 (most cost effective for this).

I will separately implement a quick and dirty change that invokes krona from GNU parallel within the same WDL task (instead of a bash for loop) just to improve things for now until we have time for this larger issue.

dpark01 added this to To Do (this release) in v1.19.2 via automation Jan 31, 2018

dpark01 changed the title ~~WDL: break krona from kraken~~ Change krona wrapper to read from summary report instead of read report Feb 2, 2018

dpark01 moved this from To Do (this release) to Backlog (not necessarily this release) in v1.19.2 Feb 2, 2018

dpark01 added this to Backlog (not necessarily this release) in v1.19.3 Feb 6, 2018

dpark01 mentioned this issue Feb 20, 2019

Revamp of metagenomics rules #927

Closed
