Choose Kraken2-filtered host #350

tetedange13 · 2023-01-12T13:00:02Z

Description of feature

Hi,

This is probably related to issue #113 (comment)

We are trying to shift from a detection of solely Sars-cov-2 (using Artic primers) to a more "metagenomics" approach (using Illumina RVOP kit)
=> Looks like we can stick with viralrecon for that, so first : thanks for this wonderful tool !

Currently filtering of host reads with Kraken is based on keeping only "unassigned" reads after classification against a Kraken DB composed of Human only (deduced from here)

Would be great if we could run Kraken against a more diverse index (like "Standard" one) through --kraken2_db, then filter out host reads by specifying host taxid (new parameter such as --kraken_host_taxid)
=> This way we could keep filtering host reads, but also use Kraken at its full potential (to have a rapid global look of what's inside our sample)

It could be based on extract_kraken_reads.py from KrakenTools suite
=> But I guess it will require to add this script as an nf-core module first (or maybe I missed it ?)

Thanks again !
Kind regards,
Felix.

The text was updated successfully, but these errors were encountered:

drpatelh · 2023-03-05T09:37:33Z

Hi @tetedange13 ! This is an interesting approach. So if I understand correctly, rather than creating a subsetted Kraken database we would provide a full-fledged one and provide the taxonomy id of a genome in the database instead? Could you provide some more information (or even a PR would be better?!) that shows how we could add this to the pipeline? For now, it doesn't have to be an nf-core module - we can add that later.

tetedange13 · 2023-04-03T09:29:48Z

Hi @drpatelh,

No need to provide a full-fledged Kraken index, you can completely keep going with your subsetted one
=> My idea is simply to add a step of extraction of reads NOT belonging to a user-specified taxID, handled by dedicated KrakenTools script : extract_kraken_reads.py

This way --kraken2_{variants,assembly}_host_filter would be compatible with usage of a bigger Kraken index (Human reads will be removed by default)
=> We could also imagine a case where someone has a host to remove, which is different from "Human" (still in the case of using a bigger Kraken index)

Command to run Kraken :

kraken2 \
        --db kraken2_human \
        --threads 8 \
        --unclassified-out ${prefix}.unclassified#.fastq \
        --classified-out ${prefix}.classified#.fastq \
        --report ${prefix}.kraken2.report.txt \
        --gzip-compressed \
        --paired \
        --report-zero-counts \
        ${prefix}_1.trim.fastq.gz ${prefix}_2.trim.fastq.gz

Would become :

kraken2 \
        --db kraken2_human \
        --threads 8 \
        --output ${prefix}.classifiedreads.kraken2.txt
        --report ${prefix}.kraken2.report.txt \
        --gzip-compressed \
        --paired \
        --report-zero-counts \
        ${prefix}_1.trim.fastq.gz ${prefix}_2.trim.fastq.gz

=> Instead of outputting {un,}classified{_1,_2}.fastq, we simply collect Kraken output to a file with --output
=> Kraken output actually lists each FASTQ headers, but adding info of Kraken classification (see: https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#standard-kraken-output-format)

Then would use extract_kraken_reads.py on Kraken output file (Kraken report is required too)
=> To exclude a given species by its taxID (default would be Human=9606) ; "unclassified" reads should be kept too
(there is also a parameter--include-parents, to exlude reads outside of Human + its direct ancestry, but it may be risky when using a Kraken index with closely related species ?)
(see 3rd example of : https://github.com/jenniferlu717/KrakenTools#4-extract_kraken_readspy---exclude-flag)

extract_kraken_reads.py \
        -k ${prefix}.classifiedreads.kraken2.txt\
        --report ${prefix}.kraken2.report.txt \
        -s1 ${prefix}_1.trim.fastq.gz -s2 {prefix}_2.trim.fastq.gz \
        --taxid 9606 \
        --exclude \
        -o {prefix}.unclassified_1.fastq.gz -o2 {prefix}.unclassified_2.fastq.gz

=> In my example I kept current filename convention ${prefix}.unclassified_{1,2} for FASTQ produced by "extract_kraken_reads.py"
=> But in case we use a more complete Kraken index, these files won't contain only "unclassified" reads, strictly speaking

tetedange13 · 2023-04-03T09:39:14Z

I am sorry but I do not know Nextflow enough to be able to submit a PR
=> But here are some more insights to help implementation :

I tested locally this 'Kraken + extract_kraken_reads' solution (outside viralrecon)
=> Produced FASTQ were well identical to ${prefix}.unclassified_{_1,_2}.fastq.gz obtained with viralrecon (except I am on v2.4)
Inside Kraken nf-core module, there is already an option to enable writting of Kraken output ('--output' parameter)

viralrecon/modules/nf-core/kraken2/kraken2/main.nf

Line 34 in 3731dd3

def readclassification_option = save_reads_assignment ? "--output ${prefix}.kraken2.classifiedreads.txt" : "--output /dev/null"
There is an existing nf-core module krakentools/combinetreports
=> Which can be used as a start, because mentionned Singularity container "krakentools:1.2--pyh5e36f6f_0" contains extract_kraken_reads.py script too
=> You could create a copy of this one as a local module for extract_kraken_reads
Regarding expected inputs for this 'krakentools/extractkrakenreads' module :

A pair of FASTQ (if paired-end)
A Kraken report
A Kraken output file
A taxID, that should be exposed as a CLI argument
(extract_kraken_reads.py is actually able to process a list of taxID, but maybe for later ?)

=> And expected output is simply a new pair of FASTQ (if paired-end), possibly called ${prefix}.unclassified{_1,_2}.fastq.gz

Hope this helps !
Have a nice day,
Felix.

tetedange13 added the enhancement Improvement for existing functionality label Jan 12, 2023

tetedange13 changed the title ~~Choose Kraken2-removed host~~ Choose Kraken2-filtered host Jan 12, 2023

drpatelh added this to the 2.6 milestone Mar 5, 2023

drpatelh modified the milestones: 2.6, 2.7 Mar 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Choose Kraken2-filtered host #350

Choose Kraken2-filtered host #350

tetedange13 commented Jan 12, 2023

drpatelh commented Mar 5, 2023

tetedange13 commented Apr 3, 2023

tetedange13 commented Apr 3, 2023

Choose Kraken2-filtered host #350

Choose Kraken2-filtered host #350

Comments

tetedange13 commented Jan 12, 2023

Description of feature

drpatelh commented Mar 5, 2023

tetedange13 commented Apr 3, 2023

tetedange13 commented Apr 3, 2023