Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Choose Kraken2-filtered host #350

Open
tetedange13 opened this issue Jan 12, 2023 · 3 comments
Open

Choose Kraken2-filtered host #350

tetedange13 opened this issue Jan 12, 2023 · 3 comments
Labels
enhancement Improvement for existing functionality
Milestone

Comments

@tetedange13
Copy link

Description of feature

Hi,

This is probably related to issue #113 (comment)

We are trying to shift from a detection of solely Sars-cov-2 (using Artic primers) to a more "metagenomics" approach (using Illumina RVOP kit)
=> Looks like we can stick with viralrecon for that, so first : thanks for this wonderful tool !

Currently filtering of host reads with Kraken is based on keeping only "unassigned" reads after classification against a Kraken DB composed of Human only (deduced from here)

Would be great if we could run Kraken against a more diverse index (like "Standard" one) through --kraken2_db, then filter out host reads by specifying host taxid (new parameter such as --kraken_host_taxid)
=> This way we could keep filtering host reads, but also use Kraken at its full potential (to have a rapid global look of what's inside our sample)

It could be based on extract_kraken_reads.py from KrakenTools suite
=> But I guess it will require to add this script as an nf-core module first (or maybe I missed it ?)


Thanks again !
Kind regards,
Felix.

@tetedange13 tetedange13 added the enhancement Improvement for existing functionality label Jan 12, 2023
@tetedange13 tetedange13 changed the title Choose Kraken2-removed host Choose Kraken2-filtered host Jan 12, 2023
@drpatelh
Copy link
Member

drpatelh commented Mar 5, 2023

Hi @tetedange13 ! This is an interesting approach. So if I understand correctly, rather than creating a subsetted Kraken database we would provide a full-fledged one and provide the taxonomy id of a genome in the database instead? Could you provide some more information (or even a PR would be better?!) that shows how we could add this to the pipeline? For now, it doesn't have to be an nf-core module - we can add that later.

@drpatelh drpatelh added this to the 2.6 milestone Mar 5, 2023
@drpatelh drpatelh modified the milestones: 2.6, 2.7 Mar 12, 2023
@tetedange13
Copy link
Author

Hi @drpatelh,

No need to provide a full-fledged Kraken index, you can completely keep going with your subsetted one
=> My idea is simply to add a step of extraction of reads NOT belonging to a user-specified taxID, handled by dedicated KrakenTools script : extract_kraken_reads.py

This way --kraken2_{variants,assembly}_host_filter would be compatible with usage of a bigger Kraken index (Human reads will be removed by default)
=> We could also imagine a case where someone has a host to remove, which is different from "Human" (still in the case of using a bigger Kraken index)

Command to run Kraken :

kraken2 \
        --db kraken2_human \
        --threads 8 \
        --unclassified-out ${prefix}.unclassified#.fastq \
        --classified-out ${prefix}.classified#.fastq \
        --report ${prefix}.kraken2.report.txt \
        --gzip-compressed \
        --paired \
        --report-zero-counts \
        ${prefix}_1.trim.fastq.gz ${prefix}_2.trim.fastq.gz

Would become :

kraken2 \
        --db kraken2_human \
        --threads 8 \
        --output ${prefix}.classifiedreads.kraken2.txt
        --report ${prefix}.kraken2.report.txt \
        --gzip-compressed \
        --paired \
        --report-zero-counts \
        ${prefix}_1.trim.fastq.gz ${prefix}_2.trim.fastq.gz

=> Instead of outputting {un,}classified{_1,_2}.fastq, we simply collect Kraken output to a file with --output
=> Kraken output actually lists each FASTQ headers, but adding info of Kraken classification (see: https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#standard-kraken-output-format)

Then would use extract_kraken_reads.py on Kraken output file (Kraken report is required too)
=> To exclude a given species by its taxID (default would be Human=9606) ; "unclassified" reads should be kept too
(there is also a parameter--include-parents, to exlude reads outside of Human + its direct ancestry, but it may be risky when using a Kraken index with closely related species ?)
(see 3rd example of : https://github.com/jenniferlu717/KrakenTools#4-extract_kraken_readspy---exclude-flag)

extract_kraken_reads.py \
        -k ${prefix}.classifiedreads.kraken2.txt\
        --report ${prefix}.kraken2.report.txt \
        -s1 ${prefix}_1.trim.fastq.gz -s2 {prefix}_2.trim.fastq.gz \
        --taxid 9606 \
        --exclude \
        -o {prefix}.unclassified_1.fastq.gz -o2 {prefix}.unclassified_2.fastq.gz

=> In my example I kept current filename convention ${prefix}.unclassified_{1,2} for FASTQ produced by "extract_kraken_reads.py"
=> But in case we use a more complete Kraken index, these files won't contain only "unclassified" reads, strictly speaking

@tetedange13
Copy link
Author

I am sorry but I do not know Nextflow enough to be able to submit a PR
=> But here are some more insights to help implementation :

  1. I tested locally this 'Kraken + extract_kraken_reads' solution (outside viralrecon)
    => Produced FASTQ were well identical to ${prefix}.unclassified_{_1,_2}.fastq.gz obtained with viralrecon (except I am on v2.4)

  2. Inside Kraken nf-core module, there is already an option to enable writting of Kraken output ('--output' parameter)

    def readclassification_option = save_reads_assignment ? "--output ${prefix}.kraken2.classifiedreads.txt" : "--output /dev/null"

  3. There is an existing nf-core module krakentools/combinetreports
    => Which can be used as a start, because mentionned Singularity container "krakentools:1.2--pyh5e36f6f_0" contains extract_kraken_reads.py script too
    => You could create a copy of this one as a local module for extract_kraken_reads

  4. Regarding expected inputs for this 'krakentools/extractkrakenreads' module :

  • A pair of FASTQ (if paired-end)
  • A Kraken report
  • A Kraken output file
  • A taxID, that should be exposed as a CLI argument
    (extract_kraken_reads.py is actually able to process a list of taxID, but maybe for later ?)

=> And expected output is simply a new pair of FASTQ (if paired-end), possibly called ${prefix}.unclassified{_1,_2}.fastq.gz


Hope this helps !
Have a nice day,
Felix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improvement for existing functionality
Projects
None yet
Development

No branches or pull requests

2 participants