This repository has been archived by the owner on May 31, 2024. It is now read-only.

WINK

What's In my Nanopore reads, with Kraken2, in real-time

Archived, not maintained anymore

Description

WINK is a platform for real-time phylogenetic classification and species quantification of Nanopore sequencing data, based on kraken2 and bracken. It can be used both in real time (monitor a specified folder, e.g. fastq_pass, for new reads and continuously update the results) and post-run (collect all the reads and perform the analysis once; see footnote 1). The software consists of two parts: a nextflow pipeline, which can also be executed on its own, and a graphical user interface (a Shiny app), which collects the output of the nextflow pipeline and displays it as an interactive dashboard page.

Performance

The performance is that of kraken2/bracken. As an example, here are the results of a small Nanopore Flongle run (11k reads) with the Zymo HMW DNA standard.

Theoretical and measured species and species abundance (in %) in the Zymo HMW DNA standard. The theoretical composition is as supplied by Zymo.

| Name | Theoretical (%) | Measured (%) |
|---|---|---|
| Staphylococcus aureus | 19.60 | 20.11 |
| Enterococcus faecalis | 18.80 | 16.28 |
| Listeria monocytogenes | 17.80 | 14.93 |
| Salmonella enterica | 11.20 | 12.34 |
| Escherichia coli | 10.90 | 11.32 |
| Pseudomonas aeruginosa | 7.80 | 8.63 |
| Bacillus subtilis | 13.20 | 7.71 |
| Bacillus intestinalis | NA | 3.39 |
| Bacillus sp. LM 4-2 | NA | 0.53 |
| Bacillus velezensis | NA | 0.52 |

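Apart from the Bacillus misassignments (Bacillus subtilis is measured at 7.71% instead of 13.20%, while 4.44% of the abundance is assigned to three Bacillus taxa that are not in the standard), the measured abundances are within 3 percentage points of the theoretical values, even with this small dataset.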
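The comparison can be reproduced with a few lines of Python. This is only a sketch that recomputes the differences from the numbers in the table; it is not part of WINK, and the variable names are made up for illustration.

```python
# Theoretical and measured abundances (in %), copied from the table above.
theoretical = {
    "Staphylococcus aureus": 19.60,
    "Enterococcus faecalis": 18.80,
    "Listeria monocytogenes": 17.80,
    "Salmonella enterica": 11.20,
    "Escherichia coli": 10.90,
    "Pseudomonas aeruginosa": 7.80,
    "Bacillus subtilis": 13.20,
}
measured = {
    "Staphylococcus aureus": 20.11,
    "Enterococcus faecalis": 16.28,
    "Listeria monocytogenes": 14.93,
    "Salmonella enterica": 12.34,
    "Escherichia coli": 11.32,
    "Pseudomonas aeruginosa": 8.63,
    "Bacillus subtilis": 7.71,
    "Bacillus intestinalis": 3.39,
    "Bacillus sp. LM 4-2": 0.53,
    "Bacillus velezensis": 0.52,
}

# Signed difference (measured - theoretical) for each species in the standard.
diff = {
    name: round(measured[name] - expected, 2)
    for name, expected in theoretical.items()
}

# Taxa reported in the measurement that are not part of the standard.
unexpected = {
    name: value for name, value in measured.items() if name not in theoretical
}
unexpected_total = round(sum(unexpected.values()), 2)

# Print the species with the largest absolute deviation first.
for name, delta in sorted(diff.items(), key=lambda item: abs(item[1]), reverse=True):
    print(f"{name:<26}{delta:+6.2f}")
print(f"taxa not in the standard: {unexpected_total:.2f}")
```

Adding the three unexpected Bacillus taxa to the Bacillus subtilis value accounts for most of the missing Bacillus subtilis abundance, which suggests that these reads are assigned to closely related genomes in the database.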

Install

Database

Download or prepare the kraken2 database and the bracken indexes that you want to use. A very good database/index source with different combinations of RefSeq genomes is Ben Langmead's index zone.

nextflow part

  • If you don't have nextflow:
curl -s https://get.nextflow.io | bash
  • Get the latest WINK version from GitHub:
git clone https://github.com/angelovangel/wink.git

The nextflow pipeline can be run with docker or conda (e.g. use -profile docker), in which case you don't need to install the dependencies manually. The dependencies are:

  • seqkit
  • kraken2
  • bracken
  • R

The results from the nextflow pipeline are by default saved in results-wink in the nextflow launch directory.

GUI part (Shiny)

The Shiny app dependencies are managed with renv. To install them, open the wink.Rproj project and call renv::restore().

Running the pipeline and explanation of the results

WINK can be run either via the Shiny app (in a browser, no command line needed) or by executing the nextflow pipeline from the command line (see footnote 2).

The input for the pipeline is one of:

  • the output folder of the MinKNOW software (usually this is the fastq_pass folder) or
  • any other folder where the basecalled data (fastq files) accumulate during a run. The run may be barcoded or not.

The results can be monitored in real time via the app. In addition, the nextflow pipeline outputs in real-time (all in results-wink):

  • latest-fastq - where the latest fastq files are collected, one file per sample. By default, the MinKNOW software writes 4k reads per fastq file and typically one sample can generate many such fastq files. The WINK pipeline merges all the available fastq files belonging to one sample, i.e.
data
├── barcode01
│   ├── PAE58908_pass_barcode01_d2c5c063_0.fastq
│   ├── PAE58908_pass_barcode01_d2c5c063_1.fastq
│   ├── PAE58908_pass_barcode01_d2c5c063_2.fastq

becomes barcode01.fastq. The latest-fastq folder is continuously updated as new reads are generated.

  • latest-stats - tables (one per sample/barcode) with the following columns: format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%). These are also continuously updated.

  • latest-bracken - a table with the bracken quantification of species abundance in the samples, with the following columns: file name taxonomy_id kraken_assigned_reads new_est_reads freq.
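The merging step described for latest-fastq above can be illustrated with a short Python sketch. It is not the code used by the pipeline; the function name merge_fastq_per_sample and its arguments are hypothetical, and it assumes uncompressed .fastq files that sit in one subfolder per sample, as in the barcode01 example.

```python
import shutil
from pathlib import Path


def merge_fastq_per_sample(input_dir, output_dir):
    """Concatenate all *.fastq files of each sample subfolder into <sample>.fastq.

    Returns a dict that maps the sample name to the merged file.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    merged = {}
    for sample_dir in sorted(p for p in input_dir.iterdir() if p.is_dir()):
        parts = sorted(sample_dir.glob("*.fastq"))
        if not parts:
            continue  # no reads for this sample yet
        target = output_dir / f"{sample_dir.name}.fastq"
        # Rewrite the merged file from scratch so it always reflects
        # everything that is currently in the sample folder.
        with target.open("wb") as out:
            for part in parts:
                with part.open("rb") as src:
                    shutil.copyfileobj(src, out)
        merged[sample_dir.name] = target
    return merged
```

Calling merge_fastq_per_sample("data", "latest-fastq") on the folder shown above would write latest-fastq/barcode01.fastq. The files are concatenated in lexical order of their names, which does not matter for classification because each read is classified independently.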

Under the hood

The pipeline is built with nextflow and Shiny, using some built-in nextflow functions to watch for new reads in the input folder. As new reads are generated, their phylogenetic assignment is performed with kraken2 and the relative species composition is determined with bracken. In parallel, various statistics about the fastq reads are collected and updated during the run.
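To make the idea of watching a folder concrete, here is a minimal Python sketch. The pipeline itself relies on nextflow's built-in file watching, so this is a simplified polling stand-in and not the actual mechanism; the names find_new_fastq and watch, and the handle callback, are hypothetical.

```python
import time
from pathlib import Path


def find_new_fastq(folder, seen):
    """Return the fastq files under `folder` that are not in `seen` yet.

    The files that are returned are added to `seen`, so each file is
    reported only once.
    """
    new_files = sorted(p for p in Path(folder).rglob("*.fastq") if p not in seen)
    seen.update(new_files)
    return new_files


def watch(folder, handle, interval=5.0, max_polls=None):
    """Poll `folder` every `interval` seconds and call `handle` on new files.

    With max_polls=None the loop never ends, like the pipeline, and has
    to be interrupted by the user.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for path in find_new_fastq(folder, seen):
            handle(path)  # e.g. trigger classification and statistics
        polls += 1
        time.sleep(interval)
```

For example, watch("fastq_pass", print) would print the path of every fastq file as it appears. In WINK, the corresponding step would run kraken2, bracken and the read statistics on the new data.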


1: The reads generated before the pipeline is started are also included in the analysis. This is done by changing the timestamps of the existing files, so take care if you rely on the original timestamps for other purposes.

2: The nextflow pipeline does not finish on its own, because it keeps watching for new files. When run via the Shiny app, it is killed by the R process when you close the browser window. When run on the command line, you can stop it with Ctrl-C.
