Nanoflow: a NANOpore sequencing data bioinformatics workFLOW

Nanoflow is a pipeline written in Snakemake that automates many of the steps in whole-genome sequencing analysis of Oxford Nanopore data: quality control, de novo assembly, and genome annotation.

20190104: add github page

Installation

Install the Conda environment manager, and make sure the ~/.condarc file is in your home directory.

nano ~/.condarc

Copy the following to the file.

channels:
- bioconda
- conda-forge
- defaults
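
If you would rather not edit the file by hand, the same channel list can be written from the shell. This is only a minimal sketch: the helper name write_condarc is hypothetical, and it deliberately refuses to overwrite a file that already exists.

```shell
# write_condarc PATH: write the channel list shown above to PATH.
# Hypothetical helper; it will not overwrite an existing file.
write_condarc() {
    target="$1"
    if [ -e "$target" ]; then
        echo "refusing to overwrite $target" >&2
        return 1
    fi
    cat > "$target" <<'EOF'
channels:
- bioconda
- conda-forge
- defaults
EOF
}
```

For example, `write_condarc ~/.condarc` creates the file only when it is missing.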

Install GCC5 by cloning Jesse's conda-gcc5 repository and creating a new conda environment named nanoflow.

cd ~
git clone https://github.com/ressy/conda-gcc5.git
cd conda-gcc5
bash setup.sh nanoflow

Clone this repository into a local directory and install the packages into the nanoflow environment.

git clone https://github.com/zhaoc1/nanoflow.git nanoflow
cd nanoflow
source activate nanoflow

conda install -n nanoflow -c bioconda snakemake=4.8.1
conda env update --name=nanoflow --file env.yml

Clone Ryan Wick's Basecalling-comparison repository

mkdir local
cd local
git clone https://github.com/rrwick/Basecalling-comparison.git

Download other packages into local directory

## Canu 1.8
wget https://github.com/marbl/canu/archive/v1.8.tar.gz
tar -xvf v1.8.tar.gz
cd canu-1.8/src
make -j 4
cd ../..   # back to the local directory

## Nanopolish v0.9.0
git clone --recursive https://github.com/jts/nanopolish.git
cd nanopolish
make
cd ..      # back to the local directory

## Unicycler
git clone https://github.com/rrwick/Unicycler.git
cd Unicycler
python3 setup.py install
cd ..      # back to the local directory

## set up for QUAST
git clone https://github.com/lucian-ilie/E-MEM.git
cd E-MEM
make
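
Once the builds finish, it can help to confirm that the executables are reachable before starting the workflow. The following is a small sketch, not part of Nanoflow: check_tools is a hypothetical helper, and the tool names you pass to it assume the corresponding build directories have been added to your PATH.

```shell
# check_tools NAME...: report which of the named commands are missing from PATH.
# Returns 0 only if every command is found.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call would be `check_tools snakemake canu nanopolish unicycler e-mem`; adjust the names to match how the tools are installed on your machine.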

Usage

Basecalling: the raw fast5 signal data files are basecalled using ONT’s Albacore command-line tool (v2.2.7), with barcode demultiplexing and FASTQ output. You can run this step either through Snakemake or by running the run_albacore.sh bash script with the proper directory paths.

snakemake --configfile config.yml all_basecalling
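
The contents of run_albacore.sh are not reproduced here, so the following is only a guess at the general shape of an Albacore 2.x call with barcode demultiplexing and FASTQ output. The helper albacore_cmd is hypothetical and only prints the command instead of running it; the flowcell and kit values are placeholders that must be changed to match your run.

```shell
# albacore_cmd FAST5_DIR OUT_DIR [THREADS]: print (do not run) a basecalling command.
# FLOWCELL and KIT are placeholders; set them to match your sequencing run.
albacore_cmd() {
    fast5_dir="$1"
    out_dir="$2"
    threads="${3:-8}"
    echo read_fast5_basecaller.py \
        --flowcell "${FLOWCELL:-FLO-MIN106}" \
        --kit "${KIT:-SQK-LSK108}" \
        --barcoding \
        --output_format fastq \
        --recursive \
        --input "$fast5_dir" \
        --save_path "$out_dir" \
        --worker_threads "$threads"
}
```

Printing the command first makes it easy to check the directory paths before committing to a long basecalling run.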

Preprocess: quality-filter the long reads, keep only the confidently binned reads, and subsample them.

snakemake --configfile config.yml --cores 8 all_qc
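
To make the filtering idea concrete, here is a sketch of the simplest possible read filter: dropping reads shorter than a minimum length. It is an illustration only and not the tool that the all_qc rules run; it assumes four-line FASTQ records, and it does not look at base quality, barcode confidence, or subsampling. The helper name filter_fastq_by_length is hypothetical.

```shell
# filter_fastq_by_length MIN_LEN: read 4-line FASTQ records on stdin and
# print only those whose sequence is at least MIN_LEN bases long.
filter_fastq_by_length() {
    awk -v min="$1" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 { if (length(seq) >= min) print header "\n" seq "\n" plus "\n" $0 }
    '
}
```

For example, `filter_fastq_by_length 1000 < reads.fastq > reads.min1k.fastq`, where both file names are placeholders.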

Hybrid assembly option 1: Canu + Nanopolish (+ Circlator + Pilon)
- Long-read-only product: a long-read-only assembly polished with the signal data, which can be used as input to hybrid assembly option 3.

snakemake --configfile config.yml --cores 8 all_draft1

Hybrid assembly option 2: Unicycler (default mode)
- depth=X in the FASTA header: Unicycler writes this field to preserve the relative depths of the contigs. This is mainly useful for plasmid sequences, which should be more highly represented in the reads than the chromosomal sequence.

snakemake --configfile config.yml --cores 8 all_draft2
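
The depth field can be pulled out of the assembly with a few lines of awk. This sketch assumes headers of the form `>1 length=... depth=1.00x circular=true`; the helper name contig_depths is hypothetical.

```shell
# contig_depths: read a FASTA on stdin and print "name<TAB>depth" for each
# header that carries a depth=...x field.
contig_depths() {
    awk '
        /^>/ {
            name = substr($1, 2)
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^depth=/) {
                    depth = $i
                    sub(/^depth=/, "", depth)
                    sub(/x$/, "", depth)
                    print name "\t" depth
                }
            }
        }
    '
}
```

Running `contig_depths < assembly.fasta` (a placeholder file name) gives a quick table in which contigs with a depth well above that of the chromosome stand out as plasmid candidates.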

Hybrid assembly option 3: Unicycler (existing long reads assembly option)

snakemake --configfile config.yml --cores 8 all_draft3

For the final draft genome, a common practice is to choose the two assembly results you are happiest with, assess them against the provided reference genome, compare one to the other, and map the reads back to the draft genomes to calculate the coverage. All of these tasks are implemented in assembly.rules.
- We sequenced C. diff isolates at PCMP, and therefore in the run_prokka rule I used the genus-level Prokka database. If you are studying a different organism, please build the Prokka genus database yourself and change the corresponding lines in the run_prokka rule.

snakemake --configfile config.yml --cores 8 all_final
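
The coverage step can be illustrated outside the workflow as well. `samtools depth` prints three tab-separated columns (contig, position, depth), and the sketch below averages the third column per contig. It is not necessarily how assembly.rules computes coverage, and the helper name mean_depth is hypothetical.

```shell
# mean_depth: read `samtools depth`-style rows (contig, position, depth) on
# stdin and print the mean depth per contig, in order of first appearance.
mean_depth() {
    awk -F '\t' '
        !($1 in n) { order[++k] = $1 }
        { sum[$1] += $3; n[$1]++ }
        END {
            for (i = 1; i <= k; i++)
                printf "%s\t%.2f\n", order[i], sum[order[i]] / n[order[i]]
        }
    '
}
```

For example, `samtools depth -a reads_vs_draft.sorted.bam | mean_depth`, where the BAM name is a placeholder. The mean is taken over the positions that appear in the input, so the `-a` option is what makes zero-depth positions count.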

Assembly assess and comparison

Metrics description
- Misjoins: locations where two adjacent sequences in the assembly should be split apart and placed at distinct locations in order to match the reference.
- Relocation: a misjoin where a segment needs to be moved elsewhere on the chromosome.
- Misassemblies: QUAST categorizes misassemblies as either local (less than 1 kbp discrepancy) or extensive (more than 1 kbp discrepancy).

A good reference guide for interpreting the dot plot is available here.

Some good tutorials 😳
- Align two draft sequences using MUMmer.
- Evaluate the assembly using MUMmer.
- Assembly evaluation with QUAST.
- Multiple assemblies comparison using QUAST.
- Highly similar sequences with rearrangements using run-mummer3 [TODO].
- Assembly to assembly comparisons using Minimap2 [TODO].
- Microbial genomics tutorials using PacBio long reads from ABRPI-Training.
- de.NBI Nanopore Training Course.

Wish you knew sooner 😔
- Minimap2 and the future of BWA, on Heng Li's blog.
- Long reads assembly: indels cause interrupted genes, on Mick Watson's blog. I also have an example of this issue: demo_interrupted_genes.
- This paper discusses the commonly misused max_target_seqs parameter of BLAST.

Two optional features provided by Nanoflow:
1. assess draft genomes using QUAST
```
snakemake --configfile config.yml _all_quast --use-conda
```
2. IGV: short/long reads mapped to draft assembly
- Refer to the Sunbeam subworkflow: sbx_igv
```
snakemake --configfile config.yml _all_igv
```

To generate the bioinformatics report, refer to bioinfo_report.Rmd. An example output is shown in bioinfo_report.pdf.
