zhaoc1/nanoflow

Nanoflow: a NANOpore sequencing data bioinformatics workFLOW

Nanoflow is a pipeline written in Snakemake that automates many of the steps of quality control, de novo assembly, and genome annotation in whole-genome sequencing analysis of Oxford Nanopore sequencing data.

20190104: added GitHub page

Installation

  1. Install the Conda environment manager, and make sure the ~/.condarc file is in your home directory.
nano ~/.condarc

Copy the following to the file.

channels:
- bioconda
- conda-forge
- defaults
  1. Install GCC5 by cloning Jesse's conda-gcc5 repository and creating a new conda environment named nanoflow.
cd ~
git clone https://github.com/ressy/conda-gcc5.git
cd conda-gcc5
bash setup.sh nanoflow
  1. Clone this repository into a local directory and install the packages into the nanoflow environment.
git clone https://github.com/zhaoc1/nanoflow.git nanoflow
cd nanoflow
source activate nanoflow

conda install -n nanoflow -c bioconda snakemake=4.8.1
conda env update --name=nanoflow --file env.yml
  1. Clone Ryan Wick's Basecalling-comparison repository
mkdir local
cd local
git clone https://github.com/rrwick/Basecalling-comparison.git
  1. Download and build the other packages in the local directory
## Canu 1.8
wget https://github.com/marbl/canu/archive/v1.8.tar.gz
tar -xvf v1.8.tar.gz
cd canu-1.8/src
make -j 4
# return to the local directory before fetching the next package
cd ../..

## Nanopolish v0.9.0
git clone --recursive https://github.com/jts/nanopolish.git
cd nanopolish
make
cd ..

## Unicycler
git clone https://github.com/rrwick/Unicycler.git
cd Unicycler
python3 setup.py install
cd ..

## set up for Quast
git clone https://github.com/lucian-ilie/E-MEM.git
cd E-MEM
make
cd ..

Usage

  1. Basecalling: basecall the raw fast5 signal data files with ONT’s Albacore command-line tool (v2.2.7), with barcode demultiplexing and FASTQ output. You can run this step either through Snakemake or by running the run_albacore.sh bash script with the appropriate directory paths.
snakemake --configfile config.yaml all_basecalling
  1. Preprocess: quality-filter the long reads, keep only the confidently binned reads, and subsample them
snakemake --configfile config.yml --cores 8 all_qc
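As a rough illustration of what quality filtering and subsampling mean for long reads, here is a minimal Python sketch. It is not the code behind all_qc: the function names, the thresholds (min_len, min_q), and the use of a plain arithmetic mean of Phred scores are all assumptions made for the example, and the barcode binning step is not covered.

```python
import random
from statistics import mean


def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from 4-line FASTQ records."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i], lines[i + 1], lines[i + 3]


def mean_phred(quality):
    """Arithmetic mean of Phred+33 scores (a simplification for illustration)."""
    return mean(ord(ch) - 33 for ch in quality)


def filter_and_subsample(records, min_len=1000, min_q=7.0, max_reads=None, seed=0):
    """Keep reads passing hypothetical length/quality cutoffs, then subsample."""
    kept = [
        rec for rec in records
        if len(rec[1]) >= min_len and mean_phred(rec[2]) >= min_q
    ]
    if max_reads is not None and len(kept) > max_reads:
        # A fixed seed keeps the subsample reproducible between runs.
        kept = random.Random(seed).sample(kept, max_reads)
    return kept
```

For example, `filter_and_subsample(parse_fastq(open("reads.fastq")), max_reads=50000)` would keep at most 50,000 reads that pass both cutoffs (reads.fastq is a placeholder file name).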
  1. Hybrid assembly option 1: Canu + Nanopolish (+ Circlator + Pilon)

    • Long-read-only product: a long-read-only assembly polished with the signal data, which can be used as input to hybrid assembly option 3.
snakemake --configfile config.yaml --cores 8 all_draft1
  1. Hybrid assembly option 2: Unicycler (default mode)

    • depth=X in the FASTA header: preserves the relative depth of each contig. This is mainly useful for plasmid sequences, which are typically more highly represented in the reads than the chromosomal sequence.
snakemake --configfile config.yaml --cores 8 all_draft2
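To make use of the depth=X field downstream, you can pull it out of each FASTA header. The sketch below assumes headers of the form `>1 length=4000000 depth=1.00x circular=true`. The example values are made up and `parse_depth` is a hypothetical helper, not part of Nanoflow or Unicycler.

```python
import re

# Matches e.g. "depth=1.00x" or "depth=12.5" anywhere in a FASTA header line.
DEPTH_RE = re.compile(r"depth=([0-9]*\.?[0-9]+)x?")


def parse_depth(header):
    """Return the depth value from a FASTA header, or None if it is absent."""
    match = DEPTH_RE.search(header)
    return float(match.group(1)) if match else None


def depths_from_fasta(lines):
    """Map each contig name (first word of the header) to its depth."""
    depths = {}
    for line in lines:
        if line.startswith(">"):
            name = line[1:].split()[0]
            depths[name] = parse_depth(line)
    return depths
```

A contig whose depth is several times that of the chromosome is a candidate multi-copy plasmid, which is the case the depth field is meant to preserve.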
  1. Hybrid assembly option 3: Unicycler (existing long reads assembly option)
snakemake --configfile config.yaml --cores 8 all_draft3
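If you want to run the preprocessing and assembly targets back to back, a small wrapper can build the same commands shown above. This is only a convenience sketch: `build_command` and `run_targets` are hypothetical helpers, the default config file name and core count are copied from the commands in this README, and you should pass whichever config file name your checkout actually uses.

```python
import subprocess

# Targets taken from the usage steps above, in the order they are listed.
TARGETS = ["all_qc", "all_draft1", "all_draft2", "all_draft3"]


def build_command(target, configfile="config.yaml", cores=8, dry_run=False):
    """Build the snakemake argument list for one target."""
    cmd = ["snakemake", "--configfile", configfile, "--cores", str(cores)]
    if dry_run:
        # -n asks snakemake to print the planned jobs without running them.
        cmd.append("-n")
    cmd.append(target)
    return cmd


def run_targets(targets=TARGETS, **kwargs):
    """Run each target in order, stopping at the first failure."""
    for target in targets:
        subprocess.run(build_command(target, **kwargs), check=True)
```

Calling `run_targets(dry_run=True)` first is a cheap way to check that every target resolves before committing to a long run.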
  1. For the final draft genome, a common practice is to choose the two assemblies you are happiest with, assess them against the provided reference genome, compare one to the other, and map reads back to the draft genomes to calculate the coverage. All of these tasks are implemented in assembly.rules.

    • We sequenced C. difficile isolates at PCMP, so in the run_prokka rule I used the genus-level Prokka database. If you are studying a different organism, please build the Prokka genus database yourself and change the corresponding lines in the run_prokka rule.
snakemake --configfile config.yaml --cores 8 all_final
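For the coverage calculation mentioned in this step, the sketch below averages per-position depths for each contig. It assumes three tab-separated columns (contig, position, depth), which is the layout `samtools depth` prints for a single BAM file. Whether assembly.rules uses that tool is not stated here, and `mean_depth_per_contig` is a hypothetical helper.

```python
from collections import defaultdict


def mean_depth_per_contig(depth_lines):
    """Average the depth column per contig from 'contig<TAB>pos<TAB>depth' lines."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for line in depth_lines:
        if not line.strip():
            continue  # skip blank lines
        contig, _pos, depth = line.rstrip("\n").split("\t")
        totals[contig] += int(depth)
        counts[contig] += 1
    return {contig: totals[contig] / counts[contig] for contig in totals}
```

Note that `samtools depth` omits zero-coverage positions unless it is run with `-a`, so a mean computed this way covers only the reported positions.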
  1. Assembly assessment and comparison
  • Metrics description

    • Misjoins: locations where two adjacent sequences in the assembly should be split apart and placed at distinct locations in order to match the reference.

    • Relocation: a misjoin where a segment needs to be moved elsewhere on the chromosome.

    • Misassemblies: QUAST categorizes misassemblies as either local (less than 1 kbp discrepancy) or extensive (more than 1 kbp discrepancy).
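The local/extensive split described above comes down to a single threshold. The sketch below is a hypothetical helper, not QUAST code, and it assumes a discrepancy of exactly 1 kbp counts as extensive, since the description above leaves that boundary case open.

```python
def classify_misassembly(discrepancy_bp, threshold_bp=1000):
    """Label a misassembly by the size of its discrepancy in base pairs.

    Assumption: a discrepancy exactly at the threshold is called extensive.
    """
    return "local" if abs(discrepancy_bp) < threshold_bp else "extensive"
```

For instance, a 250 bp discrepancy would be labeled local and a 15 kbp discrepancy extensive.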

  • A good reference guide for interpreting the dot plot is available here.

  • Some good tutorials 😳

  • Wish you knew sooner 😔

    • Minimap2 and the future of BWA, on Heng Li's blog.
    • Long reads assembly: indels cause interrupted genes, on Mick Watson's blog. I also have an example of this issue in demo_interrupted_genes.
    • This paper discusses the commonly incorrect use of the BLAST max_target_seqs parameter.
  • Two optional features provided by Nanoflow:

    1. assess draft genomes using QUAST
    snakemake --configfile config.yaml _all_quast --use-conda
    1. IGV: short/long reads mapped to draft assembly
    snakemake --configfile config.yaml _all_igv
  1. To generate the bioinformatics report, refer to bioinfo_report.Rmd. An example output is shown in bioinfo_report.pdf.