Nextflow pipeline to assemble genomes from nanopore reads.

nf-arassembly

Assembly pipeline for Arabidopsis genomes from nanopore sequencing, written in Nextflow. It should also work for other species.

Procedure

  • Extract all fastq.gz files in the readpath folder into a single fastq file. By default this is skipped; enable with --collect.
  • Remove barcodes and adaptors using porechop. By default this is skipped; enable with --porechop.
  • Read QC is done via nanoq.
  • k-mer based assessment of the reads is done via Jellyfish and genomescope.
  • Assemblies are performed with flye.
  • Polishing is done using medaka, and scaffolding via LINKS, longstitch and / or ragtag.
  • Optional short-read polishing can be done using pilon. By default this is skipped; enable with --polish_pilon. This requires a different samplesheet that includes the short reads.
  • Annotations are lifted from the reference using liftoff.
  • Quality of each stage is assessed using QUAST and BUSCO (standalone).

Parameters

See also schema.md

| Parameter | Effect |
| --- | --- |
| --samplesheet | Path to the samplesheet |
| --collect | Are the provided reads a folder (true) or a single fq file (false)? Default: false |
| --use_ref | Use a reference genome? Default: true |
| --porechop | Run porechop? Default: false |
| --kmer_length | k-mer size for Jellyfish. Default: 21 |
| --read_length | Read length for genomescope. If this is null (default), the median read length estimated by nanoq is used. If it is not null, the given value is used for all samples. |
| --flye_mode | The mode to be used by flye. Default: "--nano-hq" |
| --genome_size | Expected genome size for flye. If this is null (default), the haploid genome size of each sample is estimated via genomescope. If it is not null, the given value is used for all samples. |
| --flye_args | Arguments to be passed to flye. Default: none. Example: --flye_args '--genome-size 130g --asm-coverage 50' |
| --polish_medaka | Polish using medaka? Default: true |
| --medaka_model | Model used by medaka. Default: 'r1041_e82_400bps_hac@v4.2.0:consesus' |
| --polish_pilon | Polish with short reads using pilon? Default: false |
| --busco_db | Path to a local BUSCO db. Default: /dss/dsslegfs01/pn73so/pn73so-dss-0000/becker_common/software/busco_db |
| --busco_lineage | BUSCO lineage to use. Default: brassicales_odb10 |
| --scaffold_ragtag | Scaffold with ragtag? Default: false |
| --scaffold_links | Scaffold with LINKS? Default: false |
| --scaffold_longstitch | Scaffold with longstitch? Default: false |
| --lift_annotations | Lift annotations from the reference using liftoff? Default: true |
| --skip_flye | Skip assembly with flye? Requires a different samplesheet (!). Default: false |
| --skip_alignments | Skip alignments with minimap2? Requires a different samplesheet (!). Default: false |
| --out | Results directory. Default: './results' |
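
When several parameters deviate from their defaults, it may be more convenient to keep them in a file than on the command line. The sketch below writes a YAML file and passes it via Nextflow's general -params-file option. It assumes the pipeline reads its parameters in the standard Nextflow way; the file name params.yaml and the chosen values are only illustrative.

```shell
#!/bin/sh
# Collect settings in a YAML file. Keys are the parameter names from the
# table above without the leading dashes; paths are placeholders.
cat > params.yaml <<'EOF'
samplesheet: "path/to/sample_sheet.csv"
porechop: true
kmer_length: 21
flye_mode: "--nano-hq"
busco_lineage: "brassicales_odb10"
scaffold_ragtag: true
out: "./results"
EOF

# Launch with (not executed in this sketch):
#   nextflow run nf-arassembly -params-file params.yaml -profile charliecloud,biohpc_gen
echo "wrote params.yaml"
```

Parameters given on the command line take precedence over those in the params file, so single values can still be overridden per run.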

Graph

graph TD
  fastq[Reads fastq] --> porechop("porechop")
  porechop --> clean_reads(clean reads)
  fastq -. skip porechop .-> clean_reads
  clean_reads --> Readqc
  subgraph k-mers
  direction TB
  jellyfish --> genomescope
  end
  subgraph Readqc[Read QC]
  nanoq
  end
  clean_reads --> k-mers
  nanoq -. median read length .-> jellyfish
  clean_reads --> Assembly
  subgraph Assembly
  direction TB
  assembler[Flye]
  assembler --> asqc(QC: BUSCO & QUAST)
  assembler --> asliftoff(Annotation:Liftoff)
  end
  genomescope -. estimated genome size .-> Assembly
  subgraph Polish
  direction LR
  subgraph Medaka
  medaka[medaka] 
  medaka --> meliftoff(Annotation:Liftoff)
  medaka --> meqc(QC: BUSCO & QUAST)
  end
  subgraph Pilon
  pilon[pilon] 
  pilon --> piliftoff(Annotation:Liftoff)
  pilon --> piqc(QC: BUSCO & QUAST)
  end
  Medaka -.-> Pilon
  end
  Assembly --> Polish
  subgraph Scaffold
  direction TB
  Longstitch
  Links
  RagTag
  end
  subgraph Longstitch
  direction TB
  longstitch[Longstitch] --> lsliftoff(Annotation:Liftoff)
  longstitch --> lsQC(QC: BUSCO & QUAST)
  end
  subgraph Links
  direction TB
  links[Links] --> liliftoff(Annotation:Liftoff)
  links --> liQC(QC: BUSCO & QUAST)
  end
  subgraph RagTag
  direction TB
  ragtag[RagTag] --> raliftoff(Annotation:Liftoff)
  ragtag --> raQC(QC: BUSCO & QUAST)
  end
  Assembly -. skip polishing .-> Scaffold
  Polish --> Scaffold

Tubemap

[Figure: tubemap of the pipeline]

Usage

Clone this repo:

git clone https://github.com/nschan/nf-arassembly/

Standard Pipeline

The standard pipeline assumes nanopore reads (10.14).

The samplesheet must adhere to this format, including the header row. Please note the absence of spaces after the commas:

sample,readpath,ref_fasta,ref_gff
sampleName,path/to/reads,path/to/reference.fasta,path/to/reference.gff
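
Since the format is strict, it can help to check a samplesheet before launching. The following is a minimal sketch, not part of the pipeline: it writes the example sheet from above and uses awk to verify the header, the number of fields and the absence of whitespace after commas. The function name check_samplesheet and the file name samplesheet.csv are made up for this example.

```shell
#!/bin/sh
# Write a minimal samplesheet (placeholder paths from the example above).
cat > samplesheet.csv <<'EOF'
sample,readpath,ref_fasta,ref_gff
sampleName,path/to/reads,path/to/reference.fasta,path/to/reference.gff
EOF

# Check: exact header, four fields per line, no whitespace after a comma.
# Returns non-zero if any problem was found.
check_samplesheet() {
  awk -F',' '
    NR == 1 && $0 != "sample,readpath,ref_fasta,ref_gff" { print "line 1: unexpected header"; bad = 1 }
    NF != 4  { print "line " NR ": expected 4 fields, found " NF; bad = 1 }
    /,[ \t]/ { print "line " NR ": whitespace after a comma"; bad = 1 }
    END      { exit bad }
  ' "$1"
}

if check_samplesheet samplesheet.csv; then
  echo "samplesheet.csv looks fine"
fi
```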

To run the pipeline with a samplesheet on biohpc_gen:

nextflow run nf-arassembly --samplesheet 'path/to/sample_sheet.csv' \
                           --out './results' \
                           -profile charliecloud,biohpc_gen

No reference genome

If no reference genome is available, use --use_ref false to disable the reference genome. Liftoff should not be used without a reference, and QUAST will no longer compare against a reference.

Usage with PacBio reads

When PacBio reads are used, change the flye mode and skip medaka polishing:

--flye_mode '--pacbio-raw' --polish_medaka false

or, if HiFi reads are used:

--flye_mode '--pacbio-hifi' --polish_medaka false
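
The choice of options per read type can be wrapped in a small helper. This is only a sketch: READ_TYPE and flye_options are hypothetical names introduced here, not pipeline parameters, and the command is printed for review rather than executed.

```shell
#!/bin/sh
# READ_TYPE is a variable of this sketch: nanopore, pacbio or hifi.
READ_TYPE="${READ_TYPE:-hifi}"

# Map a read type to the pipeline options described above.
flye_options() {
  case "$1" in
    nanopore) printf '%s\n' "--flye_mode '--nano-hq'" ;;
    pacbio)   printf '%s\n' "--flye_mode '--pacbio-raw' --polish_medaka false" ;;
    hifi)     printf '%s\n' "--flye_mode '--pacbio-hifi' --polish_medaka false" ;;
    *)        echo "unknown read type: $1" >&2; return 1 ;;
  esac
}

# Print the full command instead of running it.
opts=$(flye_options "$READ_TYPE") || exit 1
echo "nextflow run nf-arassembly --samplesheet path/to/sample_sheet.csv $opts"
```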

Skipping Flye

If you already have an assembly and would only like to check it with QUAST and polish it, use --skip_flye true.

This mode requires a different samplesheet:

sample,readpath,assembly,ref_fasta,ref_gff
sampleName,path/to/reads,assembly.fasta.gz,reference.fasta,reference.gff

When skipping flye the original reads will be mapped to the assembly and the reference genome.
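
An existing standard samplesheet can be converted to this layout by inserting the assembly column. The sketch below does this with awk; the location assemblies/<sample>.fasta.gz is a hypothetical naming scheme, so adjust it to wherever your assemblies actually live.

```shell
#!/bin/sh
# Standard samplesheet (placeholder paths).
cat > samplesheet.csv <<'EOF'
sample,readpath,ref_fasta,ref_gff
sampleName,path/to/reads,reference.fasta,reference.gff
EOF

# Insert an "assembly" column after readpath.
# assemblies/<sample>.fasta.gz is an assumed naming scheme for this sketch.
awk -F',' -v OFS=',' '
  NR == 1 { print $1, $2, "assembly", $3, $4; next }
          { print $1, $2, "assemblies/" $1 ".fasta.gz", $3, $4 }
' samplesheet.csv > samplesheet_skip_flye.csv

cat samplesheet_skip_flye.csv
```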

Skipping Flye and mappings

If you already have an assembly and have already mapped your reads to both the assembly and the reference genome, use --skip_flye true --skip_alignments true.

This mode requires a different samplesheet:

sample,readpath,assembly,ref_fasta,ref_gff,assembly_bam,assembly_bai,ref_bam
sampleName,reads,assembly.fasta.gz,reference.fasta,reference.gff,reads_on_assembly.bam,reads_on_assembly.bai,reads_on_reference.bam

Polishing with pilon

Assemblies can optionally be polished with available short reads using pilon; enable this with --polish_pilon.

This requires additional information in the samplesheet: shortread_F, shortread_R and paired:

sample,readpath,ref_fasta,ref_gff,shortread_F,shortread_R,paired
sampleName,reads,reference.fasta,reference.gff,short_F1.fastq,short_F2.fastq,true

If only single-end reads are available, shortread_R should be empty and paired should be false.
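
The paired and single-end cases can be illustrated, and checked, with a short script. This is a sketch under assumptions: sample and file names are placeholders, check_pilon_sheet is not part of the pipeline, and treating anything other than true or false in the paired column as an error is a choice made for this example.

```shell
#!/bin/sh
# Samplesheet for --polish_pilon with one paired-end and one single-end sample.
cat > samplesheet_pilon.csv <<'EOF'
sample,readpath,ref_fasta,ref_gff,shortread_F,shortread_R,paired
sampleA,readsA,reference.fasta,reference.gff,A_R1.fastq,A_R2.fastq,true
sampleB,readsB,reference.fasta,reference.gff,B_R1.fastq,,false
EOF

# paired=true needs both read files; paired=false needs an empty shortread_R.
check_pilon_sheet() {
  awk -F',' '
    NR == 1 { next }
    NF != 7 { print "line " NR ": expected 7 fields"; bad = 1; next }
    $7 == "true"  && ($5 == "" || $6 == "") { print "line " NR ": paired sample lacks a read file"; bad = 1 }
    $7 == "false" && $6 != ""               { print "line " NR ": single-end sample has shortread_R"; bad = 1 }
    $7 != "true"  && $7 != "false"          { print "line " NR ": paired must be true or false"; bad = 1 }
    END { exit bad }
  ' "$1"
}

check_pilon_sheet samplesheet_pilon.csv && echo "samplesheet_pilon.csv looks fine"
```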

Scaffolding

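LINKS, longstitch and / or ragtag can be used for scaffolding. All three are disabled by default; enable them with --scaffold_links, --scaffold_longstitch and --scaffold_ragtag, respectively.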
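
As an illustration, a run that enables two of the scaffolders might look like the sketch below. The run helper is a hypothetical dry-run wrapper added for this example: it prints the command by default and only executes it when RUN=1 is set. The samplesheet path is a placeholder.

```shell
#!/bin/sh
# Print the command by default; set RUN=1 to execute it (requires nextflow).
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Enable ragtag and longstitch; LINKS stays at its default (false).
run nextflow run nf-arassembly \
  --samplesheet path/to/sample_sheet.csv \
  --scaffold_ragtag true \
  --scaffold_longstitch true \
  -profile charliecloud,biohpc_gen
```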

Using liftoff

If --lift_annotations is enabled (default), the annotations from the reference genome will be mapped to assemblies and scaffolds using liftoff. This happens at each step of the pipeline where a new genome fasta is created, i.e. after assembly, after polishing and after scaffolding.

QUAST

QUAST will run with the following additional parameters:

        --eukaryote
        --glimmer
        --conserved-genes-finding

Acknowledgements

This pipeline builds on modules developed by nf-core.