Assembly pipeline for Arabidopsis genomes from nanopore sequencing, written in Nextflow. Should also work for other species.
- Extract all fastq.gz files in the readpath folder into a single fastq file. By default this is skipped, enable with `--collect`.
- Barcodes and adaptors will be removed using `porechop`. By default this is skipped, enable with `--porechop`.
- Read QC is done via `nanoq`.
- k-mer based assessment of the reads via `Jellyfish` and `genomescope`.
- Assemblies are performed with `flye`.
- Polishing is done using medaka, and scaffolding via `LINKS`, `longstitch` and / or `ragtag`.
- Optional short-read polishing can be done using `pilon`. By default this is not done, enable with `--polish_pilon`; this requires a different samplesheet with short reads.
- Annotations are lifted from the reference using `liftoff`.
- Quality of each stage is assessed using `QUAST` and `BUSCO` (standalone).
See also schema.md
| Parameter | Effect |
| --- | --- |
| `--samplesheet` | Path to samplesheet |
| `--collect` | Are the provided reads a folder (`true`) or a single fastq file (default: `false`)? |
| `--use_ref` | Use a reference genome? (default: `true`) |
| `--porechop` | Run `porechop`? (default: `false`) |
| `--kmer_length` | k-mer size for `Jellyfish` (default: 21) |
| `--read_length` | Read length for `genomescope`. If this is `null` (default), the median read length estimated by `nanoq` will be used. If this is not `null`, the given value will be used for all samples. |
| `--flye_mode` | The mode to be used by `flye`; default: `"--nano-hq"` |
| `--genome_size` | Expected genome size for `flye`. If this is `null` (default), the haploid genome size for each sample will be estimated via `genomescope`. If this is not `null`, the given value will be used for all samples. |
| `--flye_args` | Arguments to be passed to `flye`, default: none. Example: `--flye_args '--genome-size 130g --asm-coverage 50'` |
| `--polish_medaka` | Polish using `medaka`, default: `true` |
| `--medaka_model` | Model used by `medaka`, default: `'r1041_e82_400bps_hac@v4.2.0:consesus'` |
| `--polish_pilon` | Polish with short reads using `pilon`? Default: `false` |
| `--busco_db` | Path to local BUSCO db; default: `/dss/dsslegfs01/pn73so/pn73so-dss-0000/becker_common/software/busco_db` |
| `--busco_lineage` | BUSCO lineage to use; default: `brassicales_odb10` |
| `--scaffold_ragtag` | Scaffolding with `ragtag`? Default: `false` |
| `--scaffold_links` | Scaffolding with `LINKS`? Default: `false` |
| `--scaffold_longstitch` | Scaffolding with `longstitch`? Default: `false` |
| `--lift_annotations` | Lift annotations from reference using `liftoff`? Default: `true` |
| `--skip_flye` | Skip assembly with `flye`? Requires a different samplesheet (!); default: `false` |
| `--skip_alignments` | Skip alignments with `minimap2`? Requires a different samplesheet (!); default: `false` |
| `--out` | Results directory, default: `'./results'` |
graph TD
fastq[Reads fastq] --> porechop("porechop")
porechop --> clean_reads(clean reads)
fastq -. skip porechop .-> clean_reads
clean_reads --> Readqc
subgraph k-mers
direction TB
jellyfish --> genomescope
end
subgraph Readqc[Read QC]
nanoq
end
clean_reads --> k-mers
nanoq -. median read length .-> jellyfish
clean_reads --> Assembly
subgraph Assembly
direction TB
assembler[Flye]
assembler --> asqc(QC: BUSCO & QUAST)
assembler --> asliftoff(Annotation:Liftoff)
end
genomescope -. estimated genome size .-> Assembly
subgraph Polish
direction LR
subgraph Medaka
medaka[medaka]
medaka --> meliftoff(Annotation:Liftoff)
medaka --> meqc(QC: BUSCO & QUAST)
end
subgraph Pilon
pilon[pilon]
pilon --> piliftoff(Annotation:Liftoff)
pilon --> piqc(QC: BUSCO & QUAST)
end
Medaka -.-> Pilon
end
Assembly --> Polish
subgraph Scaffold
direction TB
Longstitch
Links
RagTag
end
subgraph Longstitch
direction TB
longstitch[Longstitch] --> lsliftoff(Annotation:Liftoff)
longstitch --> lsQC(QC: BUSCO & QUAST)
end
subgraph Links
direction TB
links[Links] --> liliftoff(Annotation:Liftoff)
links --> liQC(QC: BUSCO & QUAST)
end
subgraph RagTag
direction TB
ragtag[RagTag] --> raliftoff(Annotation:Liftoff)
ragtag --> raQC(QC: BUSCO & QUAST)
end
Assembly -. skip polishing .-> Scaffold
Polish --> Scaffold
Clone this repo:
git clone https://github.com/nschan/nf-arassembly/
The standard pipeline assumes nanopore reads (R10.4.1).
The samplesheet must adhere to this format, including the header row. Please note the absence of spaces after the commas:
sample,readpath,ref_fasta,ref_gff
sampleName,path/to/reads,path/to/reference.fasta,path/to/reference.gff
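A minimal sketch of writing such a samplesheet from the shell and checking it before launching the pipeline. The file name `samplesheet.csv` and the paths are placeholders, and the two checks are only an illustration, they are not part of the pipeline:

```shell
# Write a samplesheet with the expected header and one sample (placeholder paths).
cat > samplesheet.csv <<'EOF'
sample,readpath,ref_fasta,ref_gff
sampleName,path/to/reads,path/to/reference.fasta,path/to/reference.gff
EOF

# Check 1: the header row matches the expected column names exactly.
expected='sample,readpath,ref_fasta,ref_gff'
if [ "$(head -n 1 samplesheet.csv)" = "$expected" ]; then
  header_ok=yes
else
  header_ok=no
fi

# Check 2: count lines that contain a space directly after a comma.
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
space_lines=$(grep -c ', ' samplesheet.csv || true)

echo "header ok: $header_ok, lines with a space after a comma: $space_lines"
```

Running a check like this first can catch a malformed header or stray spaces before any compute time is spent.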
To run the pipeline with a samplesheet on biohpc_gen:
nextflow run nf-arassembly --samplesheet 'path/to/sample_sheet.csv' \
--out './results' \
-profile charliecloud,biohpc_gen
If no reference genome is available, use `--use_ref false` to disable the reference genome.
Liftoff should not be used without a reference, and QUAST will no longer compare to the reference.
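A reference-free run could look like the sketch below. It assumes that `--lift_annotations` accepts an explicit `false` in the same way as `--use_ref`; the samplesheet path and profile are copied from the example above. The command is only assembled and printed so it can be reviewed before launching:

```shell
# Assumption: --lift_annotations takes an explicit "false", like --use_ref.
# Backslash-newline inside double quotes is a line continuation.
cmd="nextflow run nf-arassembly \
  --samplesheet 'path/to/sample_sheet.csv' \
  --use_ref false \
  --lift_annotations false \
  --out './results' \
  -profile charliecloud,biohpc_gen"

# Print the command for review; run the printed line to launch the pipeline.
echo "$cmd"
```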
When PacBio reads are used, change the flye mode and skip medaka:
--flye_mode '--pacbio-raw' --polish_medaka false
or, if HiFi reads are used:
--flye_mode '--pacbio-hifi' --polish_medaka false
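The choice between these modes can be sketched as a small shell helper. The variable `read_type` and its three values are hypothetical names introduced for illustration; the flag values are the ones listed above and in the parameter table:

```shell
# read_type is a hypothetical helper variable: nanopore, pacbio or hifi.
read_type="hifi"

case "$read_type" in
  nanopore) flye_mode="--nano-hq";     polish_medaka=true  ;;  # pipeline defaults
  pacbio)   flye_mode="--pacbio-raw";  polish_medaka=false ;;
  hifi)     flye_mode="--pacbio-hifi"; polish_medaka=false ;;
  *)        echo "unknown read type: $read_type" >&2
            flye_mode=""; polish_medaka="" ;;
esac

# Print the options to append to the nextflow command line.
echo "--flye_mode '$flye_mode' --polish_medaka $polish_medaka"
```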
If you already have an assembly and would only like to check it with QUAST and polish it, use:
--skip_flye true
This mode requires a different samplesheet:
sample,readpath,assembly,ref_fasta,ref_gff
sampleName,path/to/reads,assembly.fasta.gz,reference.fasta,reference.gff
When skipping flye the original reads will be mapped to the assembly and the reference genome.
If you have an assembly and have already mapped your reads to both the assembly and the reference genome, you can use:
--skip_flye true --skip_alignments true
This mode requires a different samplesheet:
sample,readpath,assembly,ref_fasta,ref_gff,assembly_bam,assembly_bai,ref_bam
sampleName,reads,assembly.fasta.gz,reference.fasta,reference.gff,reads_on_assembly.bam,reads_on_assembly.bai,reads_on_reference.bam
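Because each mode expects a different set of columns, it can help to confirm that every row has as many fields as the header. The following is a sketch with `awk`; the file name `samplesheet_skip.csv` is a placeholder and the check is not part of the pipeline:

```shell
# Example samplesheet for --skip_flye true --skip_alignments true.
cat > samplesheet_skip.csv <<'EOF'
sample,readpath,assembly,ref_fasta,ref_gff,assembly_bam,assembly_bai,ref_bam
sampleName,reads,assembly.fasta.gz,reference.fasta,reference.gff,reads_on_assembly.bam,reads_on_assembly.bai,reads_on_reference.bam
EOF

# Number of columns in the header row.
n_cols=$(head -n 1 samplesheet_skip.csv | awk -F',' '{ print NF }')

# Count data rows whose number of comma-separated fields differs from the header.
bad_rows=$(awk -F',' 'NR == 1 { n = NF; next } NF != n { bad++ } END { print bad + 0 }' samplesheet_skip.csv)

echo "columns: $n_cols, rows with a different field count: $bad_rows"
```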
The assemblies can optionally be polished with available short reads using `pilon`:
--polish_pilon
This requires additional information in the samplesheet: `shortread_F`, `shortread_R` and `paired`:
sample,readpath,ref_fasta,ref_gff,shortread_F,shortread_R,paired
sampleName,reads,reference.fasta,reference.gff,short_F1.fastq,short_F2.fastq,true
If only single-end reads are available, `shortread_R` should be empty and `paired` should be `false`.
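As an illustration, a short-read samplesheet with one paired and one single-end sample could be written and checked as follows. The sample and file names are placeholders. The rule that `paired` set to `true` needs a non-empty `shortread_R` is an assumption inferred from the description above, not something the pipeline is documented to enforce:

```shell
# One paired-end and one single-end sample; note the empty shortread_R field.
cat > samplesheet_pilon.csv <<'EOF'
sample,readpath,ref_fasta,ref_gff,shortread_F,shortread_R,paired
pairedSample,reads,reference.fasta,reference.gff,short_F1.fastq,short_F2.fastq,true
singleSample,reads,reference.fasta,reference.gff,short_single.fastq,,false
EOF

# Flag rows where "paired" (field 7) and "shortread_R" (field 6) disagree:
# paired=true with an empty field 6, or paired=false with a non-empty field 6.
inconsistent=$(awk -F',' '
  NR > 1 && (($7 == "true" && $6 == "") || ($7 == "false" && $6 != "")) { bad++ }
  END { print bad + 0 }' samplesheet_pilon.csv)

echo "inconsistent rows: $inconsistent"
```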
`LINKS`, `longstitch` and / or `ragtag` can be used for scaffolding.
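Each scaffolder is switched on with its own parameter from the table above (`--scaffold_links`, `--scaffold_longstitch`, `--scaffold_ragtag`). A small sketch of building those options from a list of tools; the `scaffolders` variable and its contents are hypothetical:

```shell
# Hypothetical selection of scaffolders to enable (lowercase tool names).
scaffolders="ragtag longstitch"

flags=""
for s in $scaffolders; do
  case "$s" in
    links|longstitch|ragtag) flags="$flags --scaffold_$s true" ;;
    *) echo "unknown scaffolder: $s" >&2 ;;
  esac
done
flags="${flags# }"   # drop the leading space

# Options to append to the nextflow command line.
echo "$flags"
```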
If `lift_annotations` is used (default), the annotations from the reference genome will be mapped to assemblies and scaffolds using liftoff.
This will happen at each step of the pipeline where a new genome fasta is created, i.e. after assembly, after polishing and after scaffolding.
`QUAST` will run with the following additional parameters:
--eukaryote \
--glimmer \
--conserved-genes-finding