Skip to content

Latest commit

 

History

History

test

Test dataset

This directory contains data from simulated hybrid genome with 5% divergence between parentals and 40% of loss of heterozygosity (LOH).

To run the test example, execute:

./redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1 

Table of contents

Generation of test set

Below, you can find detailed description for regeneration of test dataset with any input genome.

Simulations

Test dataset is based on 100 Kb region from C. parapsilosis CDC317 [HE605202:100000-200000].

samtools faidx CANPA.fa HE605202:100000-200000 > ref.fa

Heterozygous genome

Heterozygous regions, 5% divergence and 40% LOH, have been simulated with fasta2diverged.py.

fasta2diverged.py -v -i ref.fa --loh 0.40 --dna -d 0.05 > ref.d05.l40.fa

Short reads

Two shotgun sequencing libraries including typical Illumina-related errors were simulated using GemSIM:

  • 100 bp paired-end reads with 600 bp insert [200X coverage]:
  • 600_1.fq.gz
  • 600_2.fq.gz
  • 50 bp mate-pair reads with 5 kb insert [20X coverage]:
  • 5000_1.fq.gz
  • 5000_2.fq.gz
# generate SNPs
~/src/GemSIM_v1.6/GemHaps.py -r ref.fa -g '.50,0 .50,6'

# 2x 50,000 reads
## 100 bp paired-end with 600 bp insert +/- 50 bp
~/src/GemSIM_v1.6/GemReads.py -r ref.fa -n 50000 -g ref.txt -l 100 -m /home/lpryszcz/src/GemSIM_v1.6/models/ill100v5_p.gzip -q 33 -o 600 -u 600 -s 50 -p
~/src/GemSIM_v1.6/GemReads.py -r ref.d05.l40.fa -n 50000 -g ref.txt -l 100 -m /home/lpryszcz/src/GemSIM_v1.6/models/ill100v5_p.gzip -q 33 -o 600.d05.l40 -u 600 -s 50 -p

# 2x 10,000 reads
## 50 bp mate-pairs with 5 kb insert +/- 500 bp
~/src/GemSIM_v1.6/GemReads.py -r ref.fa -n 10000 -g ref.txt -l 50 -m /home/lpryszcz/src/GemSIM_v1.6/models/ill100v5_p.gzip -q 33 -o 5000 -u 5000 -s 500 -p
~/src/GemSIM_v1.6/GemReads.py -r ref.d05.l40.fa -n 10000 -g ref.txt -l 50 -m /home/lpryszcz/src/GemSIM_v1.6/models/ill100v5_p.gzip -q 33 -o 5000.d05.l40 -u 5000 -s 500 -p

# combine both libraries in random order
## 600_
paste <(zcat 600_1.fastq.gz 600.d05.l40_1.fastq.gz) <(zcat 600_2.fastq.gz 600.d05.l40_2.fastq.gz) | paste - - - - | shuf | awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 | "gzip > random.600.d05.l40_1.fq.gz"; print $2,$4,$6,$8 | "gzip > random.600.d05.l40_2.fq.gz"}'
## 5000_
paste <(zcat 5000_1.fastq.gz 600.d05.l40_1.fastq.gz) <(zcat 5000_2.fastq.gz 5000.d05.l40_2.fastq.gz) | paste - - - - | shuf | awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 | "gzip > random.5000.d05.l40_1.fq.gz"; print $2,$4,$6,$8 | "gzip > random.5000.d05.l40_2.fq.gz"}'

# symlinks with handier names
for f in *.fq.gz; do ln -s $f `echo $f | sed 's/random.//;s/.d05.l40//'`; done

# mate-pairs contaminated with paired-end reads
#for i in 1 2; do zcat <(zcat test/5000_$i.fq.gz test/5000_$i.fq.gz | fastx2reverse_complement.py fastq | gzip) test/600_$i.fq.gz > test/mixed.$i.fq.gz; done

De novo genome assembly

Simulated reads were assembled with SPAdes v.3.5.0.

dipspades.py --only-assembler -t 4 -o 600 -1 600_1.fq.gz -2 600_2.fq.gz

Redundans pipeline

Finally, heterozygous genome assembly pipeline can be applied to simulated heterozygous contigs and paired/mate-pairs reads.

./redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1 

Run statistics

De novo assembly

This is optional step performed only if no contigs (-f) are given. Redundans uses Platanus for contigs assembly, scaffolding and gap closing. In first order, Redundans uses libraries with reads between 70-150 bp. If no such libraries are available, libraries with other read lenghts are used, but only libraries with cummulative sequence length of at least 2/3 of the largest library will be used. Note, assembly is be performed in single-end mode and only resulting contigs (not scaffolds) are going to be used.
Subsequently, library statistics are estimated and if FR paired-end libraries are detected, Redundans will proceed with scaffolding and gap closing of de novo contigs using Platanus. This shouldn't be mixed with Redundans scaffolding and gap closing modules - those are using different software and set of parameters.

Parameters estimation

At the beginning, Redundans estimates number of parameters:

  • number of reads that it's going to align (based on -l/--limit parameter)
 Aligning 32779 mates per library...
  • library statistics:
  • insert size: median, mean, stdev
  • mates orientation: FF, FR, RF, RR
Table 1: Library statistics
FastQ files read length median mean stdev FF FR RF RR
test/5000_1.fq.gz test/5000_2.fq.gz 50 4998 4990.20 721.47 0 4,674 0 0
test/600_1.fq.gz test/600_2.fq.gz 100 599 598.63 47.68 0 10,000 0 0

Above, you can see, there are two libraries:

  • test/5000_1.fq.gz & test/5000_2.fq.gz:
    • FR orientation
    • mate-pairs with 5000 bp insert size +/- 600 bp
  • test/600_1.fq.gz & test/600_2.fq.gz:
    • FR orientation
    • paired-end with 600 bp insert size +/- 39 bp

Based on these statistics, the program will perform scaffolding and gap closing starting from libraries with the smallest insert size.
Also at this stage, the user is notified about libraries with poor statistics i.e. large stdev or not consisted orientation. For these cases, the statistics are updated before every scaffolding iteration. This is done, because for libraries with large insert size, the estimation of insert size may be highly affected by the fragmentation of the assembly.

Reduction

In the first step, Redundans will recognise and selectively remove heterozygous contigs from initial assembly. The details about all removed heterozygous contigs are stored in tab-delimited file contigs.reduced.fa.hetero.tsv. Additionally, the histogram with identities of reduced contigs are saved as contigs.reduced.fa.hist.png.

At the end of this step, the user will be notified about:

  • initial assembly size and fragmentation
  • what portion of the the initial assembly have been recognised as heterozygous
  • identity between heterozygous regions
  • size and fragmentation of reduced assembly
Table 2: Reduction statistics
file name genome size contigs heterozygous size [%] heterozygous contigs [%] identity [%] possible joins homozygous size [%] homozygous contigs [%]
test/run6/contigs.fa 163897 245 76363 46.59 222 90.61 94.837 0 97319 59.38 23 9.39

Above, you can see that:

  • initial SPAdes assembly was fragmented (245 contigs) and larger (163.8 Kb) than reference sequence (100 Kb)
  • Redundans:
  • removed 220 heterozygous contigs, resulting in reduced assembly of 97.8 Kb (very similar to reference) in 28 contigs
  • estimated 94.837% identity between heterozygous regions, which is very similar to the divergence that was used for simulations (5%)
Scaffolding

Redundans repeat scaffolding on each library several times (--iters). For each scaffolding iteration, the program reports number of:

  • processed pairs
  • pairs passing alignment quality
  • pairs that aligned in different contigs
 iteration 1.1 ...
   19572 pairs. 18505 passed filtering [94.55%]. 1993 in different contigs [10.18%].
 iteration 1.2 ...
   19572 pairs. 18569 passed filtering [94.88%]. 340 in different contigs [1.74%].
 iteration 2.1 ...
   19572 pairs. 17009 passed filtering [86.90%]. 755 in different contigs [3.86%].
  closing gaps ...
 iteration 2.2 ...
   19572 pairs. 17261 passed filtering [88.19%]. 0 in different contigs [0.00%].
  closing gaps ...

Above, you can see that in iteration 1.1 above 10% of reads aligned in different contigs, while in subsequent iteration, less than 2% reads. This informs that in general it's enough to use 1-2 iterations of scaffolding per library.

Gap closing

For the time-being, only the iteration number is reported.

 iteration 1.1 ...
 iteration 1.2 ...
 iteration 2.1 ...
 iteration 2.2 ...
Summary statistics

At the end, the program reports details of each step:

  • file name
  • fragmentation and size of the assembly at this stage
  • %GC content
  • fragmentation and size of assembly in contigs >1 Kb
  • N50 & N90
  • cumulative size of gaps
  • the size of the longest contig
fname contigs bases GC [%] contigs >1kb bases in contigs >1kb N50 N90 Ns longest
test/run6/contigs.fa 245 163897 40.298 24 117391 3975 233 0 29603
test/run6/contigs.reduced.fa 23 97319 39.333 17 94157 7321 2195 0 29603
test/run6/_sspace.1.1.fa 4 97242 39.315 3 96924 87727 87727 473 87727
test/run6/_sspace.1.2.fa 3 97321 39.315 2 97003 87727 87727 552 87727
test/run6/_sspace.2.1.fa 1 100280 39.315 1 100280 100280 100280 3511 100280
test/run6/_sspace.2.1.filled.fa 1 100251 39.363 1 100251 100251 100251 2782 100251
test/run6/_sspace.2.2.fa 1 100251 39.363 1 100251 100251 100251 2782 100251
test/run6/_sspace.2.2.filled.fa 1 100123 39.360 1 100123 100123 100123 2677 100123
test/run6/scaffolds.fa 1 100123 39.360 1 100123 100123 100123 2677 100123
test/run6/_gapcloser.1.1.fa 1 100495 39.542 1 100495 100495 100495 1265 100495
test/run6/_gapcloser.1.2.fa 1 100498 39.579 1 100498 100498 100498 975 100498
test/run6/_gapcloser.2.1.fa 1 100502 39.679 1 100502 100502 100502 4 100502
test/run6/_gapcloser.2.2.fa 1 100506 39.679 1 100506 100506 100506 4 100506
test/run6/scaffolds.filled.fa 1 100506 39.679 1 100506 100506 100506 4 100506

Accuracy estimation

Accuracy of recovered contigs can be assessed by alignment of final scaffolds (run1/scaffolds.filled.fa) back onto reference (ref.fa).

NUCMER

This can be quickly achieved with nucmer (works for small genomes):

cd run1
# align
nucmer -p nucmer.ref ../ref.fa scaffolds.filled.fa
# generate pairwise plot
mummerplot --png --large nucmer.ref.delta -p nucmer.ref.plot
# open plot
eog nucmer.ref.plot.png

Figure 1: Pairwise alignment of reference sequence and redundans scaffolds

You may want to compare final redundans scaffolds (run1/scaffolds.filled.fa) with initial SPAdes contigs (contigs.fa).

cd run1
# align
nucmer -p nucmer.contigs scaffolds.filled.fa ../contigs.fa
# generate pairwise plot
mummerplot --png --large nucmer.contigs.delta -p nucmer.contigs.plot
# open plot
eog nucmer.contigs.plot.png

Figure 2: Pairwise alignment of redundans scaffolds and SPAdes contigs

Finally, comparison of consensus contigs from dipSPAdes (consensus_contigs.fa) and reference sequnece (ref.fa) informs about missing regions in dipSPAdes reconstruction.

# align
nucmer -p nucmer.ref_dipspades ref.fa consensus_contigs.fa
# generate pairwise plot
mummerplot --png --large nucmer.ref_dipspades.delta -p nucmer.ref_dipspades.plot
# open plot
eog nucmer.ref_dipspades.plot.png

Figure 3: Pairwise alignment of reference sequence and dipSPAdes consensus contigs

LAST

The same can be done using LAST (this will work also for large genomes):

# index reference genome (need to be done only once per each reference)
lastdb ../ref.fa ../ref.fa

# align & generate dotplot on the fly
lastal -f TAB ../ref.fa scaffolds.filled.fa | last-dotplot - scaffolds.filled.fa.png

# open plot
eog scaffolds.filled.fa.png

Figure 4: Pairwise alignment of reference sequence and Redundans contigs

All programs/scripts not disclosed in this repository (i.e. fasta2diverged.py), can be found in https://github.com/lpryszcz/bin.