Introduction

This project includes simulated and 'golden standard' datasets, software, and scripts that we used to benchmark error correction tools in our study: Comprehensive benchmarking of error correction methods for next generation sequencing via unique molecular identifiers.

Golden standard data

We used simulated and experimentally obtained ‘golden standard’ data from human and Human Immunodeficiency Virus (HIV). We used three experimentally obtained datasets (T1, T2, T3) and three simulated datasets (S1, S2, S3) to benchmark error correction algorithms.

T1: reads derived from human genomic DNA data

T1a: Genome in a Bottle data

We use Genome in a Bottle (GIAB, http://jimb.stanford.edu/giab/) data for sample NA12878, which include high-quality variant calls.

The raw data are available here:

ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/

Related discussion: https://www.biostars.org/p/239598/

Paragraph from the paper

These data and other data sets for NA12878 are available at the Genome in a Bottle ftp site at NCBI (ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878) and are described on a spreadsheet at http://genomeinabottle.org/blog-entry/existing-and-future-na12878-datasets. In addition, the results of this work (high-confidence variant calls and BED files describing confident regions) are available at ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/variant_calls/NIST along with a README.NIST describing the files and how to use them.

We downloaded raw paired-end reads from http://www.ebi.ac.uk/ena/data/view/PRJNA162355. The ground truth (high-quality variant calls) was downloaded from ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh38/

T1b: 100G WGS

We use highly covered positions from whole-exome sequencing (WXS) to define highly confident calls, and use those calls to estimate the error rate in the corresponding samples obtained by WGS.

We downloaded the 30x downsampled NA12878 BAM file from https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/NA12878/alignment.index.NA12878_HiSeq_downsampled30X_GRCh37_10262015

This Python script was used to randomly split the BAM file into two files, one containing 90% of the reads and one containing 10%: https://gist.github.com/brianhill11/7aeeeb6d94edfb868e5595aac04a0dd6
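The idea behind such a random split can be sketched as follows. This is a minimal illustration, not the code of the linked gist: it works on plain read names rather than BAM records, and the function name, the `fraction` parameter, and the fixed seed are assumptions made for the example. Assigning by read name keeps both mates of a pair in the same partition.

```python
import random

def split_read_names(read_names, fraction=0.9, seed=0):
    """Randomly assign read names to a large and a small partition.

    Each distinct name goes to the large partition with probability
    `fraction`, so the split is approximately (not exactly) 90%/10%.
    """
    rng = random.Random(seed)              # fixed seed makes the split reproducible
    large, small = set(), set()
    for name in sorted(set(read_names)):   # sorted for a deterministic order
        (large if rng.random() < fraction else small).add(name)
    return large, small

names = [f"read{i}" for i in range(10000)]
large, small = split_read_names(names)
print(len(large), len(small))
```

With a real BAM file, the same decision would be made once per read name while iterating over the alignments, and each record would be written to the output file of its partition.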

S1a

We use a combination of real data and the clonality model to generate cancer data. We used a normal WGS sample and split the reads into two populations (for example, 90%/10% or 95%/5%). In each population, we introduced in-silico variants using https://github.com/adamewing/bamsurgeon with the -s option, which prevents conflicts with the real variants in the data.

The variants for each population were randomly generated at a rate of 1 SNP per 1000 bases. We do this 5 times.
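As a rough sketch of what generating variants at this density could look like, the snippet below draws distinct random positions so that a region receives one SNP per 1000 bases on average. The function name, the region length, and the reading of "5 times" as five independent replicates with different seeds are assumptions for illustration; converting the positions into the variant file that bamsurgeon expects is not shown.

```python
import random

def random_snp_positions(region_length, bases_per_snp=1000, seed=0):
    """Return sorted, distinct 1-based positions, one SNP per `bases_per_snp` bases."""
    rng = random.Random(seed)
    n_snps = region_length // bases_per_snp
    # sample() draws without replacement, so no position is used twice
    return sorted(rng.sample(range(1, region_length + 1), n_snps))

# Five independent replicates, each with its own seed (an assumption).
replicates = [random_snp_positions(1_000_000, seed=r) for r in range(5)]
for index, positions in enumerate(replicates):
    print(index, len(positions), positions[:3])
```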

Reads derived from human genomic DNA data (T2,S2)

We also used publicly available TCR-Seq data with attached 12 bp UMIs from 10 chronic HIV patients (SRP045430).

Data obtained from: Best, Katharine, et al. "Dynamic perturbations of the T-cell receptor repertoire in chronic HIV infection and following antiretroviral therapy." Frontiers in immunology 6 (2016): 644.

Reads derived from human genomic DNA data (in-house) (T3,S3)

We used the UMI-based high-fidelity sequencing protocol (also known as Safe-SeqS) to eliminate errors from the sequencing data. A full description of the high-fidelity sequencing protocol is provided in Mangul, Serghei, et al. "Accurate viral population assembly from ultra-deep sequencing data." Bioinformatics 30.12 (2014): i329-i337.

We used in-house sequencing data derived from a 3.4 kb region of Human Immunodeficiency Virus (HIV) spanning the Gag/Pol genes. The data consist of 107 million 2x100 bp reads with attached 13 bp UMIs. Applying the high-fidelity protocol resulted in 3.1 million reads, which were used to evaluate error correction algorithms (Golden Dataset 1: GD1).
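The general principle of UMI-based error elimination can be illustrated with a simplified sketch: reads sharing a UMI are grouped and collapsed into a per-position majority-vote consensus. This is not the protocol implementation described in Mangul et al.; the `min_reads` threshold, the assumption that the UMI is the read prefix, and the assumption that reads in a group have equal length are all made for the example.

```python
from collections import Counter, defaultdict

UMI_LENGTH = 13  # UMI length stated for this dataset

def umi_consensus(reads, min_reads=3):
    """Group reads by UMI prefix and return one majority-vote consensus per UMI.

    `min_reads` is a hypothetical threshold: groups with fewer reads are
    skipped because a vote over very few reads is not informative.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[read[:UMI_LENGTH]].append(read[UMI_LENGTH:])
    consensus = {}
    for umi, members in groups.items():
        if len(members) < min_reads:
            continue
        # zip(*members) walks the group column by column
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*members)
        )
    return consensus

reads = [
    "AAAAAAAAAAAAA" + "ACGTACGT",
    "AAAAAAAAAAAAA" + "ACGTACGT",
    "AAAAAAAAAAAAA" + "ACGAACGT",   # one sequencing error at the 4th base
    "CCCCCCCCCCCCC" + "TTTTTTTT",   # singleton UMI, skipped
]
print(umi_consensus(reads))  # {'AAAAAAAAAAAAA': 'ACGTACGT'}
```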

Information about the error correction tools included in the benchmarking study.

For each tool, we provide a shell script with installation commands. Installation scripts are available at

Table S1. Information about the error correction tools included in the benchmarking study.

Name	Version	Underlying algorithm	Data structure	Types of reads accepted (single-end (SE) or paired-end (PE))	Organism	Journal	Published year	Programming language	In the publication compared to	Tools webpage	Software Dependencies	Default k-mer size	Read trimming by default
BLESS	1.02	k-mer spectrum	Bloom filter and hash table	SE and PE	Human, E. coli, Staphylococcus aureus	Oxford Bioinformatics	2014	C++	SGA, QuorUM, Lighter, BFC, DecGPU, ECHO, HiTEC, Musket, Quake, Reptile	https://sourceforge.net/p/bless-ec/wiki/Home/	MPICH 3.1.3, OpenMPI 1.8.4, Boost library, Google sparsehash, klib, KMC, murmurhash3, zlib, pigz	N/A	YES
Fiona	0.2.8	Approx. MSA	Partial suffix array	SE	Human, Drosophila, bacteria, C. elegans	Bioinformatics	2014	C++	Allpaths-LG, Coral, H-Shrec, ECHO, HiTEC, Quake	https://github.com/seqan/seqan/tree/master/apps/fiona	N/A	N/A	NO
Pollux	1.0.2	k-mer spectrum	Hash table	SE, PE	Human, bacteria	BMC Bioinformatics	2015	C	Quake, SGA, BLESS, Musket, RACER	https://github.com/emarinier/pollux	64-bit Unix-based OS	31	YES
BFC	1.0	k-mer spectrum	Bloom filter and hash table	SE, PE (interleaved)	Human, C. elegans	Bioinformatics	2015	C	BLESS-v0p23 (Heo et al., 2014), Bloocoo-1.0.4 (Drezen et al., 2014), fermi2-r175 (Li, 2012), Lighter-20150123 (Song et al., 2014), Musket-1.1 (Liu et al., 2013) and SGA-0.9.13 (Simpson and Durbin, 2012)	https://github.com/lh3/bfc	N/A	Depends on input genome size	NO
Lighter	1.1.1	k-mer spectrum	Bloom filter	SE and PE	Human, E. coli, C. elegans	Genome Biology	2014	C++	Quake v0.3, Musket v1.1, Bless v0p17, Soapec v2.0.1	https://github.com/mourisl/Lighter	N/A	N/A	NO
Musket	1.1	k-mer spectrum	Bloom filter and hash table	SE and PE	Human, E. coli, C. elegans	Oxford Bioinformatics	2012	C++	SGA, Quake	http://musket.sourceforge.net/homepage.htm	N/A	21	NO
RACER	1.0.1	k-mer spectrum	Hash table	SE and PE	Human, bacteria, virus, C. elegans, Drosophila	Bioinformatics	2013	C++	Coral, HiTEC, Quake, Reptile, SHREC	https://github.com/lucian-ilie/RACER	OpenMP	N/A	NO
Reptile	1.1	k-mer spectrum	Hamming graph	Not specified	Human, Acinetobacter sp., E. coli	Bioinformatics	2010	C++	SHREC	http://aluru-sun.ece.iastate.edu/doku.php?id=reptile	Perl, GNU make, C++ compiler	24	NO
Quake	0.3	k-mer spectrum	Bit array index	SE or PE	Human, E. coli	Genome Biology	2010	C++, R	SOAPdenovo, EULER, SHREC	http://www.cbcb.umd.edu/software/quake	N/A	15	YES
SOAPdenovo2 Corrector	2.03	k-mer spectrum	Hash table	SE, PE	Human, PhiX174, Drosophila, Saccharomyces cerevisiae	GigaScience	2012	C/C++	SOAPdenovo1, ALLPATHS-LG	http://soap.genomics.org.cn/about.html	GCC 4.4.5 or later	N/A
ECHO	1.12	MSA*	Hash table	SE	Human	Genome Research	2012	Python	SA, SHREC	http://uc-echo.sourceforge.net/	GCC 4.1 or later, Python 2.6, numpy, scipy	1/6 (length of read)	YES
Coral	1.4.1	MSA*	Hash table	SE, PE (interleaved)	No Organism	Bioinformatics	2011	C	SHREC, Quake, Reptile	https://www.cs.helsinki.fi/u/lmsalmel/coral/	N/A	N/A	YES
RECKONER	1.1	k-mer spectrum	Hash table	Not specified	S. cerevisiae, C. elegans, M. acuminata	Bioinformatics	2017	C++	Ace 1.01, BFC-ht v1, BLESS 0.24, Blue 1.1.2, Karect 1.0, Lighter 1.0.4, Musket 1.1, Pollux 1.00, RACER 1.0.1, Trowel 0.1.4.3	https://github.com/refresh-bio/RECKONER	KMC3, KMC tools	N/A	NO
SGA	0.10.15	FM-index search	FM-index	SE and PE	C. elegans, E. coli, Human	Genome Research	2012	C++	Velvet, ABySS, SOAPdenovo, Quake, HiTEC	https://github.com/jts/sga	Google sparse hash library, bamtools, zlib, jemalloc (optional), pysam, ruffus	31	NO
ShoRAH	1.1.0	Clustering	Not specified	SE	RNA viruses	BMC Bioinformatics	2011	C++, Python, Perl	No comparison	https://github.com/cbg-ethz/shorah	Biopython, NumPy, Perl, zlib, pkg-config, GNU scientific library	N/A	YES
KEC	1.0	k-mer spectrum	Hash table	SE	RNA viruses	BMC Bioinformatics	2012	Java	ShoRAH	http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm	FAMS; ClustalW2 or Muscle (optional)	25	NO

* MSA: multiple sequence alignment

Command for checking whether a tool trims reads:

cd /u/home/n/ngcrawfo/scratch/ERROR_CORRECTION/simulated_reads_ec/results
# Report YES if the first output file of a tool has more than one distinct read length
for tool in *; do
    dir="${tool}/IGH/splitted/sim_rl_100_cov_128"
    f=$(ls "${dir}" | head -1)
    n=$(awk 'NR % 4 == 2 {print length($0)}' "${dir}/${f}" | sort -u | wc -l)
    if [[ $n -eq 1 ]]; then echo "${tool} NO"; else echo "${tool} YES"; fi
done
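The same check can be sketched in Python for a single FASTQ file. The function names are illustrative, and the heuristic assumes, as the shell command does, that all input reads had the same length (here the simulated reads are 100 bp), so that more than one distinct length in the output indicates trimming.

```python
import io

def read_lengths(fastq_handle):
    """Return the set of distinct read lengths in a four-line-per-record FASTQ."""
    lengths = set()
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 == 1:            # second line of each record is the sequence
            lengths.add(len(line.rstrip("\n")))
    return lengths

def tool_trims(fastq_handle):
    """Flag a tool as trimming if its output has more than one read length."""
    return len(read_lengths(fastq_handle)) > 1

untrimmed = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n")
trimmed = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTG\n+\nIII\n")
print(tool_trims(untrimmed), tool_trims(trimmed))  # False True
```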

How to run error correction tools

Quake

To install:

# Edit the Makefile to include the location of where the boost library is installed
sed -i "s#-I/opt/local/var/macports/software/boost/1.46.1_0/opt/local/include#-I/usr/include/boost#" Quake/src/Makefile
make

To run:

cat <fastq file> | Quake/bin/count-kmers -k <k-mer length> > counts.txt
Quake/bin/cov_model --int counts.txt
Quake/bin/correct -r <fastq file> -k <k-mer length> -m counts.txt -a cutoff.txt

SOAPdenovo Corrector

To install:

make

To run:

# If k-mer size <= 17:
./KmerFreq_AR -k <k-mer length> -p <output prefix> <list of read files>
./Corrector_AR -k <k-mer length> <prefix.freq.cz> <prefix.freq.cz.len> <list of read files>

# If k-mer size > 17:
./KmerFreq_HA -k <k-mer length> -p <output prefix> -L <maximum read length> -l <list of read files>
./Corrector_HA -k <k-mer length> <prefix.freq.cz> <list of read files>

Coral

To install:

make

To run:

./coral [-f, -fq or -fs for input file format] <input file> -o <output file>

RECKONER

To install:

make

To run:

reckoner -memory <8G> -prefix <output directory> -kmerlength <K> <list of fastq files>

SGA

To install:

./autogen.sh
./configure && make && make install

To run:

sga preprocess <input fastq> > <output fastq>.preprocessed.fastq
sga index -a ropebwt <output fastq>.preprocessed.fastq
sga correct -k <K> -o <output fastq>.out.fastq <output fastq>.preprocessed.fastq

Preparing golden true datasets

T3

Before error correction, we removed the barcode (the first 13 bp) from each original read. The resulting read length was 87 bp.
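A minimal sketch of this barcode removal is shown below, assuming plain four-line FASTQ records with the barcode at the start of the read. The function name is illustrative and this is not the script used in the study. The quality string is trimmed together with the sequence so that both keep the same length.

```python
import io

BARCODE_LENGTH = 13  # barcode length stated above

def strip_barcodes(fastq_in, fastq_out, barcode_length=BARCODE_LENGTH):
    """Copy a FASTQ stream, dropping the leading barcode from each read."""
    for line_number, line in enumerate(fastq_in):
        line = line.rstrip("\n")
        if line_number % 4 in (1, 3):       # sequence line and quality line
            line = line[barcode_length:]
        fastq_out.write(line + "\n")

read = "A" * 13 + "C" * 87                  # 13 bp barcode followed by 87 bp of insert
source = io.StringIO(f"@r1\n{read}\n+\n{'I' * 100}\n")
target = io.StringIO()
strip_barcodes(source, target)
print(target.getvalue())
```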

Virus-specific tools

We ran two virus-specific tools, KEC and ShoRAH, which can correct errors in reads generated from a viral population and reconstruct local haplotypes.

KEC

As input, we mixed the forward and reverse reads together, because KEC cannot work with paired-end reads. KEC changed the original read names and collapsed some reads.

ShoRAH

ShoRAH is an alignment-based tool. We mapped the reads onto the viral reference genome prior to running the tool.

k-mer size selection

Error correction tools require the k-mer size as a parameter. We use the following formula to calculate the k-mer size from the length of the genome: k = log4(200 * G), where G is the genome size.

T1 and S1

k=19, genome size=3 000 000 000

T2

k=18, genome size=405000; 2 x log4(405 x 1000) = 18.6, truncated to 18

S2

k=18, genome size=405000; 2 x log4(405 x 1000) = 18.6, truncated to 18

T3

k=15, genome size=3400

Since we have relatively long reads of 88 bp and relatively high average read coverage of up to 100x, we chose a k-mer size of 15, which is larger than the value the formula gives for this genome size.
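The k-mer sizes listed above can be checked with a few lines of Python. Rounding down to an integer is an assumption, chosen because it reproduces the listed values; the function names are illustrative. Note that the expression written next to T2 and S2 is 2 x log4(G) rather than log4(200 * G), so both variants are shown.

```python
import math

def kmer_size(genome_size, factor=200):
    """k = floor(log4(factor * G)), the formula stated in this section."""
    return math.floor(math.log(factor * genome_size, 4))

def kmer_size_doubled(genome_size):
    """k = floor(2 * log4(G)), the expression written next to T2 and S2."""
    return math.floor(2 * math.log(genome_size, 4))

print(kmer_size(3_000_000_000))     # 19, human genome (T1 and S1)
print(kmer_size_doubled(405_000))   # 18, T2 and S2
print(kmer_size(3_400))             # 9, T3 region before the manual choice of 15
```

For T3 the formula yields a value well below 15, which is why the k-mer size for that dataset was chosen by hand.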

More tools to install

https://www.ncbi.nlm.nih.gov/pubmed/26678663
