TideHunter: efficient and sensitive tandem repeat detection from noisy long reads using seed-and-chain

Updates (v1.4.3)

Update to the latest abPOA

Getting started

wget https://github.com/yangao07/TideHunter/releases/download/v1.4.3/TideHunter-v1.4.3.tar.gz
tar -zxvf TideHunter-v1.4.3.tar.gz && cd TideHunter-v1.4.3

Make from source and run with test data:

make; ./bin/TideHunter ./test_data/test_50x4.fa > cons.fa

Or, install via conda and run with test data:

conda install -c bioconda tidehunter
TideHunter ./test_data/test_50x4.fa > cons.fa

Introduction

TideHunter is an efficient and sensitive tool for tandem repeat detection and consensus calling, designed for tandemly repeated long-read sequences (INC-seq, R2C2, NanoAmpli-Seq).

It works with Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing data at error rates up to 20% and places no limit on the maximal repeat pattern size.

Installation

Installing TideHunter via conda

On Linux/Unix and Mac OS, TideHunter can be installed via

conda install -c bioconda tidehunter

Building TideHunter from source files

You can also build TideHunter from source files. Make sure you have gcc (>=6.4.0) and zlib installed before compiling. It is recommended to download the latest release of TideHunter from the release page.

wget https://github.com/yangao07/TideHunter/releases/download/v1.4.3/TideHunter-v1.4.3.tar.gz
tar -zxvf TideHunter-v1.4.3.tar.gz
cd TideHunter-v1.4.3; make

Or, you can use the git clone command to download the source code. Don't forget to include --recursive so that the source code of abPOA is downloaded as well. This gives you the latest version of TideHunter, which might still be under development.

git clone --recursive https://github.com/yangao07/TideHunter.git
cd TideHunter; make

Pre-built binary executable file for Linux/Unix

If you run into any compilation issues, please try the pre-built binary file:

wget https://github.com/yangao07/TideHunter/releases/download/v1.4.3/TideHunter-v1.4.3_x64-linux.tar.gz
tar -zxvf TideHunter-v1.4.3_x64-linux.tar.gz

Getting started with a toy example in `test_data`

TideHunter ./test_data/test_1000x10.fa > cons.fa

Usage

To generate consensus sequences in FASTA format

TideHunter ./test_data/test_1000x10.fa > cons.fa

To generate consensus sequences in tabular format

TideHunter -f 2 ./test_data/test_1000x10.fa > cons.out

To generate full-length consensus sequences

TideHunter -5 ./test_data/5prime.fa -3 ./test_data/3prime.fa ./test_data/full_length.fa > cons_full.fa

To generate unit sequences in FASTA format

TideHunter -u ./test_data/test_1000x10.fa > unit.fa

To generate unit sequences in tabular format

TideHunter -u -f 2 ./test_data/test_1000x10.fa > unit.out
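
When these invocations are scripted, it can help to assemble the argument list in one place. The following is a minimal sketch that only uses flags listed in this README (-f, -t, -u, -5, -3); the helper name `tidehunter_cmd` and its parameter names are hypothetical, and nothing is executed.

```python
import shlex


def tidehunter_cmd(reads, out_fmt=1, unit_seq=False, five_prime=None,
                   three_prime=None, threads=4, exe="TideHunter"):
    """Build the argument list for one TideHunter run (nothing is executed).

    `exe` is assumed to be on PATH; all names here are illustrative only.
    """
    if out_fmt not in (1, 2):
        raise ValueError("out_fmt must be 1 (FASTA) or 2 (tabular)")
    cmd = [exe, "-f", str(out_fmt), "-t", str(threads)]
    if unit_seq:
        cmd.append("-u")                  # unit sequences, no consensus
    if five_prime:
        cmd += ["-5", five_prime]         # 5' adapter file
    if three_prime:
        cmd += ["-3", three_prime]        # 3' adapter file
    cmd.append(reads)
    return cmd


cmd = tidehunter_cmd("./test_data/test_1000x10.fa", out_fmt=2, unit_seq=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, stdout=fh, check=True)` with `fh` opened on the desired output file, which mirrors the shell redirection used above.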

Commands and options

Options: 
  Seeding:
    -k --kmer-length INT    k-mer length (no larger than 16). [8]
    -w --window-size INT    window size. [1]
    -s --step-size   INT    step size. [1]
    -H --HPC-kmer           use homopolymer-compressed k-mer. [False]
  Tandem repeat criteria:
    -c --min-copy    INT    minimum copy number of tandem repeat. [2]
    -e --max-diverg  FLT    maximum allowed divergence rate between two consecutive repeats. [0.25]
    -p --min-period  INT    minimum period size of tandem repeat. (>=2) [30]
    -P --max-period  INT    maximum period size of tandem repeat. (<=4294967295) [10K]
  Scoring parameters for partial order alignment:
    -M --match    INT       match score [2]
    -X --mismatch INT       mismatch penalty [4]
    -O --gap-open INT(,INT) gap opening penalty (O1,O2) [4,24]
    -E --gap-ext  INT(,INT) gap extension penalty (E1,E2) [2,1]
                            TideHunter provides three gap penalty modes, cost of a g-long gap:
                            - convex (default): min{O1+g*E1, O2+g*E2}
                            - affine (set O2 as 0): O1+g*E1
                            - linear (set O1 as 0): g*E1
  Adapter sequence:
    -5 --five-prime  STR    5' adapter sequence (sense strand). [NULL]
    -3 --three-prime STR    3' adapter sequence (anti-sense strand). [NULL]
    -a --ada-mat-rat FLT    minimum match ratio of adapter sequence. [0.80]
  Output:
    -o --output      STR    output file. [stdout]
    -u --unit-seq           only output the unit sequences of each tandem repeat, no consensus sequence. [False]
    -l --longest            only output the consensus sequence of the tandem repeat that covers the longest read sequence. [False]
    -F --full-len           only output full-length consensus sequence. [False]
    -f --out-fmt     INT    output format. [1]
                            - 1: FASTA
                            - 2: Tabular
  Computing resource:
    -t --thread      INT    number of threads to use. [4]

  General options:
    -h --help               print this help usage information.
    -v --version            show version number.
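
To make the three gap penalty modes concrete, here is a small sketch that evaluates the gap cost formulas quoted in the help text with the default values (O1,O2 = 4,24 and E1,E2 = 2,1). The function `gap_cost` is hypothetical and only restates those formulas; it is not taken from the TideHunter or abPOA source.

```python
def gap_cost(g, o1=4, e1=2, o2=24, e2=1):
    """Cost of a g-long gap under the modes described in the help text.

    Mode selection follows the help text: O1 == 0 means linear,
    O2 == 0 means affine, anything else is convex.
    """
    if g <= 0:
        return 0
    if o1 == 0:                                # linear: g*E1
        return g * e1
    if o2 == 0:                                # affine: O1+g*E1
        return o1 + g * e1
    return min(o1 + g * e1, o2 + g * e2)       # convex: min of both pieces


for g in (10, 20, 30):
    print(g, gap_cost(g), gap_cost(g, o2=0), gap_cost(g, o1=0))
```

With the defaults, the two pieces of the convex cost are equal at g = 20 (4 + 20*2 = 24 + 20*1 = 44), so shorter gaps are scored by the first piece and longer gaps by the second.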

Input

TideHunter works with FASTA, FASTQ, gzip'd FASTA(.fa.gz) and gzip'd FASTQ(.fq.gz) formats.

Adapter sequence

Adapter sequence files can optionally be provided to TideHunter with the -5 and -3 options.

TideHunter uses the adapter information to search for the full-length sequence within the generated consensus.

Once both adapters are found, TideHunter trims and reorients the consensus sequence.

Output

By default, TideHunter outputs consensus sequences in FASTA format; it can also output them in tabular format.

Tabular format

For tabular format, 11 columns will be generated for each consensus sequence:

No.	Column name	Explanation
1	readName	the original read name
2	repN	`N` is the ID number of the tandem repeat, within each read, starts from 0
3	readLen	length of the original long read
4	start	start coordinate of the tandem repeat, 1-based
5	end	end coordinate of the tandem repeat, 1-based
6	consLen	length of the consensus sequence
7	copyNum	copy number of the tandem repeat
8	aveMatch	average percent of matches between each unit sequence and the consensus sequence (# matched bases / unit length)
9	fullLen	0: not a full-length sequence, 1: sense strand full-length, 2: anti-sense strand full-length
10	subPos	start coordinates of all tandem repeat unit sequences, followed by the end coordinate of the last tandem repeat unit sequence, separated by `,`; all coordinates are 1-based, see examples below
11	consSeq	consensus sequence

For example, here is the output for a non-full-length consensus sequence generated from test_data/test_50x4.fa (the diagram example_50x4.png in the repository illustrates all the coordinates in this output):

test_50x4 rep0  300 51  250 50  4.0 100.0 0 59,109,159,208  CGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGATCAGCTAGT

In this example, TideHunter identifies three consecutive tandem repeat units, [59,108], [109,158], and [159,208], in the raw read, which is 300 bp long. A 50 bp consensus sequence is generated from the three repeat units. TideHunter further extends the tandem repeat boundary to [51,250] by aligning the consensus sequence back to the raw read on both sides of the three repeat units.
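
The relation between subPos and the unit intervals can be sketched in a few lines of Python. This is an illustrative parser, assuming the 11 columns are tab-separated (an assumption; the README only calls the format "tabular"); the names `Repeat`, `parse_row` and `unit_intervals` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Repeat(NamedTuple):
    read_name: str
    rep_id: str
    read_len: int
    start: int
    end: int
    cons_len: int
    copy_num: float
    ave_match: float
    full_len: int
    sub_pos: List[int]
    cons_seq: str


def parse_row(line: str) -> Repeat:
    """Parse one tabular output line (assumed tab-separated, 11 columns)."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 11:
        raise ValueError(f"expected 11 columns, got {len(f)}")
    return Repeat(f[0], f[1], int(f[2]), int(f[3]), int(f[4]), int(f[5]),
                  float(f[6]), float(f[7]), int(f[8]),
                  [int(x) for x in f[9].split(",")], f[10])


def unit_intervals(sub_pos: List[int]) -> List[Tuple[int, int]]:
    """1-based inclusive [start, end] of each unit, derived from subPos.

    subPos lists every unit start, then the end of the last unit, so each
    unit ends one base before the next unit starts.
    """
    starts, last_end = sub_pos[:-1], sub_pos[-1]
    ends = [s - 1 for s in starts[1:]] + [last_end]
    return list(zip(starts, ends))


# The test_50x4 example line from above, rebuilt with tab separators.
row = "\t".join([
    "test_50x4", "rep0", "300", "51", "250", "50", "4.0", "100.0", "0",
    "59,109,159,208",
    "CGATCGATCGGCATGCATGCATGCTAGTCGATGCATCGGGATCAGCTAGT",
])
rep = parse_row(row)
print(unit_intervals(rep.sub_pos))
```

For this row the derived intervals are the three units named in the explanation above.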

Another example of the output for a full-length consensus sequence generated from test_data/full_length.fa:

8f2f7766-4b8e-4c0d-9e2b-caf0e5527b19  rep0  5231  31  5215  203 8.8 95.7  1 207,798,1386,1976,2563,3155,3746,4333,4930  ACTAATAAGATCAACAGAATCAGAGTAGATAGTTCCTTGATCGGAACCAAAGGACCCCGTGCCTCAATCTCTATCCTGATGTCATGGGAGTCCTAGCAAAGCTATAGACTCAAGCAAGGCTTGGGGTCCTTTATGGAACCCAAGGATGACTCAGCAATAAAATATTTTGGTTTTGGTTTATAAAAAAAAAAAAAAAAAAAAAA

In this example, consLen (203) is the length of the full-length consensus sequence, excluding the 5' and 3' adapter sequences, and subPos (207,798,1386,1976,2563,3155,3746,4333,4930) contains the coordinates of the identified tandem repeat units.

FASTA format

For FASTA output format, the read name contains detailed information about the detected tandem repeat, i.e., columns 1 to 10 above. The sequence is the consensus sequence.

The read name of each consensus sequence has the following format:

>readName_repN_readLen_start_end_consLen_copyNum_aveMatch_fullLen_subPos
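
Because the original read name may itself contain underscores (as in test_50x4), a header in this format is easiest to split from the right. The sketch below assumes the header contains exactly the ten fields shown and nothing after subPos; the function name `parse_header` is hypothetical, and the example header is constructed from the tabular example above rather than copied from real output.

```python
FIELDS = ("readName", "repN", "readLen", "start", "end",
          "consLen", "copyNum", "aveMatch", "fullLen", "subPos")


def parse_header(header: str) -> dict:
    """Split a consensus FASTA header into its ten named fields.

    The nine trailing fields contain no underscores, so splitting from the
    right keeps any underscores inside readName intact.
    """
    parts = header.lstrip(">").strip().rsplit("_", len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        raise ValueError(f"unexpected header layout: {header!r}")
    return dict(zip(FIELDS, parts))


# Header constructed by hand from the test_50x4 tabular example.
rec = parse_header(">test_50x4_rep0_300_51_250_50_4.0_100.0_0_59,109,159,208")
print(rec["readName"], rec["consLen"], rec["subPos"])
```

All values are returned as strings; convert them with `int` or `float` as needed.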

Unit sequences

TideHunter can output the unit sequences without performing the consensus calling step when option -u/--unit-seq is enabled. In that case, the tabular format contains only the following columns:

No.	Column name	Explanation
1	readName	the original read name
2	repN	`N` is the ID number of the tandem repeat, within each read, starts from 0
3	unitX	`X` is the ID number of the unit sequence, starts from 0
4	unitSeq	unit sequence

And for the FASTA format:

>readName_repN_unitX
unitSeq X
>readName_repN_unitY
unitSeq Y
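
Unit sequences in this FASTA layout can be grouped back into their tandem repeats by splitting each header from the right. This is a minimal sketch under the header format shown above; the function name `group_units` is hypothetical, and the records in the example are made up for illustration, not real TideHunter output.

```python
from collections import defaultdict


def group_units(fasta_lines):
    """Group unit sequences by (readName, repN).

    Returns {(readName, repN): [(unitX, sequence), ...]} in input order.
    """
    groups = defaultdict(list)
    key = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Split from the right so underscores in readName are preserved.
            read_name, rep_id, unit_id = line[1:].rsplit("_", 2)
            key = (read_name, rep_id)
            groups[key].append([unit_id, ""])
        elif key is not None:
            groups[key][-1][1] += line      # sequence may span several lines
    return {k: [tuple(u) for u in v] for k, v in groups.items()}


# Made-up records, for illustration only.
example = [
    ">read_1_rep0_unit0", "ACGT",
    ">read_1_rep0_unit1", "ACGA",
    ">read_1_rep1_unit0", "TTG", "CA",
]
units = group_units(example)
print(units)
```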

Contact

Yan Gao yangao07@hit.edu.cn

Yadong Wang ydwang@hit.edu.cn

Yi Xing XINGYI@email.chop.edu

GitHub issues
