A simple tool to generate hierarchical clustering trees from nucleotide sequences. Supports a number of distance metrics and clustering algorithms. Includes a large testset of SARSCOV2 genomes.

ArthurVM/TreeMer

TreeMer

A simple tool to generate hierarchical clustering trees from nucleotide sequences using kmer spectra distance. Included is a small testset of SARSCOV2 genomes downloaded from https://www.nlm.nih.gov/news/coronavirus_genbank.html.

Overview

This tool calculates the distance between a set of nucleotide sequences in FASTA format by digesting them into kmer count vectors (effectively kmer spectra). The pairwise distances between all vectors are calculated and clustered to build a hierarchical clustering tree. A number of distance metrics and clustering methods are supported (see Distance and Clustering).
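The steps described above can be sketched in a few lines of Python. This is a minimal illustration, not TreeMer's actual code: the `kmer_vector` helper and the toy sequences are hypothetical, and it assumes the distance and clustering steps map onto SciPy's `pdist` and `linkage`, which is plausible given the listed dependencies but is not confirmed here.

```python
from collections import Counter
from itertools import product

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform


def kmer_vector(seq, k):
    """Count every k-mer over the ACGT alphabet, in a fixed order (hypothetical helper)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return np.array([counts[kmer] for kmer in kmers], dtype=float)


# Toy sequences standing in for FASTA records (assumption, not real data).
seqs = {
    "seqA": "ACGTACGTACGTACGT",
    "seqB": "ACGTACGTACGTACGA",
    "seqC": "TTTTGGGGCCCCAAAA",
}

k = 3
matrix = np.vstack([kmer_vector(s, k) for s in seqs.values()])

# Condensed pairwise distances, their square form, and the linkage matrix.
condensed = pdist(matrix, metric="euclidean")
distances = squareform(condensed)
tree = linkage(condensed, method="ward")
```

Each row of `matrix` is one kmer spectrum, `distances` is the square matrix a heatmap would be drawn from, and `tree` is the linkage matrix a dendrogram would be drawn from.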

Installation

Installation is straightforward; simply run

git clone git@github.com:ArthurVM/TreeMer.git
cd TreeMer
python3 -m pip install -r dependencies.txt

and you are good to go!

Input

TreeMer takes a set of nucleotide sequences in FASTA format and generates kmer count files, structured as:

kmer0 count
kmer1 count
...
kmern count

in tab-separated format (denoting the kmer spectrum of the sequence). These kmer spectra are used to calculate distance vectors, from which a hierarchical clustering tree is generated.
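As a rough illustration of the file layout above, the following sketch reads tab-separated kmer count data and aligns several spectra into one count matrix. The function names `read_kmer_counts` and `align_spectra` are hypothetical, the file contents are made up, and the sketch assumes counts are integers and that kmers missing from a file count as zero.

```python
import io

import numpy as np


def read_kmer_counts(handle):
    """Parse 'kmer<TAB>count' lines into a dict (hypothetical helper)."""
    counts = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        kmer, count = line.split("\t")
        counts[kmer] = int(count)
    return counts


def align_spectra(spectra):
    """Build one count vector per sample over the union of all kmers."""
    kmers = sorted(set().union(*spectra))
    rows = [[s.get(kmer, 0) for kmer in kmers] for s in spectra]
    return kmers, np.array(rows)


# In-memory stand-ins for two kmer count files (made-up contents).
file_a = io.StringIO("AAC\t3\nACG\t5\n")
file_b = io.StringIO("ACG\t2\nCGT\t7\n")

kmers, matrix = align_spectra([read_kmer_counts(file_a), read_kmer_counts(file_b)])
```

Aligning on the union of kmers keeps every column comparable across samples, which is what a vector distance needs.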

Output

TreeMer outputs the following files:

HC_dendro.png     - The hierarchical clustering dendrogram in .png format.
HC_tree.nwk       - A text file containing the hierarchical clustering tree in Newick format. 
heatmap.png       - The heatmap of sequence distances in .png format.
heatmap.{D}.tsv   - A heatmap file in .tsv format. {D} is the distance metric used. 
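For readers unfamiliar with Newick output, the sketch below shows one common way to serialise a SciPy linkage matrix as a Newick string. It is an assumption-laden example rather than TreeMer's own writer: the `to_newick` function, the labels, and the distances are all hypothetical, and branch lengths are taken as the difference between parent and child merge heights.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree


def to_newick(node, labels, parent_dist):
    """Recursively serialise a SciPy ClusterNode as a Newick subtree (hypothetical helper)."""
    branch = parent_dist - node.dist
    if node.is_leaf():
        return f"{labels[node.id]}:{branch:.4f}"
    left = to_newick(node.get_left(), labels, node.dist)
    right = to_newick(node.get_right(), labels, node.dist)
    return f"({left},{right}):{branch:.4f}"


labels = ["seqA", "seqB", "seqC"]
# Made-up condensed distances in the order A-B, A-C, B-C.
condensed = np.array([1.0, 4.0, 4.5])

root = to_tree(linkage(condensed, method="average"))
newick = to_newick(root, labels, root.dist) + ";"
print(newick)
```

The resulting string can be opened in any tree viewer that reads Newick format.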

Usage

usage: TreeMer.py [-h] [-i I I] [-k K] [-m M] [-s]
                  [-d {distance metric}]
                  [-c {clustering method}]
                  [-g G]
                  [fa_files [fa_files ...]]

positional arguments:
  fa_files              An arbitrary number of sequence files in FASTA format.

optional arguments:
  -h, --help            show this help message and exit
  -i I I                Lower and upper bound percentiles to construct the
                        tree. E.g. 25 75 will generate a tree from kmers from
                        the 25th to the 75th percentiles in the total set of
                        kmers ordered by count.
  -k K                  Kmer size to use in constructing genome comparison.
                        Default=7.
  -m M                  The maximum count to return a kmer, e.g. return only
                        kmers with count <=10 if m=10. Default=return ALL.
  -s                    Suppress the generation of kmer-spectra from sequence
                        files. This assumes that all positional arguments
                        provided to this tool are already kmer-spectra files
                        generated by genKmerCount. Default=False.
  -d {euclidean,minkowski,cityblock,sqeuclidean,hamming,jaccard,chebyshev,canberra,braycurtis,yule}
                        Metric used in calculating distance between kmer
                        spectra. Default=euclidean.
  -c {ward,single,complete,average,weighted,centroid,median}
                        Clustering method utilised to build the tree.
                        Default=ward.
  -g G                  A tab separated text file containing geographic
                        locations for each sequence, with the sequence ID in
                        col0 and geolocation in col1. Default=False.
  -v                    Verbose output mode. Default=False.
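The `-i` option above can be read as a rank-based window over kmers ordered by count. The sketch below is one possible interpretation only: the `percentile_window` function is hypothetical, and it assumes kmers are ranked by their total count across all samples, which the help text does not state explicitly.

```python
import numpy as np


def percentile_window(matrix, lower, upper):
    """Return column indices of kmers ranked between two percentiles by total count."""
    totals = matrix.sum(axis=0)
    order = np.argsort(totals, kind="stable")  # kmers from least to most frequent
    start = int(len(order) * lower / 100)
    stop = int(len(order) * upper / 100)
    return np.sort(order[start:stop])


# Made-up count matrix: 2 samples x 10 kmers.
matrix = np.array([
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
])

keep = percentile_window(matrix, 10, 90)
filtered = matrix[:, keep]
```

With bounds of 10 and 90, the single least frequent and single most frequent of the ten kmers are dropped, leaving eight columns.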

Example Using SARSCOV2 Dataset

A dataset of complete SARSCOV2 genomes is provided with this tool in the /TreeMer/SARSCOV2/SARSCOV2_WGS directory. Geolocations for each isolate are included in /TreeMer/SARSCOV2/geolocs.tsv.

The entire pipeline can be run using a single command from the TreeMer root directory:

python3 TreeMer.py SARSCOV2/SARSCOV2_WGS/* -k 7 -i 10 90 -d euclidean -c ward -g SARSCOV2/geolocs.tsv

In this instance, we calculate the Euclidean distance between 7mer frequency vectors after stripping out the 10% least and 10% most frequent kmers, and cluster using Ward's method. The resulting tree is:

[Figure: Euclidean HC Ward clustering dendrogram SARSCOV2]

Distance and Clustering

A number of distance metrics and clustering methods are supported by this tool.

Distance Metrics

  • Euclidean
  • Minkowski
  • Cityblock
  • Sqeuclidean
  • Hamming
  • Jaccard
  • Chebyshev
  • Canberra
  • Braycurtis
  • Yule

Clustering Methods

  • Ward
  • Single
  • Complete
  • Average
  • Weighted
  • Centroid
  • Median
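
The metric and method names listed above match the names accepted by SciPy's `pdist` and `linkage`. Assuming TreeMer passes them through to those functions (not confirmed here), the sketch below shows how swapping the metric or method changes the calculation. The count matrix is made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Made-up kmer count matrix: 4 samples x 5 kmers.
counts = np.array([
    [5, 0, 2, 1, 0],
    [4, 1, 2, 0, 0],
    [0, 6, 0, 3, 2],
    [1, 5, 0, 4, 2],
], dtype=float)

# The same data under two different metrics.
cityblock = pdist(counts, metric="cityblock")  # sum of absolute differences
chebyshev = pdist(counts, metric="chebyshev")  # largest single difference

# Average linkage works with any dissimilarity, so it pairs safely with cityblock.
tree = linkage(cityblock, method="average")
```

Two caveats from the SciPy documentation are worth keeping in mind when choosing a combination: the ward, centroid, and median methods are only correctly defined for Euclidean distances, and jaccard and yule are documented as dissimilarities between boolean vectors rather than count vectors.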

Dependencies

python3
argparse
scipy
numpy
matplotlib
seaborn
