HeatSeq™ uses hierarchical clustering to generate distance based cluster predictions for genomic (ANI, AAI) and metagenomic (beta distance, MASH, Simka) data, or any other all vs. all distance type matrix, and it builds a clustered heatmap with column labels to highlight predicted cluster assignments and other user provided metadata. It outputs a seaborn clustermap in vector based PDF format along with separate PDF legend corresponding to each metadata row. It also outputs a metadata tab separated value (tsv) file with the predicted cluster assignments for each cluster threshold for each genome, metagenome, or sample along with a separate tsv file specifying the color assignments. These output files may be modified by the user and input back into the code to customize the final figure output.

The genome/sample order in the predicted_clusters.tsv file matches the order of the clustermap. The outputs also include a meta_colors.tsv file where a color has been assigned to each predicted cluster for each distance threshold, and a distmat.tsv which is the distance matrix reordered to match the clustermap.

genomes	clade-0 (96.0% ANI; t=0.04)	clade-1 (97.4% ANI; t=0.026)	clade-2 (98.0% ANI; t=0.02)	clade-3 (98.7% ANI; t=0.013)	clade-4 (99.2% ANI; t=0.008)	clade-5 (99.8% ANI; t=0.002)
gnm_1383800v1	c0-006	c1-007	c2-008	c3-013	c4-032	c5-039
gnm_1389944v1	c0-006	c1-007	c2-008	c3-019	c4-048	c5-059
gnm_1380692v1	c0-006	c1-007	c2-008	c3-019	c4-048	c5-059
gnm_1380608v1	c0-006	c1-007	c2-008	c3-019	c4-048	c5-059
gnm_1380652v1	c0-006	c1-007	c2-008	c3-019	c4-048	c5-059
gnm_1388647v1	c0-006	c1-007	c2-008	c3-019	c4-048	c5-059
gnm_1388747v1	c0-006	c1-007	c2-008	c3-019	c4-048	c5-059
gnm_1383334v1	c0-006	c1-007	c2-008	c3-024	c4-053	c5-066
gnm_1388961v1	c0-006	c1-007	c2-008	c3-024	c4-053	c5-066
gnm_1389954v1	c0-006	c1-007	c2-008	c3-024	c4-053	c5-066
gnm_1389962v1	c0-006	c1-007	c2-008	c3-024	c4-053	c5-066
gnm_1389099v1	c0-006	c1-007	c2-008	c3-020	c4-049	c5-061
gnm_1383380v1	c0-006	c1-007	c2-008	c3-020	c4-049	c5-060

Additional output files (see tests folder for actual files)

distmat.tsv - square data matrix sorted according to clustered heatmap
meta_colors.tsv - colors corresponding to unique values in predicted_clusters.tsv file
Legend_clade-x.pdf - Legends corresponding to the predicted clades. These can be added to the figure using Adobe Illustrator or other.

Usage

Their are three initial options:

Predict clusters for local minimumns in the distance distribution (default).
Predict clusters for user defined values (-user). e.g. -user 95 97.5 98.5 99.5
Skip cluster prediction and use user defined metadata. e.g. -m meta_data.tsv -c meta_colors.tsv

And several additional parameters:

-dtype, Specify data type. One of fastANI (default), ANI, AAI, Mash, Simka, Distance
1. fastANI (default) - this takes the longform fastANI output and parses it into a distance matrix.
2. ANI - this is for a square all vs. all matrix of ANI values with 100's down the diagonal.
3. AAI - this is for a square all vs. all matrix of AAI values with 100's down the diagonal.
4. Mash - this is for a Mash distance matrix.
5. Simka - this is for a Simka distance matrix.
6. Distance - this is any square distance matrix with values 0-1.
-dmin, Specify the minimum value for colors in the figure.
-dmax, Specify the maximum value for colors in the figure.
-metric, set different cluster metrics. choose from:
1. euclidean, seuclidean, correlation, hamming, jaccard, braycurtis
2. See webpage/docs for scipy.spatial.distance.pdist for additional options and more info.
-method, set different cluster method choose from:
1. single, complete, average, weighted, centroid, median, ward
2. See webpage/docs for scipy.cluster.hierarchy.linkage for more info.
-no, Create clustermap of the data without cluster prediction or metadata.
-bw, Adjust the KDE bandwidth parameter to control smoothing and valley prediction.
-l, Add axis labels to plot. Default is no labels because they are typically difficult to read.
-hcg, Specify the number of colors to use in the heatmap color gradient (2-11). If you would like to customize the colors, you can edit the simple color list in HeatSeq.py on lines 273-277 with any colors you'd like.

# print the help menu
python HeatSeq.py -h

Case One: Default

The default case uses scipy.stats.gaussian_kde and scipy.signal.find_peaks to estimate local minimums and maximums in the distance distribution. The local minimums are used as distance thresholds during the hierarchical clustering process.

python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/01_example_default

Note the bandwidth affects the smoothing in the KDE estimate which in turn influences the local minimum (valley) estimates. You can use the -bw parameter to experiment

KDE with peaks (local maximums) and valleys (local minimums)

Case Two: User Defined

The user defined case allows the local minimum estimates to be amended, appended, or changed entirely based on pure curiosity and speculation. In the example below we amended and appended the local minimums that were identified in case one using some intra-species level distance thresholds.

Use -user parameter to input desired thresholds

python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/02_example_default -user 96 97.4 98 98.7 99.2 99.8

Case Three: Custom Metadata

The custom metadata case allows the user to modify the predicted_clusters.tsv and the meta_colors.tsv files output from the case one or two outputs to include additional metadata categories and/or alter the colors. In this example we've added niche and clermontyping assignments for each genome and we've added color assignments for each unique niche or clermontyping category.

Alternately, Case one and/or two may be skipped entirely if the user prepares a properly formatted meta data file (predicted_clusters.tsv) and a meta colors file on their own.

Meta file formatting

See example files in tests folder

Meta data: A tab separated metadata file with a row for each genome in the ANI file with the exact genome name, and columns for each meta value. File should include a header.
Meta colors: A tab separated file of unique meta catagories and colors. Note: Each meta data category must be unique across all meta data (e.g. Site B1, Phylogroup B1 cannot both be B1).

Meta data example:

Genome	Site	Niche	Phylogroup
Genome1	Site A1	Sheep	B1
Genome2	Site B1	Cow	E
Genome3	Site C1	Pig	A

Meta colors example:

Site A1	#ffffb3
Site B1	#377eb8
Site C1	#ff7f00
Sheep	#f781bf
Cow	#4daf4a
Pig	#3f1gy5
B1	#f781bf
E	#4daf4a
A	#3f1gy5

Use -m and -c parameters to input file paths

An easy way to customize the metadata is to run HeatSeq™ like Case One or Case Two, and then open the predicted_clusters.tsv and meta_colors.tsv files in Excel or other spreadsheet program to amend and/or append the desired data.

python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/03_example_default -m files/01_example_predicted_clusters.tsv -c files/01_example_meta_colors.tsv

(Return to Table of Contents)

Examples

fastANI

fastANI is the default case and was utilized in the case examples above. Here we show examples with some optional parameters

fastANI: https://github.com/ParBLiSS/FastANI

# Use fastANI in many to many WITHOUT the --matrix parameter
# if used --matrix and have the fastANI.matrix file use -dtype ANI outlined below.
./fastANI --ql [QUERY_LIST] --rl [REFERENCE_LIST] -o 01_example_fastANI_allV.tsv

# set min and max
python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/xa_example_default -dmin 90 -dmax 99 -user 98.5 96.5

# no metadata or predicted clusters
python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/xb_example_default -no True

# different method and metric
python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/xc_example_default -method centroid -metric correlation -user 99.5 98.5 97.5 96.5

ANI

This option is to accommodate ANI values estimated with other tools. Arrange the data into a square matrix and format at a tsv file.

This matrix should be actual ANI values with 100's along the diagonal (for self matches) and not 100-ANI with 0's on the diagonal.

python HeatSeq.py -i files/01_example_fastANI_allV.tsv -o tests/04_example_ANI -dtype ANI

this option has incomplete example and testing but should work.

AAI

This options alters the value range for hierarchical clustering and the heatmap to accommodate the lower percentage range of AAI estimates. Any AAI tool may be used. Arrange the data into a square matrix and format at a tsv file.

This matrix should be actual AAI values with 100's along the diagonal (for self matches) and not 100-AAI with 0's on the diagonal.

One AAI option is from the Enveomics Collection. Enveomics collection: http://enve-omics.ce.gatech.edu/enveomics/docs

First run aai.rb for all vs all genomes: http://enve-omics.ce.gatech.edu/enveomics/docs?t=aai.rb
Then run Table.df2dsit.R to convert to a square distance matrix: http://enve-omics.ce.gatech.edu/enveomics/docs?t=Table.df2dist.R

python HeatSeq.py -i files/05_example_AAI_allV.tsv -o tests/05_example_AAI -dtype AAI -l True -hcg 3 -m files/05a_metadata.tsv -c files/05b_metacolors.tsv

Mash

This option takes the output from Mash. It alters the value range for hierarchical clustering and the heatmap to fit 0-1. Diagonals (self matches) should be 0's.

Mash releases: https://github.com/marbl/Mash/releases Mash docs: https://mash.readthedocs.io/

# Step 1: create sketch of all genomes or metagenomes.
mkdir mySketches
for f in myFiles/*fasta; do n=`basename $f | cut -d. -f1`; ./mash sketch -o mySketches/$n $f; done
# * # myFiles may contain metagenomes or genomes in fastq or fasta format.

# Step 2: concatenate sketches in single file.
./mash paste allSketches mySketches/*.msh

# step 3. compute all vs all dist with allSketches.
./mash dist allSketches.msh allSketches.msh -t > files/06_example_mash_allV.tsv

# Step 4: predict clusters and build heatmap figure.
python HeatSeq.py -i files/06_example_mash_allV.tsv -o tests/06_example_MASH -dtype Mash -user 0.06 0.08 0.1

Simka

This option takes the output from Simka. It alters the value range for hierarchical clustering and the heatmap to fit 0-1. Diagonals (self matches) should be 0's.

Simka: https://github.com/GATB/simka

# Step 1: Generate all vs all input file.
>for f in ${reads_dir}/*1.fastq; do n=basename $f | cut -d_ -f1; echo '${n}:$f;${n}2.fastq'; done > simka_input.txt

# Step 2: run all vs all simka with 50GB and 12 cores.
simka -max-memory 50000 -nb-cores 12 -in simka_input.txt -out simka_out -out-tmp simka_temp

# Step 3: predict clusters and build heatmap figure.
python HeatSeq.py -i files/07_example_simka_allV.txt -o tests/07_example_SIMKA -dtype Simka -user 0.5 0.7 0.9

Distance

Any square matrix with values 0-1 may be used. Matrix file should be formatted as a tsv file with diagonals (self matches) equal to 0.

this option has incomplete example and testing but should work.

python HeatSeq.py -i files/08_example_distance_allV.tsv -o tests/08_example_distance -dtype Distance

(Return to Table of Contents)

Software Dependencies

External dependencies

Code in this repo uses only Python and Python packages which are listed below. However, it requires the user to independently compute distance (ANI, AAI, Mash etc.) with an external tool prior to running this code.

Use fastANI, enveomics, Mash, Simka etc.

Required packages for Python

Python and all packages can be easily installed with conda or pip. Prodigal, BLAST+ and MMseqs2 can also be installed easily with Conda. Just search "conda install name"

References

Sanner MF. Python: a programming language for software integration and development. J Mol Graph Model. 1999 Feb 1;17(1):57-61.
Van Rossum G, Drake FL. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace; 2009.
McKinney W, others. Data structures for statistical computing in python. In: Proceedings of the 9th Python in Science Conference. 2010. p. 51–6.
Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585:357–62.
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods. 2020;17:261–72.
Hunter JD. Matplotlib: A 2D graphics environment. Computing in science & engineering. 2007;9(3):90–5.
Waskom ML. Seaborn: statistical data visualization. Journal of Open Source Software. 2021 Apr 6;6(60):3021.

(Return to Table of Contents)

How to Cite

If you use any part of this workflow in your research please cite the following manuscript:

PLACEHOLDER FOR MANUSCRIPT CITATION AND LINK

Future Improvements

Add additional parsing support for outputs from other tools to save the user from creating the distance matrix.
Add support for checkM or other tools to create the option of selecting a cluster representative for each predicted cluster.
Add cluster analysis tools and figures to evaluate the effectiveness of clustering at each distance threshold. Silhouette plots etc.
Add KDE bandwidth parameter optimization.

(Return to Table of Contents)

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
files		files
images		images
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
HeatSeq.py		HeatSeq.py
LICENSE		LICENSE
README.md		README.md

License

rotheconrad/HeatSeq

Folders and files

Latest commit

History

Repository files navigation

HeatSeq™: cluster munitions for microbial genomics

Table of Contents

Example output

Finished heatmap figure

Predicted clusters table (see: files/example_predicted_clusters.tsv)

Additional output files (see tests folder for actual files)

Usage

Case One: Default

KDE with peaks (local maximums) and valleys (local minimums)

Case Two: User Defined

Case Three: Custom Metadata

Meta file formatting

Examples

fastANI

ANI

AAI

Mash

Simka

Distance

Software Dependencies

External dependencies

Required packages for Python

References

How to Cite

Future Improvements

About

Resources

License

Stars

Watchers

Forks

Languages