ClaMSA - Classify Multiple Sequence Alignments

ClaMSA is a tool that uses machine learning to classify sequences that are related to each other via a tree. It takes as input a tree and a multiple sequence alignment (MSA) and outputs probabilities that the MSA belongs to given classes. It is currently trained and tested to classify sequences of codons (= triplets of DNA characters) into coding (1) or non-coding (0). It builds on TensorFlow and a custom layer for Continuous-Time Markov Chains (CTMC) and trains a set of rate matrices for a classification task.

The image above shows two toy example input MSAs. Synonymous codons, which code for the same amino acid, have the same color.
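
To illustrate what a rate matrix in a CTMC means, here is a minimal sketch in plain Python. It is not ClaMSA's implementation, which runs as a TensorFlow layer on codon-sized matrices. For a rate matrix Q, the transition probabilities along a branch of length t are P(t) = exp(Qt). The 2-state matrix, the helper names mat_mul and transition_matrix, and the truncated Taylor series are all assumptions chosen to keep the example small.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(q, t, terms=30):
    """Approximate P(t) = exp(Q * t) with a truncated Taylor series.

    Adequate for small toy matrices and short branches only.
    """
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes (Q t)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

# Hypothetical 2-state rate matrix: off-diagonal rates are non-negative
# and every row sums to zero.
Q = [[-0.3, 0.3],
     [0.6, -0.6]]

P = transition_matrix(Q, t=0.5)
for row in P:
    print([round(x, 4) for x in row])
```

Each row of P is a probability distribution over the states reached after time t. In a codon model, the states would be codons rather than two abstract states.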

Requirements

Python modules

tensorflow >= 2.0
biopython
regex
newick
tqdm
pandas
protobuf3-to-dict
matplotlib
seaborn

Install requirements with

pip3 install tensorflow biopython regex newick tqdm pandas protobuf3-to-dict matplotlib seaborn

Installation

Download ClaMSA with

git clone --recurse-submodules https://github.com/Gaius-Augustus/clamsa.git

Example Classification

The commands

cd clamsa

./clamsa.py predict fasta examples/msa.lst --clades examples/example_tree.nwk --use_codons

output the table

path                    clamsa
examples/msa1.fa        0.9539
examples/msa2.fa        0.1667

Here, the two toy example alignments msa1 and msa2 pictured above are predicted to be coding with probabilities 0.9539 and 0.1667, respectively.
See the usage of prediction for an explanation of the command line structure.
See test/predict.sh for more explanations and a realistic application.
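
The output table can be post-processed with a few lines of Python. The sketch below parses the table shown above from a string and turns each probability into a 0/1 label. The function name parse_clamsa_table and the cutoff of 0.5 are assumptions made for illustration; ClaMSA itself only reports the probabilities. The parser also assumes that paths contain no whitespace.

```python
# Table text copied from the example output above.
output = """path                    clamsa
examples/msa1.fa        0.9539
examples/msa2.fa        0.1667
"""

def parse_clamsa_table(text):
    """Split a whitespace-separated table into a header and (path, prob) rows."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        path, prob = ln.split()
        rows.append((path, float(prob)))
    return header, rows

THRESHOLD = 0.5  # hypothetical cutoff, not prescribed by ClaMSA

header, rows = parse_clamsa_table(output)
labels = {path: int(prob >= THRESHOLD) for path, prob in rows}

for path, prob in rows:
    print(path, prob, "coding" if labels[path] else "non-coding")
```

A suitable cutoff depends on the application and on the trade-off between false positives and false negatives.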

Input Tree Construction

For codon MSA classification we recommend that you construct a tree the following way:

Construct a set of codon MSAs just as you would do for prediction. You only need positive examples, i.e. alignments of actual coding sequences. One option to compile such a set is AUGUSTUS-CGP.
Construct a tree with MrBayes using a codon model as described in the supplementary material of the paper cited below.

Other trees may work, but a good performance should only be expected if the tree is scaled to 1 expected codon mutation per time unit.
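
The scaling requirement can be illustrated with a common convention, assumed here because the text above does not spell out the details. A rate matrix Q with stationary distribution pi has the expected mutation rate -sum_i pi_i * Q_ii. Dividing Q by this rate, or equivalently multiplying all branch lengths by it, gives one expected mutation per time unit. The toy 2-state matrix and the function names are hypothetical.

```python
def expected_rate(q, pi):
    """Expected number of state changes per time unit at stationarity."""
    return -sum(pi[i] * q[i][i] for i in range(len(q)))

def normalize_rate_matrix(q, pi):
    """Rescale q so that the expected rate at stationarity equals 1."""
    rate = expected_rate(q, pi)
    return [[x / rate for x in row] for row in q]

# Hypothetical 2-state rate matrix and its stationary distribution
# (pi satisfies pi * Q = 0).
Q = [[-0.3, 0.3],
     [0.6, -0.6]]
pi = [2 / 3, 1 / 3]

rate = expected_rate(Q, pi)
Q_norm = normalize_rate_matrix(Q, pi)

print("rate before scaling:", round(rate, 6))
print("rate after scaling: ", round(expected_rate(Q_norm, pi), 6))
```

A tree estimated under a different unit, for example nucleotide substitutions per site, would need its branch lengths converted accordingly before use.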

Train and Test Example Data

Obtain

codon alignment training data from a fly, vertebrate and yeast clade in tfrecords format and
codon alignment test data from vertebrates in fasta format with

cd data
./download_fly_vert_yeast_train.sh
./download_vert_test.sh

Training

ClaMSA can be trained for a classification task on a training set of labeled MSAs.
See test/train.sh for more explanations and the command line that ClaMSA was trained with.

Usages

Reference

Most of ClaMSA was written by Darvin Mertsch.

Please cite:
End-to-end Learning of Evolutionary Models to Find Coding Regions in Genome Alignments, Darvin Mertsch and Mario Stanke, Bioinformatics, btac028, published 21 Jan 2022

bioRxiv preprint
