quickCOAT: quick Concatenated Ortholog Alignment Tree

quickCOAT produces a concatenated protein alignment from the protein sequences of several genomes. It first identifies single-copy orthologs among the genomes you specify, then uses those to build the alignment. A set of closely related organisms will therefore yield a long alignment that compensates for their limited divergence, while phylogenies of distantly related genomes will be based on fewer orthologs. In this way, quickCOAT is a fast, automated way to define the best possible set of orthologs for your concatenated protein phylogeny.
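The concatenation step described above can be pictured with a short Python sketch. This is an illustration of the general technique only, not quickCOAT's own code (quickCOAT is written in Julia); the function name `concatenate_alignments` and the toy sequences are hypothetical.

```python
def concatenate_alignments(ortholog_alignments):
    """Join per-ortholog alignments into one long sequence per genome.

    ortholog_alignments: list of dicts mapping genome name -> aligned sequence.
    Every dict must cover the same genomes, and all sequences within one
    dict must have the same (aligned) length.
    """
    genomes = list(ortholog_alignments[0])
    parts = {genome: [] for genome in genomes}
    for alignment in ortholog_alignments:
        if set(alignment) != set(genomes):
            raise ValueError("every ortholog must be present in every genome")
        if len({len(seq) for seq in alignment.values()}) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        for genome in genomes:
            parts[genome].append(alignment[genome])
    return {genome: "".join(chunks) for genome, chunks in parts.items()}


# Two toy ortholog alignments across three hypothetical genomes.
alignments = [
    {"genome_A": "MKT-L", "genome_B": "MKTAL", "genome_C": "MRT-L"},
    {"genome_A": "GG-V", "genome_B": "GGAV", "genome_C": "GGAV"},
]
result = concatenate_alignments(alignments)
for genome, sequence in result.items():
    print(f">{genome}\n{sequence}")
```

Each additional single-copy ortholog adds more columns to every genome's row, which is why a set of closely related genomes (with many shared orthologs) ends up with a longer alignment.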

Installation

Prerequisites

The following programs must be installed and executable from your $PATH:

Julia version 1.0 or higher
- The Julia packages DataFrames, DataStructures, Bio, CSV, and Missings must also be installed. To install them, type julia to enter the Julia REPL, then type ]add DataFrames DataStructures CSV Missings to install the packages. Once package installation is complete, press Backspace, then type exit() to leave the Julia REPL.
BLAST+
muscle

Alignments generated using quickCOAT may benefit from trimming using Gblocks.

Gblocks

You will also need some way of building a phylogenetic tree from the multiple sequence alignment that quickCOAT generates, for example PhyML or MrBayes.

Installation

Download the newest release, make the files in the bin directory executable and add that directory to your $PATH. One way of achieving this on Linux or macOS is:

wget -O quickCOAT-0.4.0.tar.gz https://github.com/ianpgm/quickCOAT/archive/v0.4.0.tar.gz
tar zxvf quickCOAT-0.4.0.tar.gz
chmod +x quickCOAT-0.4.0/quickcoat/bin/*
cd quickCOAT-0.4.0/quickcoat/bin/
echo "export PATH=$PWD:\$PATH" >> ~/.profile

Open a new terminal window for the changes to take effect.

You can run the test to see whether quickCOAT is working correctly by typing quickcoat.run_test.

Usage

1. Make a new folder.
2. Copy all of the genomes you want to analyse into that new folder. Each genome should be a single amino acid FASTA file containing that genome's protein sequences. The filename must end with .faa for quickCOAT to recognise it as an input file.
3. Run quickCOAT. Type quickcoat followed by these parameters:

-r or --reference: The filename of your reference genome. Orthologs will be defined based on BLAST results relative to this genome.
-q or --query_folder: The name of the folder you created in step 1.
-e or --evalue_threshold: The maximum e-value from the BLAST results for a pair of sequences to count as orthologs. For example, 0.00001. The default is infinite (no threshold).
-i or --identity_threshold: The minimum percentage identity from the BLAST results for a pair of sequences to count as orthologs. For example, 35. The default is 0 (no threshold).
-o or --output_folder: The name of the folder quickCOAT will create for your output files. This folder must not already exist; if it does, quickCOAT will produce an error.
-b or --bitscore_threshold: The minimum bitscore ratio for a pair of sequences to count as orthologs. For example, 0.9. The default is 0 (no threshold).
-t or --threads: The number of blastp and muscle instances that will be run in parallel.

An example command: quickcoat -r genome_of_interest.faa -q input_sequence_folder -e 0.00001 -i 35 -t 8 -o output_folder
4. Some tree-building software requires a phylip- or nexus-formatted file for input (e.g. PhyML, MrBayes). Conversion programs for both formats are included. Use the following commands: quickcoat.fasta_to_phylip output_folder/concatenated_alignment.faa and quickcoat.fasta_to_nexus output_folder/concatenated_alignment.faa. The files concatenated_alignment.phy and concatenated_alignment.nex, respectively, will appear in your output folder.
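To make the -e, -i and -b thresholds described above more concrete, here is a minimal Python sketch of how such thresholds could act together as a filter on a BLAST hit. It is not taken from quickCOAT's source. The function names are hypothetical, and the definition of the bitscore ratio used here (the hit's bitscore divided by the bitscore of the reference protein aligned to itself) is an assumption, since this README does not define it. The defaults mirror those documented above.

```python
import math


def compute_bitscore_ratio(hit_bitscore, self_bitscore):
    # ASSUMPTION: ratio of the hit's bitscore to the reference protein's
    # self-hit bitscore; the README does not spell out the definition.
    return hit_bitscore / self_bitscore


def passes_thresholds(evalue, percent_identity, bitscore_ratio,
                      evalue_threshold=math.inf,
                      identity_threshold=0.0,
                      bitscore_threshold=0.0):
    """Return True if a hit satisfies all three thresholds.

    The e-value threshold is a maximum; the identity and bitscore ratio
    thresholds are minimums. The defaults accept every hit.
    """
    return (evalue <= evalue_threshold
            and percent_identity >= identity_threshold
            and bitscore_ratio >= bitscore_threshold)


# A hypothetical hit: e-value 1e-30, 42% identity, bitscore 190 vs. self-hit 200.
ratio = compute_bitscore_ratio(190.0, 200.0)
strong_hit = passes_thresholds(1e-30, 42.0, ratio,
                               evalue_threshold=0.00001,
                               identity_threshold=35,
                               bitscore_threshold=0.9)

# The same thresholds reject a hit with only 30% identity.
weak_hit = passes_thresholds(1e-30, 30.0, ratio,
                             evalue_threshold=0.00001,
                             identity_threshold=35,
                             bitscore_threshold=0.9)
print(strong_hit, weak_hit)
```

With the default values, every hit passes, which matches the "no threshold" behaviour documented for each parameter.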

Output

The output will appear in the folder that you specify. The following files will be generated:

The reference BLAST database. You shouldn't have to look at this.
ortholog_table.tsv: This is a tab-separated-value table containing the identifiers of all of the orthologs in your genome set.
single_copy_ortholog_table.tsv: This is the subset of ortholog_table.tsv containing just those orthologs that appear exactly once in every genome. This is what the concatenated alignment is built on.
concatenated_alignment.faa: This is the concatenated protein alignment in FASTA format, suitable for building phylogenetic trees.
report.txt: This report stores the input files and parameters for the run, as well as the annotations of the proteins used for the alignment.
blast_output: The folder containing BLAST output for each blastp run (with bitscore ratio included by quickCOAT).
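The relationship between ortholog_table.tsv and single_copy_ortholog_table.tsv can be sketched in Python as follows. This is only an illustration of the "exactly once in every genome" rule. The in-memory layout (reference protein ID mapped to a per-genome list of matching protein IDs) and all identifiers are hypothetical, and the actual column layout of the .tsv files may differ.

```python
def single_copy_orthologs(ortholog_table, genomes):
    """Keep only orthologs with exactly one matching protein in every genome.

    ortholog_table: dict mapping reference protein ID ->
                    dict mapping genome name -> list of matching protein IDs
                    (hypothetical layout, chosen for illustration).
    """
    single_copy = {}
    for reference_id, hits_by_genome in ortholog_table.items():
        if all(len(hits_by_genome.get(genome, [])) == 1 for genome in genomes):
            single_copy[reference_id] = {
                genome: hits_by_genome[genome][0] for genome in genomes
            }
    return single_copy


table = {
    "ref_0001": {"genome_A": ["A_17"], "genome_B": ["B_03"]},
    "ref_0002": {"genome_A": ["A_22", "A_23"], "genome_B": ["B_09"]},  # two copies in genome_A
    "ref_0003": {"genome_A": ["A_40"], "genome_B": []},                # absent from genome_B
}
selected = single_copy_orthologs(table, ["genome_A", "genome_B"])
print(sorted(selected))
```

Only ref_0001 survives in this toy example, because the other two entries are either duplicated in one genome or missing from another.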

How it works

Getting help

If something isn't working, please post an issue on GitHub or send an email to the author, ianpgm at bios dot au dot dk.
