quickCOAT produces a concatenated protein alignment based on input protein sequences from several genomes. It starts out by defining single-copy orthologs amongst the set of genomes you specify and uses those to build the alignment. A set of closely related organisms will therefore yield a long alignment, which compensates for their limited divergence, while phylogenies of distantly related genomes will be based on fewer orthologs. In this way, quickCOAT is a fast, automated way to define the best possible set of orthologs for your concatenated protein phylogeny.
The following programs must be installed and executable from your $PATH:
- Julia version 1.0 or higher
- Julia packages DataFrames, DataStructures, Bio, CSV, and Missings must also be installed. To install these, type `julia` to enter the Julia REPL, then type `]add DataFrames DataStructures CSV Missings` to install the packages. Once package installation is complete, press backspace, then type `exit()` to exit the Julia REPL.
- BLAST+
- muscle
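Before running quickCOAT, it can help to confirm that the required executables are visible on your $PATH. The following is a minimal sketch and not part of quickCOAT: the function name `check_tools` is hypothetical, and the command names `julia`, `blastp`, and `muscle` are assumptions about how these programs are installed on your system.

```shell
#!/bin/sh
# check_tools: hypothetical helper, not shipped with quickCOAT.
# Prints the name of every requested command that is not found on $PATH
# and returns non-zero if any are missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumed executable names; adjust them if your installation differs.
check_tools julia blastp muscle || echo "Install the missing programs before running quickCOAT."
```

This only checks that the commands can be found; it does not check versions or the installed Julia packages.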
Alignments generated using quickCOAT may benefit from trimming using Gblocks.
You will also need some way of building a phylogenetic tree using the multiple sequence alignment that quickCOAT generates. Here are some options:
Download the newest release, make the files in the `bin` directory executable, and add them to your $PATH. One way of achieving this on Linux or macOS is:
wget -O quickCOAT-0.4.0.tar.gz https://github.com/ianpgm/quickCOAT/archive/v0.4.0.tar.gz
tar zxvf quickCOAT-0.4.0.tar.gz
chmod +x quickCOAT-0.4.0/quickcoat/bin/*
cd quickCOAT-0.4.0/quickcoat/bin/
echo "export PATH=$PWD:\$PATH">>~/.profile
Open a new terminal window for the changes to take effect.
You can run the test to see whether quickCOAT is working correctly by typing `quickcoat.run_test`.
- Make a new folder.
- Copy all of the genomes you want to analyse into that new folder. Each genome should be a single FASTA amino acid file containing that genome's protein sequences. The filename must end with `.faa` for quickCOAT to recognise it as an input file.
- Run quickCOAT. Type `quickcoat` followed by the following parameters:
  - `-r` or `--reference`: The filename of your reference genome. Orthologs will be defined based on BLAST results relative to this genome.
  - `-q` or `--query_folder`: The name of the folder you created in step 1.
  - `-e` or `--evalue_threshold`: The maximum e-value from the BLAST results for a pair of sequences to count as orthologs. For example, `0.00001`. The default is infinite (no threshold).
  - `-i` or `--identity_threshold`: The minimum percentage identity from the BLAST results for a pair of sequences to count as orthologs. For example, `35`. The default is 0 (no threshold).
  - `-o` or `--output_folder`: The name of the folder quickCOAT will create for your output files. This folder must not already exist; otherwise quickCOAT will produce an error.
  - `-b` or `--bitscore_threshold`: The bitscore ratio threshold for a pair of sequences to count as orthologs. For example, `0.9`. The default is 0 (no threshold).
  - `-t` or `--threads`: The number of blastp and muscle instances that will be run in parallel.
- An example command:
quickcoat -r genome_of_interest.faa -q input_sequence_folder -e 0.00001 -i 35 -t 8 -o output_folder
- Some tree-building software requires a PHYLIP- or NEXUS-formatted file for input (e.g. PhyML, MrBayes). Programs for these conversions are included. Use the following commands: `quickcoat.fasta_to_phylip output_folder/concatenated_alignment.faa` and `quickcoat.fasta_to_nexus output_folder/concatenated_alignment.faa`. The files `concatenated_alignment.phy` or `concatenated_alignment.nex`, respectively, will appear in your output folder.
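Two of the requirements in the steps above (input files must end in `.faa`, and the output folder must not exist yet) can be checked before launching a run. The sketch below is an illustration rather than part of quickCOAT; the function name `preflight` and its messages are hypothetical.

```shell
#!/bin/sh
# preflight: hypothetical helper, not part of quickCOAT.
# Usage: preflight QUERY_FOLDER OUTPUT_FOLDER
# Checks that QUERY_FOLDER holds at least one .faa file and that
# OUTPUT_FOLDER does not exist yet.
preflight() {
    query_folder=$1
    output_folder=$2
    status=0

    if [ ! -d "$query_folder" ]; then
        echo "error: query folder '$query_folder' not found"
        return 1
    fi

    # Count regular files whose names end in .faa.
    n_faa=0
    for f in "$query_folder"/*.faa; do
        if [ -f "$f" ]; then
            n_faa=$((n_faa + 1))
        fi
    done
    if [ "$n_faa" -eq 0 ]; then
        echo "error: no .faa files in '$query_folder'"
        status=1
    else
        echo "found $n_faa .faa file(s)"
    fi

    # quickCOAT is described as failing if the output folder exists.
    if [ -e "$output_folder" ]; then
        echo "error: output folder '$output_folder' already exists"
        status=1
    fi
    return "$status"
}
```

With the folder names from the example command, this could be used as `preflight input_sequence_folder output_folder` before calling `quickcoat`.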
The output will appear in the folder that you specify. The following files will be generated:
- The reference BLAST database. You shouldn't have to look at this.
- `ortholog_table.tsv`: This is a tab-separated-value table containing the identifiers of all of the orthologs in your genome set.
- `single_copy_ortholog_table.tsv`: This is a subset of `ortholog_table.tsv` containing just those orthologs appearing exactly once in every genome. This is what the concatenated alignment is built on.
- `concatenated_alignment.faa`: This is the concatenated protein alignment in FASTA format, suitable for building phylogenetic trees.
- `report.txt`: This report stores the input files and parameters for the run, as well as the annotations of the proteins used for the alignment.
- `blast_output`: The folder containing BLAST output for each blastp run (with bitscore ratio included by quickCOAT).
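Because `concatenated_alignment.faa` is an alignment in FASTA format, every sequence in it should have the same length. A quick sanity check along those lines might look like the sketch below. The function name `alignment_summary` is hypothetical, and the code assumes ordinary FASTA with `>` header lines; it is not part of quickCOAT.

```shell
#!/bin/sh
# alignment_summary: hypothetical helper, not part of quickCOAT.
# Usage: alignment_summary FASTA_FILE
# Prints the number of sequences and the length(s) found. A valid
# alignment is expected to report one length shared by all sequences.
alignment_summary() {
    awk '
        /^>/ {                        # header line starts a new record
            if (seen) lengths[len]++
            seen = 1; len = 0; n++
            next
        }
        {                             # sequence line: strip whitespace, add length
            gsub(/[ \t\r]/, "")
            len += length($0)
        }
        END {
            if (seen) lengths[len]++
            printf "sequences: %d\n", n
            for (l in lengths) printf "length %d: %d sequence(s)\n", l, lengths[l]
        }
    ' "$1"
}
```

For example, run `alignment_summary output_folder/concatenated_alignment.faa`; more than one `length` line in the output would suggest that the file is not a consistent alignment.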
If something isn't working, please post an issue on GitHub or send an email to the author, ianpgm at bios dot au dot dk.