Skip to content

nstenz/TICR

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TICR

These scripts can be utilized to perform highly parallelized concordance analyses on any given alignment, with a particular focus on very large datasets which may include dozens of taxa and may span entire chromosomes or genomes. If you use the method and for details, please cite

  • Stenz et al (2015). Exploring tree-like and non-tree-like patterns using genome sequences: An example using the inbreeding plant species Arabidopsis thaliana (L.) Heynh. Systematic Biology, 64(5):809-823. doi: 10.1093/sysbio/syv039
  • addendum describing a modification to the model in the original TICR test.

The goal of this pipeline is to

  1. analyze a very large alignment (many sites and many taxa) allowing for gene tree discordance due to recombination, to estimate a population tree. The tools used for this step are

    • MDL to delimit loci
    • MrBayes to perform individual gene analyses
    • BUCKy to estimate quartet concordance factors
    • Quartet MaxCut to estimate a binary population tree

    For an alternative option to perform MrBayes and BUCKy on a cluster that has a job scheduler, see here.

  2. test panmixia, the adequacy of the coalescent on this binary population tree, and estimate a partially resolved tree with current or ancestral panmictic populations. This test is available in the R package phylolm or here:

    • TICR test. Jump to:

Dependencies

  1. PAUP*
    • Used to perform quick parsimony analyses on all possible breakpoints to detect shifts in tree topology using MDL.
  2. mdl (included in source)
    • Places breakpoints in a given sequence in such a way that each partition has homologous tree topology.
  3. MrBayes
    • Used to generate posterior distributions for each partition generated by MDL.
  4. mbsum and BUCKy
    • mbsum: Summarizes the output from MrBayes into the required format for BUCKy.
    • BUCKy: Quantifies observed discordance between genes.
  5. Quartet MaxCut
    • Used to generate a population tree based on concordance factors of four-taxon sets present in the dataset.
  6. R
    • Used to read and draw Newick trees, also used to check for tree incongruence.
    • also requires the "ape" R library to be installed.

Important Notes

  • In order to function properly, any user-specified MrBayes block MUST allow logging to terminal.

  • For "accurate" MCMC convergence checks, the user must ensure that their specified diagnose frequency divides evenly into their specified total number of MCMC generations.

  • When specifying a machine file (i.e. --machine-file filename.txt) to outsource jobs run by mdl.pl, mb.pl, or bucky.pl, passwordless login via ssh MUST be enabled for each machine listed in the machine file. For instance, if the machine file contains the following text:

noah@hostA nstenz@hostB hostC ```

The user should be able to use ssh with each line as the only given option to the ssh command. In other words, typing the following into their terminal:

```

ssh noah@hostA ssh nstenz@hostB ssh hostC ```

should result in the user sshing to hostA, then hostB, and finally hostC without having to answer any command line prompts. Instructions for setting up passwordless ssh logins can be found [here](http://www.linuxproblem.org/art_9.html).

Supported Operating Systems

This script was tested and built on a cluster running Red Hat Enterprise Linux Server release 6.6 (Santiago), it should work for most other Linux distros, and might work on MacOS.

Pipeline overview

  1. mdl.pl
    • This script handles the parallelized execution of MDL. Upon successful completion, this script will have generated a gzipped tarball containing the sequence data of each partition found by MDL. This tarball is used as the input for the next script.
  2. mb.pl
    • This script handles the parallelized execution of MrBayes on all partitions located in the tarball generated by mdl.pl. Upon successful completion, this script will have generated a tarball containing the output from MrBayes for all partitions located in the input tarball. This tarball is used as the input for the next script.
  3. bucky.pl
    • This script handles the parallelized execution of BUCKy using the output from MrBayes stored in the tarball generated by mb.pl. Upon successful completion, this script will have generated a tarball containing the output from BUCKy for each possible quartet in the input sequence file. A summary of all concordance factors for each quartet will also be output in a csv file for user convenience.
  4. get-pop-tree.pl
    • This script utilizes the csv generated by bucky.pl to create a population tree for the input dataset. Upon successful completion, the script will generate a Newick tree representing the supported population tree.
  5. getTreeBranchLengths.r
    • This script takes the population tree generated by get-pop-tree.pl as well as the csv generated by bucky.pl to calculate branch lengths for the population tree in coalescent units.
  6. TICR.r
    • This script takes the csv file with all quartet concordance factors created by bucky.pl and the population tree with branch lengths created by getTreeBranchLengths.r to:
      • test panmixia
      • test the binary population tree
      • search for the best unresolved version of this population tree and test its goodness-of-fit.

mdl.pl

Script Work Flow

This script begins by reducing the given input alignment to only parsimony-informative characters using the program PAUP*. Once all parsimony-informative sites have been determined, PAUP* is used to calculate the parsimony score of every potential breakpoint which could occur in the input alignment. Using these parsimony scores, a Minimum Description Length criterion as implemented in the program mdl is utilized to determine where changes in tree topology across the alignment have occured (for a more in-depth overview of the principles behind MDL, read the following paper). This effectively splits the input alignment into partitions of sequence with homogeneous topology ("recombinational genes"). The start and end indices of these sequences are then converted back into their corresponding indices in the full alignment (non-parsimony informative characters included) which the user originally input. Non-parsimony informative characters located between two partitions are divided equally between them.

Usage

At the minimum, this script requires a single FASTA or NEXUS file containing the alignment of interest. Additionally, the length of the smallest possible partition which MDL can find must also be specified with -b or --block-size option. The script can then be called with the following invocation:

mdl.pl input.fa -b 100

This will then proceed to breakdown the alignment contained in input.fa into partitions having homologous topology and length of at least 100 parsimony-informative characters (potential output alignments would contain could potentially contain any multiple of 100 parsimony-informative characters up to the total number of parsimony-informative characters present in the alignment).

For very large alignments (technically alignments with many parsimony-informative characters), such as those that would be expected from chromosomes or genomes, the number of parsimony informative calculations required to run MDL to completion using the above method becomes unfeasibly large. Therefore, it is recommended to use the -f or --forced-break option like so:

mdl.pl input.fa -b 100 -f 10000

Including this option as seen above would force a breakpoint after every 10,000th parsimony-informative character in the input alignment. Each block of 10,000 parsimony-informative characters would then have MDL run on it independently. The final partitioning is then a concatenation of the independent MDL analyses on the forced breakpoints.

Command Line Options

The following table describes the support command-line optionswhich can be specified to influence the script execution:

Option Flag(s) Option Description Default
‑b, ‑‑block‑size sets the minimum number of parsimony-informative characters which can be found in a block none
‑f, ‑‑forced‑break forces a breakpoint after the specified number of parsimony-informative characters, these partitions are then run independenly none
‑‑gap‑as‑char specify to treat gaps as informative characters in PAUP* parsimony analyses gaps not informative
‑‑machine‑file file name containing hosts to ssh onto and perform analyses on, passwordless login MUST be enabled none
‑‑port specifies the port to utilize on the server 10001
‑o, ‑‑out‑dir name of the directory to store output files in "mdl-" + Unix time of script invocation
‑T, ‑‑n‑threads the number of forks ALL hosts running analyses can use concurrently current number of free CPUs
‑h, ‑‑help display help and exit N/A

Output Files

Given an input alignment named 'chromosome1.fa', the following files can be found in the output directory upon successful completion of the script:

  • chromosome1-reduced.tar.gz: zipped tarball containing the parsimony-informative character alignments for each forced breakpoint
  • chromosome1-scores.tar.gz: zipped tarball containing the parsimony scores calculated for each possible breakpoint in each forced partition (or the whole alignment if no forced partitioning was used)
  • chromosome1-partitions.tar.gz: zipped tarball containing the MDL output for each foreced partition of the alignment (or the whole alignment if no forced partitioning was used)
  • chromosome1.tar.gz: contains the actual alignments which resulted from the MDL partitioning
  • chromosome1-stats.csv: contains MDL partition start and end indices in terms of both the full and parsimony-informative character alignment, can be useful for determining average partition length, as well as the average number of parsimony-informative and non-parsimony informative characters present in each partition
  • symlink to specified input file

mb.pl

Script Work Flow

This script begins by appending the user-specified MrBayes block to each Nexus file representing an MDL partition located in the input tarball created by mdl.pl. Once this is complete, the script proceeds to run MrBayes on each MDL partition. The results for each partition are then tarballed and gzipped. The tarballs for each partition are then aggregated into a single tarball.

Usage

This script can be used to perform the parallelized execution of many MrBayes runs. Additionally, it can also be used to perform rudimentary MCMC convergence checks and remove non-convergent partitions.

Running MrBayes

When running MrBayes, this script utilizes the zipped tarball created by mdl.pl as its input. For example, if the following command was used to partition an alignment:

mdl.pl chromosome1.fa -b 100 -f 10000 -o chr1-mdl

the required input for mb.pl would be named 'chromosome1.tar.gz'. Additionally, this script requires a file containing a properly formated MrBayes block (specified with -m or --mb-block flag). This MrBayes block contains the actual commands that will dictate how the script calls MrBayes. The following represents an example invocation:

mb.pl chr1-mdl/chromosome1.tar.gz -m bayes.txt -o chr1-mb

If the above command was for some reason cancelled, interrupted, or succesfully completed but perhaps had its output tarball modified using the --remove option, it is also possible to rerun only the needed partitions by specifying the previous output directory of the script as input like so:

mb.pl chr1-mb -m bayes.txt 

MCMC Convergence

The terminal output of each MrBayes run is by default stored by this script, if the original MCMC analysis were run with the following command:

mb.pl chr1-mdl/chromosome1.tar.gz -m bayes.txt -o chr1-mb

this information, as well as the resulting tree and parameter files for each MrBayes analysis, would be expected to be located in a tarball named 'chromosome1.mb.tar'.

This output tarball can be parsed to determine the final standard deviation of split frequencies and roughly estimate whether or not the MCMC analyses which they describe reached convergence. If the user simply wants to check which partitions meet a specific threshold, they can use the -c or --check option like so:

mb.pl chr1-mb -c 0.05

The script will then proceed to output the final recorded standard deviation of split frequencies found in the log file of the MrBayes analysis for the partition and whether or not it met the specified threshold.

If the user wishes to remove partitions below a specific threshold (perhaps as a precursor to rerunning MrBayes with altered settings), they can use the -r or --remove option:

mb.pl chr1-mb -r 0.05

Command Line Options

The following table describes the support command-line optionswhich can be specified to influence the script execution:

Option Flag(s) Option Description Default
‑m, ‑‑mb‑block text file containing MrBayes commands to append to each input partition none
‑c, ‑‑check outputs how many MrBayes runs standard deviation of split frequencies have reached the specified threshold N/A
‑r, ‑‑remove removes MrBayes runs with standard deviation of split frequencies below the specified threshold N/A
‑‑machine‑file file name containing hosts to ssh onto and perform analyses on, passwordless login MUST be enabled none
‑‑port specifies the port to utilize on the server 10002
‑o, ‑‑out‑dir name of the directory to store output files in "mb-" + Unix time of script invocation
‑T, ‑‑n‑threads the number of forks ALL hosts running analyses can use concurrently current number of free CPUs
‑h, ‑‑help display help and exit N/A

Output Files

Given an input partition tarball named 'chromosome1.tar.gz', the following files can be found in the output directory upon successful completion of the script:

  • chromosome1.mb.tar: contains all output from the MrBayes analysis of each partition contained in 'chromosome1.tar.gz'
  • symlink to specified input file

bucky.pl

Script Work Flow

This script begins by summarizing the MrBayes output files for each individual partition using the program mbsum. Once all posteriers for all partitions have been summarized, this script calculates all the possible quartets for the given alignment and runs each one through BUCKy do determine the concordance factors for the three possible splits.

Usage

At the minimum this script only requires the tarball created by mb.pl as its input. For example, if the you used the following command to run MrBayes:

mb.pl chr1-mdl/chromosome1.tar.gz -m bayes.txt -o chr1-mb

the required input for bucky.pl would be located in chr1-mb/ and named 'chromosome1.mb.tar'. The following represents an example invocation:

bucky.pl chr1-mb/chromosome1.mb.tar -o chr1-bucky

If the above command was for some reason cancelled or interrupted, it is also possible to rerun only the needed quartets by specifying the previous output directory of the script as input like so:

bucky.pl chr1-bucky 

Command Line Options

The following table describes the support command-line options which can be specified to influence the script execution:

Option Flag(s) Option Description Default
‑a, ‑‑alpha value of alpha to use when running BUCKy, use "infinity" for infinity 1
‑n, ‑‑ngen number of generations to run BUCKy MCMC chain 1,000,000 generations
‑‑machine‑file file name containing hosts to ssh onto and perform analyses on, passwordless login MUST be enabled none
‑‑port specifies the port to utilize on the server 10003
‑o, ‑‑out‑dir name of the directory to store output files in "bucky-" + Unix time of script invocation
‑T, ‑‑n‑threads the number of forks ALL hosts running analyses can use concurrently current number of free CPUs
‑s, ‑‑no‑mbsum informs the script that the input is a tarball containing output already parsed by mbsum off
‑h, ‑‑help display help and exit N/A

Output Files

Given an input tarball containing MrBayes output named 'chromosome1.mb.tar', the following files can be found in the output directory upon successful completion of the script:

  • chromosome1.mbsum.tar.gz: contains the run output files produced during the summary of each MrBayes run with the program mbsum
  • chromosome1.BUCKy.tar: contains the output files generated when running BUCKy on each quartet
  • chromosome1.CFs.csv: contains the concordance factors for each possible split of each possible quartet as well as their 95% confidence intervals
  • symlink to specified input file

get-pop-tree.pl

Script Work Flow

This script estimates a binary population tree using Quartet MaxCut. It begins by reading in the csv created by bucky.pl. For each four-taxon set, the split with the highest corresponding concordance factor (the primary split) is the selected for input for the program Quartet MaxCut. In the case where there is a tie for the primary split, all of these tied splits will be output to a file, which is a fairly unlikely although not impossible scenario.

Usage

The command line option required for running this script is the name of the quartet csv file that the user desires to create a population tree from. For instance, if bucky.pl had been run with the following command:

bucky.pl chromosome1.mb.tar -o chr1-bucky

The population tree could be generated by utilizing the following command:

get-pop-tree.pl chr1-bucky/chromosome1.CFs.csv

Output Files

Given an input concordance fator csv named 'chromosome1.CFs.csv', the following files will be created:

  • chromosome1.QMC.txt: contains the splits utilized by Quartet MaxCut which were used to determine the population tree.
  • chromosome1.QMC.tre: contains the actual tree (in Newick format) generated by Quartet MaxCut.

getTreeBranchLengths.r

Script Work Flow

This R script takes a binary population tree and the csv file containing the quartet concordance factors to fit branch lengths in coalescent units. For each edge, it determines the 4-taxon sets whose internal branch span this edge and this edge only, calculates the mean concordance factor (CF) of the quartet displayed in the tree across these 4-taxon sets, then transforms this average major CF into coalescent units using the coalescent process assumption: length = - log(3/2 * (1-CF)).

Usage

If bucky.pl and get-pop-tree.pl were run as shown above, the CF data should first be copied to the main example directory:

cp chr1-bucky/chromosome1.CFs.csv .

and then getTreeBranchLengths.r can be run from the command line:

Rscript --vanilla getTreeBranchLengths.r chromosome1

This script requires that R and the "ape" package have already been installed.

Output Files

Given an input concordance factor csv named 'chromosome1.CFs.csv' and an input tree topology named 'chromosome1.QMC.tre':

  • chromosome1.QMClengths.tre: contains a tree (in Newick format) with the same topology as the input tree, but including branch lengths in coalescent units.
  • chromosome1.QMClengths.pdf: plot of this tree showing its topology, branch lengths in coalescent units (above branches) and average quartet concordance factors (below branches).
  • chromosome1.CFbyEdge.csv: table with one edge per row, listing the estimated edge length, the average major quartet CF on this edge and the number of 4-taxon sets used to calculate this CF.

TICR.r

All the R functions and data examples used here are available in the R package phylolm. These functions are test.one.species.tree (to test panmixia or a single tree) and stepwise.test.tree (to estimate a partial tree with periods of panmixia).

Script Work Flow

This script is provided for convenience to apply the test of panmixia, binary tree, and partial tree on a given data set. It takes in the csv file with the quartet concordance factors created by bucky.pl, and the tree created by get-pop-tree.pl (with branch lengths fitted using getTreeBranchLengths.r).

Usage

If bucky.pl, get-pop-tree.pl and getTreeBranchLengths.r were run as shown above, the CF data and the binary population tree with branch lengths should be in the main example directory, named 'chromosome1.CFs.csv' and 'chromosome1.QMClengths.tre'. Then TICR.r can be run from the command line:

Rscript --vanilla TICR.r chromosome1

This script requires that R and the "ape" package have already been installed. It will use the R functions in 'testingTreeWithQuartetCF.r', which are placed in the 'ticr' directory.

Output files

Given an input concordance fator csv named 'chromosome1.CFs.csv' and guide tree (with branch lengths in coalescent units) named 'chromosome1.QMClengths.tre', the following files will be created:

  • chromosome1.ticr.txt: contains the results of the tests of panmixia, binary tree, partial tree search, and test of that partial tree.
  • chromosome1.ticr.pdf: contains a pdf showing the partial tree with edges reduced to zero to show current or past panmictic populations.

Example of TICR tests in other situations

The R functions provided in 'ticr/testingTreeWithQuartetCF.r' can be used on any quartet concordance factor data and guide tree. For instance, one might get quartet CFs from multiple genes analyzed with RAxML. For each set of 4 taxa, one might drop any gene whose quartet tree is supported with less than 80% bootstrap (say), and calculate concordance factors based on the remaining genes for this 4-taxon sets. One might get the guide tree from running ASTRAL on all the full gene trees, for instance, rather than Quartet MaxCut on the major quartets. To perform the TICR tests to such data and guide tree, and to follow-up with the identification of taxa responsible for outlier quartets, see 'example.r' in the 'ticr/' directory.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published