abSENSE: a method to interpret undetected homologs

INTRODUCTION

Welcome! This is the public Git repository for abSENSE, a method that calculates the probability that a homolog of a given gene would fail to be detected by a homology search (using BLAST or a similar method) in a given species, even if the homolog were present and evolving normally.

The result of this calculation informs how one interprets the result of a homology search failing to find homologs of a gene in some species. One possibility to explain such a result is that the gene is actually absent from the genome in that species: a biological, and potentially interesting (e.g. if due to a gene loss or the birth of a new gene), result.

A second explanation, often ignored, is that the homolog is present in the genome of that species, but that the homology search merely lacks statistical power to detect it. Here, the apparent absense of the homolog is a technical/statistical limitation, and does not reflect underlying biology.

By calculating the probability that your homology search would fail to detect a homolog even if one were present and even if it were evolving normally (e.g. no rate accelerations on a specific branch, potentially suggestive of biologically interesting changes), abSENSE informs the interpretation of a negative homology search result. If abSENSE finds that there is a high probability of a homolog being undetected even if present, you may not be as inclined to invoke a biological explanation for the result: the null model of a failure of the homology search is sufficient to explain what you observe.

The method is explained in complete detail in the paper in which it's introduced. This is the paper that is referred to in the rest of this README:

Weisman CM, Murray AW, Eddy SR (2020) Many, but not all, lineage-specific genes can be explained by homology detection failure. PLoS Biol 18(11): e3000862. https://doi.org/10.1371/journal.pbio.3000862

There, it is applied to the specific case of lineage-specific genes, for which homologs appear absent in all species outside of a narrow lineage. The method itself is applicable to any case in which a homolog appears absent (e.g. a single species missing a homolog that one might interpret as a gene loss), and likewise, this code is applicable to all such cases.

In this repo, you will find:

a) Code that can perform this analysis using the required input data (see below)

b) All input data used in the fungal and insect lineages discussed in Weisman et al 2020 that can be used as input to the analysis code.

GUI WEBSITE AVAILABLE FOR SIMPLE ANALYSES

We have a GUI website that can perform basic single-gene analyses that don't require the advanced command line options available here (see below). If this is your case, you may find it easier to use than downloading and running this code on the command line.

The website is pre-loaded with the input data required to perform these analyses for all of the fungal and insect genes analyzed in Weisman et al 2020. It can also perform analyses on user-provided input data.

The website is available at: http://eddylab.org/abSENSE/

1. RUNNING abSENSE: THE BASICS

Quickstart: main analysis

The main analysis script, Run_abSENSE.py, calculates the probabilities of homologs of genes from some "focal" species being undetected in a set of N other species. It can perform this analysis for an arbitrary number of genes in the taxon at one time. It requires a minimum of two input files:

i) A file containing the bitscores of homologs of each gene to be analyzed in at least three of the species (including the focal species itself, so two others).

ii) A file containing the N evolutionary distances, in substitutions/site, between the focal species and each other species. The distance between the focal species and itself should be 0. (If you don't already have such distances, a description of how to calculate them relatively painlessly can be found in Weisman et al 2020).

Examples of both of these files for a subset of genes from S. cerevisiae and their orthologs 11 other fungal species (the same species analyzed in Weisman et al 2020) can be found in the folder Quickstart_Examples: the bitscore file is Fungi_Example_Bitscores, and the distance file is Fungi_Distances. They exemplify the formatting required for abSENSE to run (explained in more detail below).

To run abSENSE on a given bitscore and distance file, type:

python Run_abSENSE.py --distfile (NAME OF DISTANCE FILE) --scorefile (NAME OF SCORE FILE)

For example, to run abSENSE on the example fungal genes, type:

python Run_abSENSE.py --distfile Quickstart_Examples/Fungi_Distances --scorefile Quickstart_Examples/Fungi_Example_Bitscores

For each gene in the input bitscore file, the following will be computed:

a) The probabilities of a homolog being undetected in each species (in the file Detection_Failure_Probabilities);

b) The expected bitscores of homologs in each species (in the file Predicted_bitscores);

c) The 99% confidence interval around this bitscore in each species (low and high bounds listed in separate files: Bitscore_99PI_lowerbound_predictions and Bitscore_99PI_higherbound_predictions)

These results will be output to a set of tab-delimited files in a separate directory, by default named with the start time of the analysis. (You can specify the name with a command line option, see below). Additional information on output files is below.

Quickstart: visualization

A supplemental visualization script, Plot_abSENSE.py, performs the same analysis as above, but for one gene at a time, and also produces a visualization of the results (see Weisman et al 2020). It is run in the same way, except that it also requires specifying which single gene in the bitscore input file you wish to analyze.

To run abSENSE on gene GENEID contained in a given bitscore with a given distance file, type:

python Plot_abSENSE.py --distfile (NAME OF DISTANCE FILE) --scorefile (NAME OF SCORE FILE) --gene GENEID

For example, to analyze the S. cerevisiae gene Uli1, listed in the bitscore file unde its RefSeq ID (NP_116682.3), type:

python Plot_abSENSE.py --distfile Quickstart_Examples/Fungi_Distances --scorefile Quickstart_Examples/Fungi_Example_Bitscores --gene NP_116682.3

The same results as above will be computed, but now they will be output to the terminal, and then the visualiation will be shown.

2. ADVANCED OPTIONS

You can specify advanced options with the additional command line options. You can view them all by typing

python Run_abSENSE.py --help

They are:

--out: The prefix of the directory to which your results will be output. Default is the time at which you ran the analysis (to avoid overwriting of results).

--Eval: The E-value threshold to be used (above this value, homologs will be considered undetected). Default is 0.001 (fairly permissive).

--genelenfile: Allows you to specify a file containing the lengths (in aa) of all genes in the bitscore file to be analyzed. Default is 400 amino acids (~average protein size in many species) for all proteins.

abSENSE predicts a bitscore, which is then converted to an E-value to determine detectability; this conversation technically requires knowledge of both the size of the database in which the search occurs (see below) and the length of the gene being searched. Because the conversion between these values and E-value is logarithmic, though, only fairly large changes in these values substantially affect results.

Examples of such files containing the lengths of all S. cerevisiae and D. melanogaster genes can be found in Fungi_Data/S_cer_Protein_Lengths and Insect_Data/D_mel_Protein_Lengths. The format required is described more below.

--dblenfile: Allows you to specify a file containing the sizes (in aa) of the database in which the homology search for each of your N species is performed. Default is 400 amino acids * 20,000 amino acids / gene = 8,000,000 amimo acids (~average protein and proteome size in many species) for all species.

abSENSE predicts a bitscore, which is then converted to an E-value to determine detectability; this conversation technically requires knowledge of both the size of the database in which the search occurs and the length of the gene being searched (see above). Because the conversion between these values and E-value is logarithmic, though, only fairly large changes in these values substantially affect results.

Examples containing the lengths of all S. cerevisiae and D. melanogaster genes can be found in Fungi_Data/Fungi_Database_Lengths and Insect_Data/Insect_Database_Lengths. The format required is described more below.

--predall: Default is False. When True, Causes abSENSE to calculate the probability of homologs being undetected, the expected bitscores, and 99% confidence intervals not only in species in which homologs were actually undetected, but also for species in which homologs have been found. This is obviously not the main use case, and is especially uninformative when those homologs and their bitscores have been used in the prediction itself (see below). May potentially be useful to see if a homolog in one species, although detected, seems to be behaving anomalously compared to those in other species (eg due to rate acceleration).

--includeonly: Allows you to restrict the species whose bitscores are used in predicting bitscores in other species. Mainly useful to do control-type analyses, such as Figure 5 in Weisman et al 2020, to show that predictions made from only a subset of the data are nonetheless reliable. If not specified, abSENSE uses all available bitscores in the prediction. The format here should be a list of the species names as they are written in the distance and bitscore files, separated by commas and NO SPACES. For example, to include only data from the yeasts S. cerevisiae, S. paradoxus, and S. kudriavzevii in an analysis based on the files in Fungal_Data, you would type: --includeonly S_cer,S_par,S_kud

For example, to run an analysis on all S. cerevisiae proteins in the selected fungal species in which the lengths of each S. cerevisiae protein and the sizes of each species' search database (their annotated proteomes as indicated in the supplement of Weisman et al 2020) are specified:

python Run_abSENSE.py --distfile Fungi_Data/Fungi_Distances --scorefile Fungi_Data/Fungi_Bitscores --genelenfile Fungi_Data/S_cer_Protein_Lengths --dblenfile Fungi_Data/Fungi_Database_Lengths

The visualization script Plot_abSENSE.py takes all of the same options, with the exception of again requiring that the gene to be analyzed is specified by the --gene option, and also that the --genelenfile option is instead --genelen, after which should be entered an integer corresponding to the length of the gene. (With a single gene, it's hardly worth requiring a whole file: just give the number.)

3: OUTPUT FILES

Run_abSENSE.py outputs six output files. Examples resulting from running abSENSE on the provided insect and fungal data can be found in Fungi_Data/Fungi_abSENSE_Results/ and Insect_Data/Insect_abSENSE_Results/ respectively.

a) Detection_failure_probabilities

The central output of the program. For each gene in the analysis, this contains the predicted probability that a homolog in each species would be undetected at the specified E-value by a homology search, even if the homolog were present.

By default, this is only calculated in species in which the gene was not detected. Results for species in which homologs were detected are therefore listed as "detected". The setting --predall will calculate this value for all species, even those in which a homolog was in fact detected.