
Analyze how gaps affect the detection of relaxed selection by 'RELAX' in terms of number of gaps, length of gaps and number of species.


Efficiency-of-RELAX/RELAX-on-CYP8B1

 
 


Effect of gaps in detecting Relaxed Selection by RELAX

The objective of this assignment is to analyze how gaps and the number of species affect the efficiency with which RELAX detects relaxed selection, in terms of:

  • number of gaps
  • length of gaps
  • number of species

What is Relaxed Selection ?

Relaxed selection is a selective phenomenon that occurs when selective pressures are either eliminated or reduced.

  • Biotic sources:
    • Elimination of predation
    • Elimination of pathogens
  • Abiotic sources:
    • Changes in light, temperature or water
    • Changes in soil or mineral composition

Alternatives to Relaxed selection.

Selection of any form (balancing, directional, etc.) can be relaxed. Lahti et al. 2009 give the following example.

   A hypothetical environmental change from an ancestral condition, with possible outcomes (a) to (e).

Detection of selection in Genomic data.

Signatures of selection can be detected from the DNA sequences of organisms. The past fifty-six years have seen the development and application of numerous statistical methods to identify genomic regions that appear to have been shaped by natural selection. The idea of natural selection is based on the simple observation that some traits enhance fitness.

                   The way in which selection becomes observable and quantifiable.

Given that selection operates at the level of the phenotype, alleles showing evidence of selection are likely to be of functional relevance. There are several approaches available to detect selection at the macroevolutionary scale.

What these methods do is:

  • Identify sequences that are likely to be functional (coding or conserved).
  • Then search for lineage-specific accelerations in the rate of evolution.

Such accelerations are indicated by an excess of substitutions relative to the baseline mutation rate, which can be calculated from the number and rate of synonymous mutations.
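
As a rough illustration of this comparison, the sketch below computes the ratio ω = dN/dS from counts of nonsynonymous and synonymous substitutions and sites. The function name and the counts are hypothetical, and no correction for multiple substitutions at the same site is applied, so this is only a simplified picture of what codon models estimate.

```python
def omega(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Return dN/dS from raw counts (no multiple-hit correction)."""
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    d_n = nonsyn_subs / nonsyn_sites  # nonsynonymous substitutions per site
    d_s = syn_subs / syn_sites        # synonymous substitutions per site (baseline)
    if d_s == 0:
        raise ValueError("dS is zero, so omega is undefined")
    return d_n / d_s


# Hypothetical counts, chosen only for illustration.
w = omega(nonsyn_subs=10, nonsyn_sites=700, syn_subs=20, syn_sites=300)
print(f"omega = {w:.3f}")
```

Values of ω well below 1 are usually read as purifying selection, values near 1 as neutral evolution, and values above 1 as an excess of nonsynonymous change.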


Method to use ?

Here we used a general hypothesis testing framework called RELAX from the HyPhy package. HyPhy (Hypothesis Testing using Phylogenies) is an open-source software package for analyzing genetic sequences and inferring natural selection using techniques from:

  • phylogenetics
  • molecular evolution
  • machine learning

HyPhy distributes a variety of methods for inferring the strength of natural selection from genetic data. Among its branch-based methods for detecting selection is RELAX.

The Decision tree : to find the appropriate method for detecting the molecular process of interest. 

RELAX is a hypothesis testing framework that asks whether the strength of natural selection has been relaxed or intensified along a specified set of test branches.
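
RELAX frames this question through a selection intensity parameter k: each ω class fitted on the reference branches is raised to the power k on the test branches (Wertheim et al., 2015). With k < 1 the ω values are pulled toward 1 (relaxation), and with k > 1 they are pushed away from 1 (intensification). The sketch below illustrates that transformation and a likelihood ratio test p-value, assuming a chi-squared distribution with one degree of freedom. The ω values, log-likelihoods and function names are made up for illustration and are not output from HyPhy.

```python
import math


def omegas_on_test_branches(reference_omegas, k):
    """Apply omega_test = omega_reference ** k to each rate class."""
    return [w ** k for w in reference_omegas]


def lrt_p_value(lnl_null, lnl_alt):
    """P-value of a likelihood ratio test with 1 degree of freedom.

    The null model fixes k = 1; the alternative model lets k vary.
    """
    lrt = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of chi-squared with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(lrt / 2.0))


reference = [0.1, 0.8, 4.0]  # hypothetical omega classes on reference branches
print("k = 0.5:", omegas_on_test_branches(reference, 0.5))  # closer to 1
print("k = 2.0:", omegas_on_test_branches(reference, 2.0))  # further from 1
print("p =", lrt_p_value(lnl_null=-1000.0, lnl_alt=-998.08))
```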


What are Gaps ?

Here the objective is to analyze the effect of gaps on detecting relaxed selection. Essentially, a gap is a stretch of missing or unresolved sequence that goes beyond simple mis-sequencing.

Here are some types of genome assembly gaps from Chaisson et al., 2015 :

(a) Sequence-coverage gaps: absence or reduction in sequence reads at that location.

(b) Segmental duplication-associated gaps: high sequence identity makes read overlaps ambiguous.

(c) Satellite-associated gaps: higher-order tandem arrays of repetitive sequence cause read 'pileups'.

(d) Muted gaps: Contracted assembly relative to true genome.

One of the first problems anyone doing sequencing has to tackle is determining the source of a gap: a sequencing or alignment error versus an actual indel in the DNA. In such cases we have to minimize both false positives (type I errors) and false negatives (type II errors).


Objective :

The presence of gaps can lead to several problems and ambiguities in assembly or alignment, and hence in downstream analysis. These could lead to misinterpretation of the biology behind the data being analyzed. Here we analyze how the presence of gaps affects one particular downstream analysis: inference of the strength of natural selection.

Two approaches to do this:

(1) Using a gene known to be under relaxed selection.

(2) Using simulated data.

In both cases we mask varying parts of the sequence as gaps and analyze the p-values and k values inferred by RELAX. To mask the sequence we used two bedtools commands:

  • random - generate a random set of intervals.
  • maskfasta - masks sequences based on intervals.

                      How the bedtools 'random' and 'maskfasta' commands work.
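
To make the masking step concrete, here is a minimal pure-Python sketch of the same idea: draw a set of random fixed-length intervals, then overwrite those positions with a masking character. It does not call bedtools, and the function names, the seed handling and the default mask character `-` are assumptions made for illustration (bedtools maskfasta itself masks with `N` unless told otherwise).

```python
import random


def random_intervals(seq_len, n_gaps, gap_len, seed=None):
    """Draw n_gaps intervals of length gap_len as 0-based, half-open (start, end)."""
    if gap_len > seq_len:
        raise ValueError("gap length exceeds sequence length")
    rng = random.Random(seed)
    starts = [rng.randint(0, seq_len - gap_len) for _ in range(n_gaps)]
    # Intervals are drawn independently, so they may overlap.
    return sorted((s, s + gap_len) for s in starts)


def mask_sequence(seq, intervals, mask_char="-"):
    """Return seq with every position inside an interval replaced by mask_char."""
    chars = list(seq)
    for start, end in intervals:
        chars[start:end] = [mask_char] * (end - start)
    return "".join(chars)


orf = "ATG" * 20  # hypothetical 60 bp sequence
gaps = random_intervals(len(orf), n_gaps=3, gap_len=6, seed=1)
print(gaps)
print(mask_sequence(orf, gaps))
```

Varying `n_gaps` and `gap_len` gives the two gap dimensions studied here. Because intervals can overlap, the total number of masked positions can be smaller than `n_gaps * gap_len`.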

For the first approach we chose CYP8B1, a gene that Shinde et al., 2019 found and verified to be under strong relaxed selection in some mammals and birds (both of which belong to the clade Amniota).

                            A brief overview of the CYP8B1 gene and protein

CYP8B1 is a single-exon gene that determines the ratio of primary bile salts. The loss of this gene has been linked to the lack of cholic acid in naked mole rats, elephants and manatees. Shinde et al., 2019 used CYP8B1 gene ORFs from more than 200 species of birds and mammals to look for signatures of relaxed selection.

               The taxonomic orders in the Shinde et al., 2019 study (~15 groups) are boxed in red.

The test for relaxed selection of the CYP8B1 gene in amniotes is carried out following the pipeline described in Shinde et al., 2019. The major steps used to detect relaxed selection are listed below; more detailed information is given in projects.


Workflow

Prerequisites :

  • PRANK (v.140603)
  • MUSCLE (v3.8.31)
  • MAFFT (v7.407)
  • CLUSTALW (2.0.12)
  • DAMBE (7.0.58)
  • bam-readcount (0.8.0)
  • MUMSA (1.0)
  • modeltest-ng
  • raxml-ng
  • HyPhy (2.3.14)


    Results

    Data is organised into the following folders

  • ORFs: Each file in this folder contains the complete open reading frame of the CYP8B1 gene, from the start codon through to the stop codon.
  • SAMs: Each file in this folder contains the results of an SRA blastn search against publicly available raw read data from the Sequence Read Archive (SRA).
  • MSAs: Each file in this folder contains the results of multiple sequence alignment of the ORF files using Guidance with PRANK, CLUSTALW, MAFFT or MUSCLE as the aligner.
  • gc_content: The GC content and GC deviation are calculated for each ORF in windows of size 100 with a step size of 10. The script plotGC_content.r is used to visualise these results.
  • scripts: The scripts used for performing the ORF validation, multiple sequence alignment, model testing, tree topology inference and tests for relaxed selection are provided. The contents of this folder (scripts and instructions), along with published software tools, should be sufficient to replicate all the results described in the manuscript.
  • relaxation_tests: Output files obtained after running the RELAX program implemented in the HyPhy package.
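
The sliding-window calculation described for the gc_content folder can be sketched as follows. This is an illustrative Python version, not the repository's own script: the function name is hypothetical, only full windows are reported, and GC deviation is left out because its exact definition is not given here.

```python
def gc_windows(seq, window=100, step=10):
    """Yield (start, gc_fraction) for each full window along seq."""
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        yield start, gc / window


# Hypothetical 200 bp sequence: a GC-only half followed by an AT-only half.
example = "GC" * 50 + "AT" * 50
for start, frac in gc_windows(example):
    print(start, frac)
```

Sequences shorter than the window produce no output, which keeps every reported fraction comparable across windows.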