GitHub - vizkidd/stop_codon_plants: A pipeline to estimate the rate of purifying selection acting on stop codons of ortholog clusters from OrthoDB. (Based on https://github.com/cseoighe/StopEvol)

Evolutionary selective constraints acting on the stop codon across land plants

Motivation

All genomes are under evolutionary pressure to preserve the functional portion of their DNA. How strongly a coding sequence is constrained is commonly summarized by the ratio of the non-synonymous (dN) to the synonymous (dS) substitution rate, dN/dS. Synonymous substitutions do not alter the encoded amino acid, while non-synonymous substitutions do. This makes synonymous changes largely neutral and non-synonymous changes more likely to be harmful. When dN/dS < 1, the synonymous rate exceeds the non-synonymous rate and the gene is said to be under purifying selection. Purifying selection favors synonymous over non-synonymous substitutions, thereby preventing changes to the amino acid residue at a given position.

Conventional models of substitution account only for the sense (non-stop) codons and omit the nonsense (stop) codons, because they do not contribute to amino acid changes. However, stop codons function with varying efficiencies: they can be read through, which can alter the final protein product. Combined with other mechanisms such as ribosome stalling and mRNA regulation, stop codons can indirectly modulate protein synthesis. This gives meaning to stop codon preservation and substitution and creates the need to include stop codons in standard models of substitution. Stop codons have a low probability of undergoing mutation, but the selective pressure acting on their rate of substitution has been only vaguely addressed.

Seoighe et al. introduced a model that incorporates stop codons into the general Muse & Gaut substitution model. The model is built on the assumption that stop codons are also under selection pressure and co-evolve with the genes. The extended Muse & Gaut model, informally called the extMG model, has been applied to mammalian orthologous sequences, and 50% of the genes were found to be under purifying selection.

In this study, the extMG model of substitution, which covers all 64 codons, is used to estimate phi (the rate of substitution between stop codons) for plant ortholog families under the Viridiplantae clade.

Installation

Requires

Download the archive as a zip and extract it.
The OrthoDB flat files described under Flat Files must be in the same directory as the extracted files.

NOTE:

The files are from OrthoDB version 10, but any other compatible version with the same file structure can be used.
Make sure to extract the flat files and rename them to the names listed under Flat Files.

File Descriptions

Flat Files

levels.tab - File with NCBI taxonomic nodes [first column] and node names [second column].
species.tab - Contains organism IDs [second column] and organism names [third column].
OGs.tab - Contains information about orthologous groups (OG), OG ID [first column] & OG name [last column]. OG ID has a format of [cluster_ID]at[taxa_node]
levels2species.tab - Connects NCBI taxa node [first column] to organism IDs (same as species IDs) [second column] and also provides information about number of hops & NCBI taxonomic levels [last column]
OG2genes.tab - Connects OGs to genes. Contains OG IDs [first column] & gene IDs [last column]. Each gene ID is of the format [organism_ID]:[gene_ID]. (This can be used to accumulate clusters based on organisms or genes.)
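
The two ID formats described above can be taken apart with a few lines of Python. This is a minimal sketch and not code from the repository: the function names and the sample IDs are hypothetical, and it assumes the `[cluster_ID]at[taxa_node]` and `[organism_ID]:[gene_ID]` layouts hold exactly as described, with the OG ID in the first column and the gene ID in the last.

```python
from collections import defaultdict


def parse_og_id(og_id):
    """Split an OG ID of the form [cluster_ID]at[taxa_node]."""
    cluster_id, sep, taxa_node = og_id.partition("at")
    if not sep or not cluster_id or not taxa_node:
        raise ValueError(f"unexpected OG ID: {og_id!r}")
    return cluster_id, taxa_node


def parse_gene_id(gene_id):
    """Split a gene ID of the form [organism_ID]:[gene_ID]."""
    organism_id, sep, gene = gene_id.partition(":")
    if not sep or not organism_id or not gene:
        raise ValueError(f"unexpected gene ID: {gene_id!r}")
    return organism_id, gene


def genes_by_organism(og2genes_lines):
    """Group gene IDs by OG, then by organism, from OG2genes.tab-style lines."""
    clusters = defaultdict(lambda: defaultdict(list))
    for line in og2genes_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        og_id, gene_id = fields[0], fields[-1]
        organism_id, _ = parse_gene_id(gene_id)
        clusters[og_id][organism_id].append(gene_id)
    return clusters


# Hypothetical rows, made up for illustration only.
sample = [
    "1000at33090\t3702_0:000001",
    "1000at33090\t3702_0:000002",
    "1000at33090\t4577_0:00000a",
]
clusters = genes_by_organism(sample)
print(parse_og_id("1000at33090"))
print(sorted(clusters["1000at33090"]))
```

Grouping by organism first makes it easy to see which organisms contribute more than one gene to a cluster, which is where a single sequence per organism has to be chosen later.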

Scripts

Main scripts are

START.sh - Starts the pipeline. Selects clusters based on the organism_ID and cluster_ID provided. Passes the clusters one at a time to download_data.sh.
download_data.sh - Downloads the CDS of the gene_IDs in each cluster and selects one sequence for each organism.
process_data.sh - Processes the data. Applies the extended Muse & Gaut model.

Run these after the clusters are downloaded (or if cluster download is stopped)

MIXMOD_BOOTSTRAP.sh - Applies the mixture model and performs bootstrapping

The overall flow including the misc scripts goes like this

NOTE: Scripts for PhytoMine and Ensembl have also been provided; check Misc/

Usage & Flow

The pipeline is started with the START.sh script:

sh START.sh <organism_ID> <upper_limit_scale> <cluster_ID> <min_orgs>

*ALL are required*
- <organism_ID> - NCBI Tax ID of the organisms you want to analyze
- <upper_limit_scale> - maximum number of sequences in each cluster
- <cluster_ID> - NCBI Tax ID of the clusters you want to analyze
- <min_orgs> - minimum number of organisms you want in a cluster; use 0 to include all clusters

e.g.,

sh START.sh 3193 10 33090 20
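
As a rough picture of how the `<cluster_ID>`, `<upper_limit_scale>` and `<min_orgs>` arguments could narrow down clusters, the sketch below filters OG2genes.tab-style rows. It is an illustration under stated assumptions, not the logic in START.sh: `select_clusters` and the sample rows are hypothetical, `<upper_limit_scale>` is read literally as a cap on sequences per cluster, and restricting organisms by `<organism_ID>` is left out.

```python
from collections import defaultdict


def select_clusters(og2genes_lines, taxa_node, min_orgs, max_seqs):
    """Return OG IDs at `taxa_node` that pass both size filters."""
    genes = defaultdict(set)
    for line in og2genes_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        og_id, gene_id = fields[0], fields[-1]
        # OG IDs look like [cluster_ID]at[taxa_node].
        if og_id.endswith(f"at{taxa_node}"):
            genes[og_id].add(gene_id)

    selected = []
    for og_id, gene_ids in genes.items():
        # The organism ID is the part of the gene ID before the colon.
        organisms = {g.split(":", 1)[0] for g in gene_ids}
        if len(organisms) >= min_orgs and len(gene_ids) <= max_seqs:
            selected.append(og_id)
    return sorted(selected)


# Hypothetical rows, made up for illustration only.
sample = [
    "1at33090\t3702_0:0001",
    "1at33090\t4577_0:0002",
    "1at33090\t39947_0:0003",
    "2at33090\t3702_0:0004",
    "3at3193\t3702_0:0005",
]
print(select_clusters(sample, "33090", min_orgs=2, max_seqs=10))
```

With `min_orgs=0` the organism filter always passes, which matches the "0 if you want to include all clusters" convention described above.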

NOTE: The scripts were written for Slurm. If you don't have Slurm, modify the sbatch lines in START.sh, download_data.sh and process_data.sh so that the command is called directly, without sbatch.

NOTE: Both the organism and the cluster IDs can be selected from the first column of the levels.tab file.
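
Looking up an ID by node name in levels.tab could be sketched as follows. This is a hedged example rather than part of the pipeline: `find_tax_ids` is a hypothetical helper, and the sample rows are trimmed to the two columns described under Flat Files (the real file may carry more).

```python
import csv


def find_tax_ids(lines, name):
    """Return the taxonomic node IDs (first column) whose node name (second column) equals `name`."""
    reader = csv.reader(lines, delimiter="\t")
    return [row[0] for row in reader if len(row) >= 2 and row[1] == name]


# Two-column sample in the layout described for levels.tab.
sample = "3193\tEmbryophyta\n33090\tViridiplantae\n".splitlines()

print(find_tax_ids(sample, "Embryophyta"))
print(find_tax_ids(sample, "Viridiplantae"))
```

Against the real file, the same function can be given an open file handle, for example `with open("levels.tab", newline="") as fh: find_tax_ids(fh, "Viridiplantae")`.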

For this study, organisms in Embryophyta and the clusters in Viridiplantae are selected.

The pipeline performs the following steps.

For each cluster:
- Downloads CDS (coding sequences) from NCBI for the cluster.
- Selects one sequence for each organism in the cluster.
- Applies the extended Muse & Gaut model, which estimates kappa, omega, phi & treescale for the cluster.

For all clusters:
- Applies the mixture model.
- Performs bootstrapping.

On average, each sequence takes 4 to 5 seconds to download, because requests are spaced out to respect NCBI's query limits.
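
Spacing out requests like this can be sketched with a simple throttled loop. This is an assumption-laden illustration, not the code in cds_from_ncbi.py: `download_all` and `fetch` are hypothetical names, and the 4-second default only mirrors the figure quoted above rather than a value read from the scripts.

```python
import time


def download_all(gene_ids, fetch, min_interval=4.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Call fetch(gene_id) for each ID, keeping at least
    `min_interval` seconds between the starts of consecutive requests."""
    results = {}
    last_start = None
    for gene_id in gene_ids:
        if last_start is not None:
            # Wait only for the part of the interval not already spent.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results[gene_id] = fetch(gene_id)
    return results


# Stand-in fetch function; a real one would query NCBI.
def fake_fetch(gene_id):
    return f">{gene_id}\nATGTAA"


# min_interval=0.0 here only so the demo finishes immediately.
print(download_all(["g1", "g2"], fake_fetch, min_interval=0.0))
```

Passing `sleep` and `clock` as parameters keeps the throttling logic testable without actually waiting.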

Table of Contents

Motivation

Installation

Requires

File Descriptions

Flat Files

Scripts

Usage & Flow

Output
