GitHub

#Mutation conservation across organisms and chemical modifier screens in ALS yeast

##Mutation Conservation

####Summary of current data Genes: 260

For latest data and figure see the folder named conservation_latest_data

Throughout all methods pathogenic mutations are on average more conserved than the average amino acid in the protein (Jensen Shannon Divergence p: 7.78E-03, Shannon Entropy p: 1.73E-05, Sum of Pairs p: 1.68E-13), and more conserved than benign mutations (Jensen Shannon Divergence p: 1.05E-26, Shannon Entropy p: 1.55E-17, Sum of Pairs p: 2.98E-27)

The average number of mutations per gene is 11.59459459, and 19% are benign. On average 5.24609375 mutations are fully conserved which make up about 3.4% of the fully conserved amino acids on the proteins.

Notably, the difference in conservation between pathogenic mutations and the average amino acid seems to be more significant when we have more mutations (pearson correlation: -0.361300), and the relationship appears to follow a power trend.

####Understanding the data

Currently the folders and spreadsheets are poorly organized but am working on making the data more presentable and easier to understand. On the current repository the folder conservation latest data includes a folder with images, plots.ipynb which is the ipython notebook used to generate the figures, a spreadhseet called plot data which is where the data for the figures comes from, and a spreadsheet called conservation_summary which is the data itself.

On the conservation_summary spreadsheet a lot of abreviations are used on the colums so am going to explain what every column contains

Gene: gene symbol. The ones that are coloured red are because some of the mutations listed on clinvar did not align with the protein sequence, the ones in grey are the ones that are listed more than once with different gene symbols and the ones that are green are genes that we found to be particularly good examples of the overall trend we're seeing with the data
JSD A: average Jensen Shannon divergence score throughout the entire protein
JSDP A: average Jensen Shannon divergence score for pathogenic mutations
JSDB A: average Jensen Shannon divergence score for benign mutations
SE A: average Shannon Entropy score throughout the entire protein
SEP A: average Shannon Entropy score for pathogenic mutations
SEB A: average Shannon Entropy score for benign mutations
SoP A: average Sum of Pairs score throughout the entire protein
SOPP A: average Sum of Pairs score for pathogenic mutations
SOPB A: average Sum of Pairs score for benign mutations
#m: number of mutations, the ones that are labeled red are also because the number of mutations found on clinvar was not the same as the number of mutations that actually aligned with the protein sequence
#bm: number of benign mutations, these are also included on the mutation count on the column before
%bm: percentage of mutations that are benign
#fcm: number of mutations that are fully conserved
%fcm: percentage of fully conserved mutations
%fcm/faa: the percentage of fully conserved amino acids in the protein that contain mutations
species: this column contains letters indicating the species in which the gene is found (h: human, m: mouse, z: zebrafish, d: drosophila, c: c. elegans, y: yeast)
JSD SD: Jensen Shannon Divergence Standard Deviation
JSDP SD: Jensen Shannon Divergence Pathogenic Standard Deviation
JSDB SD: Jensen Shannon Divergence Benign Standard Deviation
SE SD: Shannon Entropy Standard Deviation
SEP SD: Shannon Entropy Pathogenic Standard Deviation
SEB SD: Shannon Entropy Benign Standard Deviation
SoP SD: Sum of Pairs Standard Deviation
SoPP SD: Sum of Pairs Pathogenic Standard Deviation
SOPB SD: Sum of Pairs Benign Standard Deviation
JS & JSP p: p value for the difference between JSD A and JSDP A
JSP & JSB p: p value for the difference between JSP A and JSB A
SE & SEP p: p value for the difference between SE A and SEP A
SEP & SEB p: p value for the difference between SEP A and SEB A
SoP & SoPP p: p value for the difference between SOP A and SOPP A
SoPP & SoPB p: p value for the difference between SoPP A and SoPB A

Note: rows that are coloured purple were processed using Clustal Omega, rows that are white were processed using Clustal W2

On the iphyton notebook conservation is the same as %fcm on the spreadsheet

####(folder: yeast_replaceable_genes) This project started as a follow up to a paper from the Marcotte lab in which they replaced 414 essential yeast genes with their human counterparts and found that in about half of these strains the genes were "replaceable", meaning that the yeast was able to survive with the human version of the gene. The idea was then to find which of these genes were associated with Mendelian diseases and/or Monogenic diseases and based on sequence alignments between yeast, worm, fly, zebrafish, mouse and human, figure out if disease-causing mutations affect conserved or variable amino acid positions.

More background information here

As a way to test the pipeline I tried to figure out if disease causing mutations affect more conserved amino acids in NPC1, which is the gene that causes Niemann Pick Type C. I got the amino acid sequences from entrez protein and performed an alignment with CLUSTAL W2. Then I got a list of mutations in the NPC1 gene from ClinVar. I got a total of 51 amino acid changes, 6 of which were benign/likely benign. I then searched for the amino acid in which the mutations occur in the sequence alignment and saw whether the amino acid was conserved or not. To see if the amino acids were conserved I used 3 different methods: Jensen Shannon Distribution, Shannon Entropy and Sum of Pairs.

More information and results here

Later I performed the analysis with the genes that were found to be replaceable and non replaceable to see if this was the case with these genes as well or not and if there was a difference between the two groups. Halfway through this I figured out a way to automate more of the process with a python script which you can see here (note - on the script the red coloring based on the amino acid properties DOES NOT work). You can see results here

I am now in the process of performing this analysis in a set of 260 genes that cause mendelian diseases that involve all organelles. Specific gene data is under sequences, then that is divided into replaceable, non replaceable, and other mendelian. When you go to a specific gene you can see a summary of the mutation conservation for the gene, protein sequences, protein alignment, and the conservation scores. You can see the consolidation of the data here and a blog post sumarizing the data for the first 130 genes here

####External tools used

Homologene
Oma Browser
ClustalW2 (retired)
ClustalOmega
ClinVar
Conservation Scoring (developed by Mona Singh's group at Princeton)

##Screens ####(folder: yeast) The folder currently contains images of the yeast were you can see the gene induction (because it is tagged to YFP). There are two folders, each for a different time point (2.5h, 23h). There is a third folder that contains the plate reader data for the first screen at two time points (24h and 48h). We tested 7 different 384 well plates on 3 yeast trains with a duplicate of each (and an extra plate across the 3 strains but no duplicate). Plates labeled A and D are TDP43 WT, B and E are TDP43 M337V and C and F are FUS WT.

More information here

##Acknowledgements All of the work in this repository was performed in collaboration with Ethan Perlstein and Perlstein Lab, primarily while I was interning there.

Special thanks to SciOpen Research Group for helping with acquiring some of the materials for the yeast screens and Professor Aaron Gitler (Stanford University) for letting us use his plasmids.

In memory of Giancarlo Ibarguen.

Name		Name	Last commit message	Last commit date
Latest commit History 90 Commits
conservation_latest_data		conservation_latest_data
literature_search		literature_search
mendelian_conservation		mendelian_conservation
scripts		scripts
yeast		yeast
yeast_replaceable_genes		yeast_replaceable_genes
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

conservation_latest_data

conservation_latest_data

literature_search

literature_search

mendelian_conservation

mendelian_conservation

scripts

scripts

yeast

yeast

yeast_replaceable_genes

yeast_replaceable_genes

README.md

README.md

Repository files navigation

About

Releases

Packages

Languages

materechm/plabData

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Languages