GitHub - wvictor14/team_Methylation-Badassays: STAT 540 Spring 2017 team repository

The Methylation Badassays

Introduction:

DNA methylation (DNAm) is linked to many diseases, such as cancer and autism. However, DNAm marks can change with environmental stimuli, cell type, gender, and other factors. In recent years, several DNAm studies have suggested that a large portion of DNAm variability is heritable and associated with genetic ancestry, making genetic ancestry a potential confounding factor that is not given enough consideration in DNA methylation analysis. Differentially methylated CpG sites associated with pathology can be confounded by CpGs associated with genetic ancestry, causing spurious results. Therefore, to investigate how DNA methylation affects prenatal health, it is important to identify ancestry-associated CpGs so that true positives can be distinguished from them. This DNAm variability in the placenta due to genetic ancestry needs to be accounted for in large-scale DNAm studies; otherwise, results cannot be meaningfully interpreted to assess prenatal health. In this project, we investigate whether DNA in placental tissue is differentially methylated across populations of different ancestry.

Hypothesis: DNA in placental tissue is differentially methylated across populations of different ancestries.

We will first find methylation profiles in subjects from dataset 1, in which the genetic ancestry (Asian or Caucasian) of each subject is known. These profiles will then serve as a basis to cluster the methylation data in dataset 2, in which genetic ancestry is not known. For both datasets, DNAm was measured with the Illumina 450K microarray.

For details on the project ideas, datasets, and methods used in this project, please see our project proposal.

Team Members:

Name	Department/Program	GitHub ID
Victor Yuan	Genome Science and Technology	@wvictor14
Michael Yuen	Medical Genetics	@myuen89
Nivretta Thatra	Bioinformatics	@nivretta
Ming Wan	Statistics	@MingWan10
Anni Zhang	Genome Science and Technology	@annizubc

Workflow

This figure summarizes the workflow of our project:

For all our processing steps below, please see the Results folder for a more detailed write up on our findings from our analyses. If you're interested in the code, see the markdown files in the Scripts folder.

Preprocessing and Normalization

We first used this script to process the raw data of dataset 1 (via quality control, filtering, and normalization) into our processed data. For detailed information on dataset 1, please see Metadata.

Exploratory Analysis

We explored our data by generating sample-sample correlation heatmaps, plotting a few random CpGs and plotting the first few principal components.
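As a rough illustration of the sample-sample correlation step, the sketch below computes a Pearson correlation matrix between samples from their beta values. It is a minimal stand-alone example, not the project's R code: the function names and the toy beta values are hypothetical, and the real analysis would run over all probes that survive filtering.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def sample_correlation_matrix(samples):
    """samples maps a sample name to its beta values (same CpG order for all)."""
    names = sorted(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names} for a in names}


# Hypothetical toy data: 3 samples x 5 CpGs (not taken from the project).
toy = {
    "s1": [0.10, 0.80, 0.50, 0.30, 0.90],
    "s2": [0.12, 0.78, 0.55, 0.28, 0.88],
    "s3": [0.90, 0.20, 0.40, 0.70, 0.10],
}
corr = sample_correlation_matrix(toy)
```

The nested dictionary `corr` holds the values a heatmap would display; samples with similar methylation profiles (here `s1` and `s2`) correlate strongly, while dissimilar ones do not.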

Differential Methylation Analysis

We used the R package limma to identify differentially methylated probes between Asian and Caucasian samples. Please see our differential DNA methylation analysis script in the Scripts folder for the code and details. Limma prioritized 13 CpG sites that are differentially methylated between Caucasian and Asian genetic ancestry using a p-value cutoff of 0.01.
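To give a feel for what a per-CpG test between the two ancestry groups involves, here is a simplified sketch. It is not limma: limma fits linear models and uses moderated t-statistics, whereas this example computes a plain Welch t statistic per CpG. It also assumes the test is run on M-values (the logit transform of beta values), which this README does not confirm. All names and numbers are hypothetical.

```python
from math import log2, sqrt
from statistics import mean, variance


def beta_to_m(beta):
    """Convert a methylation beta value (0 < beta < 1) to an M-value."""
    return log2(beta / (1.0 - beta))


def welch_t(group_a, group_b):
    """Welch's t statistic for two lists of values (unequal variances allowed)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical beta values for two CpGs (not taken from the project).
toy_cpgs = {
    "cpg_A": {"Asian": [0.80, 0.82, 0.78, 0.81], "Caucasian": [0.40, 0.42, 0.38, 0.41]},
    "cpg_B": {"Asian": [0.50, 0.55, 0.45, 0.52], "Caucasian": [0.51, 0.47, 0.54, 0.49]},
}

t_stats = {}
for cpg, groups in toy_cpgs.items():
    asian_m = [beta_to_m(b) for b in groups["Asian"]]
    caucasian_m = [beta_to_m(b) for b in groups["Caucasian"]]
    t_stats[cpg] = welch_t(asian_m, caucasian_m)
```

In this toy example `cpg_A` separates the groups clearly and gets a large statistic, while `cpg_B` does not. A real analysis would convert each statistic to a p-value and compare it against the chosen cutoff.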

Building an Ancestry Classifier

To build the DNA methylation ancestry classifier, we compared SVM and elastic net logistic regression (glmnet) models. We chose glmnet for the final model and used a nested cross-validation strategy both to tune the penalization parameters and to estimate the test error. After generating the final model, we analyzed the predictors and examined the predictions on the secondary, unlabelled dataset. Please see the predictive modeling subdirectory for the markdown files and details.
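The nested cross-validation idea can be sketched as follows: an inner loop picks the hyperparameter using only the outer-training data, and an outer loop scores the tuned model on a fold it never saw during tuning. This is a minimal illustration under stated assumptions, not the project's code. A tiny 1-D k-nearest-neighbour classifier stands in for glmnet, `k` stands in for the penalization parameters, and the data, fold counts, and function names are all hypothetical.

```python
def k_folds(indices, k):
    """Split a list of indices into k interleaved, roughly equal folds."""
    return [indices[i::k] for i in range(k)]


def knn_predict(train, x, k):
    """Majority vote among the k training points (value, label) nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0


def accuracy(train, test, k):
    """Fraction of test points classified correctly by k-NN fit on train."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


def nested_cv(data, candidate_ks, outer_k=3, inner_k=2):
    """Return one accuracy per outer fold; tuning only ever sees inner folds."""
    outer_scores = []
    for held_out in k_folds(list(range(len(data))), outer_k):
        outer_train_idx = [i for i in range(len(data)) if i not in held_out]
        inner_folds = k_folds(outer_train_idx, inner_k)

        def inner_score(k):
            # Mean validation accuracy of this candidate over the inner folds.
            scores = []
            for val_idx in inner_folds:
                tr = [data[i] for i in outer_train_idx if i not in val_idx]
                va = [data[i] for i in val_idx]
                scores.append(accuracy(tr, va, k))
            return sum(scores) / len(scores)

        best_k = max(candidate_ks, key=inner_score)
        # Refit on all outer-training data and score on the untouched fold.
        train = [data[i] for i in outer_train_idx]
        test = [data[i] for i in held_out]
        outer_scores.append(accuracy(train, test, best_k))
    return outer_scores


# Hypothetical, well-separated toy data: (feature value, class label).
toy_data = [(0.1, 0), (0.2, 0), (0.15, 0), (0.3, 0), (0.25, 0), (0.35, 0),
            (0.8, 1), (0.9, 1), (0.85, 1), (0.7, 1), (0.75, 1), (0.95, 1)]
scores = nested_cv(toy_data, candidate_ks=[1, 3])
```

The mean and spread of `scores` are what a "test error ± deviation" figure summarizes. Because the held-out fold plays no part in choosing `best_k`, the estimate is not biased by the tuning step.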

Brief Functional Analysis

In this script, we looked at the 13 CpG sites prioritized by limma and the 11 CpG sites prioritized by glmnet. Using the COHCAP (City of Hope CpG Island Analysis Pipeline) package, the CpGs were mapped to chromosome, location, gene name, and CpG island information. Each gene was annotated with its GO term using the package mygene.

Summary

Please see our poster! 😄
SVM performed slightly better than glmnet (for both training and testing error).
The final model used 11 CpG predictors and was built with glmnet (α = 0.75, λ = 0.25), with an AUC of 0.981 for training and 0.977 ± 0.024 for testing.
The classifier predicted every sample in the unlabeled test set to be Caucasian, which we doubt is the true case.
We suspect the test set is too ‘different’ from the training data set for the classifier to perform accurately on it.
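For readers unfamiliar with the α and λ values quoted above, the sketch below evaluates the elastic net penalty in the form glmnet documents it, λ[(1 − α)/2 · ‖β‖₂² + α · ‖β‖₁]. The coefficient vector is hypothetical and is not the fitted model; only α = 0.75 and λ = 0.25 come from the summary.

```python
def elastic_net_penalty(coefs, alpha, lam):
    """Elastic net penalty: lam * ((1 - alpha) / 2 * ||b||_2^2 + alpha * ||b||_1)."""
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * ((1.0 - alpha) / 2.0 * l2_squared + alpha * l1)


# Hypothetical coefficients, for illustration only.
toy_coefs = [0.5, -1.0, 0.0, 2.0]

penalty = elastic_net_penalty(toy_coefs, alpha=0.75, lam=0.25)
lasso_only = elastic_net_penalty(toy_coefs, alpha=1.0, lam=0.25)   # pure L1
ridge_only = elastic_net_penalty(toy_coefs, alpha=0.0, lam=0.25)   # pure L2
```

With α = 0.75 the penalty is weighted toward the L1 (lasso) term, which pushes many coefficients to exactly zero. That is consistent with a final model that keeps only a handful of CpG predictors.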

Future Direction

Normalizing and QCing the test and training datasets together may be necessary for DNA methylation classifiers to perform well.
Using MDS ancestry coordinates from population stratification meta-analyses may provide ‘labels’ to assess classifier performance or improve model building (self-reported ancestry can be unreliable).

Project proposal: includes the introduction to the ideas, dataset and methods we used in this project.
Progress report: contains the progress about our project.
Data folder contains metadata, raw data, and processed data.
- Metadata
  - human placental tissue from 45 subjects with self-reported ancestry
  - columns correspond to subject ancestry, name, sex, gestational age, and the complications they had in pregnancy (none, intrauterine growth restriction (IUGR), or late-onset preeclampsia (LOPET); neither complication affects DNAm)
  - columns for Sentrix ID and position correspond to the sample’s batch ID and position on the Illumina microarray
  - each row is one subject.
- Raw data for dataset 1.
- Processed data: this folder contains the data processed from the raw data.
Scripts folder contains the script for:
- Preprocessing: processing the raw data
- Exploratory Analysis: explores our processed training data to see if there is any obvious underlying structure.
- Differential Methylation Analysis: done using limma on the processed data.
- Building the classifier: This script is for building the ancestry classifier, as well as for the analysis of the resulting predictor CpGs. This folder also contains the script to run the classifier on the second dataset, whose genetic ancestry is unknown, and to analyze those results.
- Comparing SVM vs glmnet: This script was used to compare glmnet and SVM.
- Functional Analysis: for the functional analysis of the CpG sites prioritized by glmnet and limma.
Results folder contains a summary of our main findings.
Poster
