tcrd-similar-targets

Jing Ai and Jung Hoon Son

BINF G4002 Computation Methods

Final Project

**Utilizing the TCRD database (used by Pharos) to predict drug targets associated with adverse drug events.**

Compiled Data Set:

Main Data File:

all_tcrd.Rda is an R workspace file containing the all_tcrd data frame used in our analysis. The data frame can be regenerated by running Step1_SQL_query_matrix_builder.R on any computer with a local MySQL instance of the Pharos/TCRD database. The SQL data dump is available from: http://juniper.health.unm.edu/tcrd/download/tcrd_v4.4.2.sql.gz

Individual data sets based on IDG family (GPCRs, ion channels, kinases, NRs, null classes) have been compiled and are available in the individual_class folder.

toxic_targets/GoldStandards.csv: This file contains curated information from the WITHDRAWN database on drug targets associated with drugs withdrawn around the world (rows where y=0). Together with the protein targets classified as Tclin in TCRD that are not associated with any withdrawn drug (rows where y=1), these serve as our gold standard for which targets have and have not been implicated in adverse drug events.

Code Description:

Step 1: SQL Query, Vector representation of targets, and Working Matrix

Step1_SQL_query_matrix_builder.R

This code queries a local instance of the TCRD MySQL database for each druggable target, plus additional targets identified via the WITHDRAWN database. Each target is converted to a vector representation, and the feature set is expanded as each new target vector is added.

Step 2: Feature selection

analysis_code/DataMappingColsums.R

This file contains code for feature reduction. Because a large number of features were sparse (mostly zero), we first removed the columns with fewer than 20 non-zero values (out of 1129 samples).
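The sparsity filter described above can be sketched as follows. This is a minimal illustration in plain Python, not the code in DataMappingColsums.R (which is written in R); the function name `drop_sparse_columns` and the toy matrix are hypothetical, and the threshold is lowered to 2 so the toy example stays small.

```python
def drop_sparse_columns(rows, min_nonzero=20):
    """Keep only the columns that have at least `min_nonzero` non-zero entries.

    `rows` is a list of equal-length numeric lists (samples x features).
    Returns the reduced matrix and the indices of the columns that were kept.
    """
    if not rows:
        return [], []
    n_cols = len(rows[0])
    nonzero_counts = [0] * n_cols
    for row in rows:
        for j, value in enumerate(row):
            if value != 0:
                nonzero_counts[j] += 1
    kept = [j for j, count in enumerate(nonzero_counts) if count >= min_nonzero]
    reduced = [[row[j] for j in kept] for row in rows]
    return reduced, kept


# Hypothetical toy matrix: 4 samples x 3 features.
toy = [
    [1, 0, 0],
    [1, 0, 3],
    [0, 0, 2],
    [1, 1, 0],
]
# The README uses a threshold of 20; 2 is used here only because the toy is tiny.
reduced, kept = drop_sparse_columns(toy, min_nonzero=2)
print(kept)     # indices of the surviving columns
print(reduced)  # matrix restricted to those columns
```

In the toy matrix the middle column has a single non-zero entry, so it is the only one dropped.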

analysis_code/FeatureSelection.ipynb

This file contains the feature selection code (in Python). We applied chi-square feature selection and filtered out features with chi-square p-values of <0.05.
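As a rough illustration of chi-square filtering, the sketch below runs a Pearson chi-square test of independence between each binary feature and the binary label, using only the standard library. Several things here are assumptions rather than facts about FeatureSelection.ipynb: the notebook may use a library routine that computes a different chi-square statistic, the features may not be binary, and the sketch assumes that features with p < 0.05 are the ones retained. The names `chi_square_2x2` and `select_features` and the toy data are hypothetical.

```python
import math


def chi_square_2x2(feature, labels):
    """Pearson chi-square test of independence for a binary feature vs. a binary label.

    Returns (statistic, p_value). With 1 degree of freedom the chi-square
    survival function has the closed form erfc(sqrt(x / 2)).
    """
    observed = [[0, 0], [0, 0]]
    for x, y in zip(feature, labels):
        observed[1 if x else 0][1 if y else 0] += 1
    n = len(labels)
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty margins (constant feature or label)
                stat += (observed[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def select_features(columns, labels, alpha=0.05):
    """Return the names of features whose p-value is below `alpha` (assumed direction)."""
    selected = []
    for name, values in columns.items():
        _, p_value = chi_square_2x2(values, labels)
        if p_value < alpha:
            selected.append(name)
    return selected


# Hypothetical toy data: 20 samples, one feature tracking the label, one unrelated.
labels = [1] * 10 + [0] * 10
columns = {
    "informative": [1] * 9 + [0] + [0] * 9 + [1],
    "noise": [1, 0] * 10,
}
selected = select_features(columns, labels)
print(selected)
```

If the intended behaviour is the opposite (discarding features with p < 0.05), only the comparison in `select_features` changes.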

Step2_prelim_prediction.R fits a preliminary randomForest model in R for quick checks against our test sets.

Step 3: Predictive modeling of drug targets

analysis_code/combined_analysis_JA.R

This file contains the modeling code for L1-regularized logistic regression (5-fold CV), Naive Bayes, and Random Forest, evaluated with cross-validation.
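For readers unfamiliar with k-fold cross-validation, the sketch below shows how sample indices can be partitioned into 5 folds so that each sample is held out exactly once. It is a generic illustration in plain Python, not the splitting logic in combined_analysis_JA.R; the function name `k_fold_indices`, the fixed seed, and the use of the 1129-sample count from the feature reduction step are assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(1129, k=5))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_number, len(train_idx), len(test_idx))
```

A model would be fit on `train_idx` and evaluated on `test_idx` in each iteration, with the per-fold metrics averaged at the end.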

analysis_code/SVM.R

This file contains the modeling code for Support Vector Machines (linear and kernel).

Step 4: Future work (biologically validating the predicted targets)

future_work/predicting_drugs.R

This code can be used to generate an adverse drug scoring system. We query the entire TCRD database for drugs associated with the targets predicted by our algorithm (among the 2117 targets for which we made predictions), then compute a score for each drug as defined below.

Drug_prediction_score.csv is generated by compiling all targets predicted by an exploratory randomForest model and computing the following score for each drug:

Score = (# of predicted adverse targets) - (# of predicted non-adverse targets)

Drugs with the highest scores are expected to have the highest likelihood of being implicated in an adverse drug effect.
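The scoring rule above can be sketched in a few lines. This is an illustration in plain Python rather than the R code in predicting_drugs.R; the function name `score_drugs`, the input layout (a list of unique drug-target pairs plus a target-to-prediction mapping), and all drug and target names are hypothetical.

```python
from collections import defaultdict


def score_drugs(drug_targets, predicted_adverse):
    """Score each drug as (# predicted adverse targets) - (# predicted non-adverse targets).

    drug_targets: iterable of unique (drug, target) pairs.
    predicted_adverse: dict mapping target -> True (adverse) or False (non-adverse).
    Returns (drug, score) pairs sorted from highest to lowest score.
    """
    scores = defaultdict(int)
    for drug, target in drug_targets:
        if target not in predicted_adverse:
            continue  # targets without a prediction contribute nothing
        scores[drug] += 1 if predicted_adverse[target] else -1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical drug-target associations and model predictions.
pairs = [
    ("drug_A", "T1"), ("drug_A", "T2"), ("drug_A", "T3"),
    ("drug_B", "T1"), ("drug_B", "T4"),
    ("drug_C", "T4"),
]
predictions = {"T1": True, "T2": True, "T3": False, "T4": False}

ranked = score_drugs(pairs, predictions)
for drug, score in ranked:
    print(drug, score)
```

Because the score is a difference of counts, a drug with many targets can reach a larger absolute score than a drug with few; whether to normalize by the number of targets is a design choice the README does not address.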

Future work

future_work/plot_playground.R

This is an experimental file used to generate exploratory heatmaps and network visualizations. It currently generates two plots:

heatmap.pdf is a demonstration of heatmaps with K-means clustering applied, showing similar targets based on our vector representation.

correlation_network.pdf is a visualization of the correlation matrix of the features. It is currently uninformative, but the approach shows potential.

Results:

results

This folder contains our results, including model performance evaluations and lists of important features.

Repository: plasmak11/tcrd-similar-targets
