aebk2015/drug-discovery-feature-selection

Drug Discovery Feature Selections

Replication of the experiments in Rahman Pujianto's master's thesis research (Universitas Indonesia, 2017).

Dataset Preparation

Prepared (trainable) datasets are provided in dataset/dataset.tar.gz. The steps below are additional information on how to prepare the dataset from raw sources (*.sdf or *.mol2 files).

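Required tools:

  • Open Babel (provides the obabel command used below)
  • PaDEL-Descriptor (PaDEL-Descriptor.jar, run with Java)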

Extracting positive (label = 1) training data:

  1. Convert sdf to mol2: obabel ../dataset/pubchem-compound-active-hiv1-protease.sdf -O ../dataset/pubchem-compound-active-hiv1-protease_mol2/hiv1-protease.mol2
  2. Convert mol2 to csv: java -jar PaDEL-Descriptor.jar -2d -addhydrogens -removesalt -dir ../dataset/pubchem-compound-active-hiv1-protease_mol2/ -file ../dataset/pubchem-compound-active-hiv1-protease.csv

Extracting negative (label = 0) training data:

  1. Convert sdf to mol2: obabel ../dataset/decoys_final.sdf -O ../dataset/decoys_final_mol2/decoys_final.mol2
  2. Convert mol2 to csv: java -jar PaDEL-Descriptor.jar -2d -addhydrogens -removesalt -dir ../dataset/decoys_final_mol2/ -file ../dataset/decoys.csv

Extracting test data (unlabeled):

  1. Convert mol2 to csv: java -jar PaDEL-Descriptor.jar -2d -addhydrogens -removesalt -dir ../dataset/HerbalDB_mol2/ -file ../dataset/HerbalDB.csv
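
Once the descriptor CSV files exist, the active and decoy rows need a label column before they can be merged into one trainable file. The sketch below illustrates that labeling step with Python's standard library only. It is not the repository's 01-prepare-data.py; the helper name label_rows, the column names, and the inline sample values are hypothetical stand-ins for real PaDEL-Descriptor output.

```python
import csv
import io


def label_rows(csv_text, label):
    """Parse descriptor CSV text and attach a 'label' value to every row.

    Hypothetical helper for illustration only.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        row["label"] = label
        rows.append(row)
    return rows


# Tiny stand-ins for the descriptor CSV files (made-up values and columns).
actives_csv = "Name,nAtomP,nsOH\nactive_1,4,2\nactive_2,6,3\n"
decoys_csv = "Name,nAtomP,nsOH\ndecoy_1,1,0\n"

# Actives get label 1, decoys get label 0, as in the sections above.
dataset = label_rows(actives_csv, 1) + label_rows(decoys_csv, 0)

# Write the merged rows back out as a single CSV document.
out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=["Name", "nAtomP", "nsOH", "label"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(dataset)
merged_csv = out.getvalue()
print(merged_csv)
```

With real files, the inline strings would be replaced by the contents of the CSV files written by PaDEL-Descriptor.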

Dataset Description

Datasets provided in this repo:

  1. dataset/dataset.tar.gz:
    1. dataset.csv: 3,665 HIV-1 protease inhibitors from PubChem BioAssay + 3,665 HIV-1 protease decoys from DUD-E (Mysinger, Carchia, Irwin, & Shoichet, 2012)
    2. dataset_test.csv: the top 10 protease inhibitors from the Indonesian herbal database (Yanuar et al., 2014)
  2. dataset/daftar-senyawa-beserta-binding-energy.csv: docking results for 368 molecules from the Indonesian herbal database (Yanuar et al., 2014) that the machine learning model in this research predicted to be HIV-1 protease inhibitors

Raw datasets (*.sdf and *.mol2) can be downloaded from https://drive.google.com/open?id=1X_wkpvSLXXXUPbxmFd7tE5pe0t_njMe_

Experiments

Dependencies:

  • Python 3.x
  • Python3-tk (on Ubuntu: sudo apt install python3-tk)
  • Virtualenv (optional, for an isolated environment)

Install the required libraries: pip install -r requirements.txt

Steps:

  1. Extract preprocessed data from dataset/dataset.tar.gz (if you have raw csv data, use python 01-prepare-data.py)
  2. Feature selection with SVM-RFE: python 02-feature-selection-svm-rfe.py
  3. Feature selection with Wrapper Method (GA + SVM): python 02-feature-selection-wm.py
  4. Evaluate selected features using the PubChem dataset: python 03-evaluate-1.py
  5. Evaluate selected features using the Indonesian Herbal dataset: python 03-evaluate-2.py

The evaluation scripts print accuracy scores to the console, save raw results to csv files, and display the result chart(s) on screen.
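
For readers unfamiliar with SVM-RFE (step 2), the sketch below shows the general technique: a linear SVM is fitted, the feature with the smallest weight is dropped, and the process repeats until the requested number of features remains. It assumes scikit-learn is available and uses synthetic data in place of dataset.csv; the sample sizes, C value, and number of selected features are arbitrary choices for illustration, and the repository's 02-feature-selection-svm-rfe.py may be organized differently.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the descriptor matrix (assumption: not the real data).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale features so the SVM weights are comparable across descriptors.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Recursive feature elimination: drop one feature per iteration (step=1)
# until five remain, ranking features by the linear SVM's coefficients.
svm = LinearSVC(C=1.0, max_iter=10000)
selector = RFE(estimator=svm, n_features_to_select=5, step=1)
selector.fit(X_train_s, y_train)

selected = [i for i, keep in enumerate(selector.support_) if keep]
accuracy = selector.score(X_test_s, y_test)
print("selected feature indices:", selected)
print("held-out accuracy: %.3f" % accuracy)
```

Repeating the fit for different values of n_features_to_select would give an accuracy-per-feature-set curve of the kind the scripts chart.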

Some Results

PubChem dataset visualizations using t-SNE. Generated by running python visualize-dataset.py:

PubChem t-SNE perplexity=5

PubChem t-SNE perplexity=100
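
The two plots differ only in the perplexity setting, which roughly controls how many neighbors each point attends to. A minimal sketch of producing both embeddings follows, assuming scikit-learn's TSNE and synthetic data in place of the PubChem descriptors; it is not the code of visualize-dataset.py, and plotting is omitted.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

# Synthetic stand-in data; perplexity must stay below the number of samples.
X, y = make_classification(
    n_samples=150, n_features=20, n_informative=5, random_state=0
)

embeddings = {}
for perplexity in (5, 100):
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    # Each row of the result is the 2-D position of one sample.
    embeddings[perplexity] = tsne.fit_transform(X)
    print("perplexity=%d -> shape %s" % (perplexity, embeddings[perplexity].shape))
```

Each embedding can then be drawn as a scatter plot colored by the label y.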

Top 10 PubChem features by importance (using Extra Trees):

  1. feature 520 (maxsOH): 0.08817
  2. feature 401 (minsOH): 0.06929
  3. feature 282 (SsOH): 0.02738
  4. feature 110 (nHsOH): 0.02464
  5. feature 163 (nsOH): 0.0211
  6. feature 35 (BCUTw-1l): 0.01432
  7. feature 467 (maxHsOH): 0.01308
  8. feature 406 (minsOm): 0.01307
  9. feature 588 (nAtomP): 0.01254
  10. feature 142 (nsssCH): 0.01231

PubChem Extra Trees feature importance
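
A ranking like the one above can be obtained from the impurity-based importances of an Extra Trees ensemble. The sketch below assumes scikit-learn's ExtraTreesClassifier and synthetic data; the number of trees and the data shape are arbitrary, and the printed values will not match the PubChem results listed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the descriptor matrix (assumption: not the real data).
X, y = make_classification(
    n_samples=300, n_features=15, n_informative=4, random_state=0
)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Sort feature indices from most to least important.
ranking = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)

for rank, idx in enumerate(ranking[:10], start=1):
    print("%d. feature %d: %.5f" % (rank, idx, importances[idx]))
```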

SVM-RFE also shows that a single feature already gives > 80% accuracy on the PubChem dataset. Generated by running python 02-feature-selection-svm-rfe.py:

PubChem Linear SVM + RFE Accuracy per feature set

Comparisons between Linear SVM (no feature selection), Linear SVM + RFE & SVM + Wrapper Method (WM) classification metrics on PubChem dataset. Generated by running python 03-evaluate-1.py:

Receiver Operating Characteristic (ROC) Curves

Classification Accuracy, Sensitivity, Precision and Specificity
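
For reference, the metrics in this chart all derive from the four counts of a binary confusion matrix. The sketch below computes accuracy, sensitivity, precision and (assuming that is the fourth metric intended) specificity in plain Python. The function name classification_metrics and the example labels are hypothetical and are not taken from 03-evaluate-1.py.

```python
def classification_metrics(y_true, y_pred):
    """Compute binary classification metrics from true and predicted labels.

    Label 1 is the positive class (inhibitor), label 0 the negative (decoy).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is absent.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate (recall)
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }


# Made-up labels: 3 true positives, 1 false negative,
# 4 true negatives, 2 false positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print("%s: %.3f" % (name, value))
```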

Citation

@mastersthesis{pujianto2017thesis,
    author={Rahman {Pujianto}},
    title={Drug Candidates Virtual Screening on Indonesian Herbal Plants Database using Machine Learning and Various Feature Selection Strategies},
    school={Universitas Indonesia},
    year={2017},
}
