Drug Sensitivity Prediction From Cell Line-Based Pharmacogenomics Data: Guidelines for Developing Machine Learning Models

Installation

Requirements

Python 3
Conda

To get the source files of PGx Guidelines you need to clone into its repository:

git clone https://github.com/bhklab/PGx_Guidelines

Conda environment

All the required packages to run PGx Guidelines experiments are specified in environment subdirectory. To install these packages run the following command:

conda env create -f PGx.yml

This command installs PGxG environment.

After the successful installation of the packages into environmet you would be able to load them using conda activate.

Datasets

Download datasets

All of the utilized datasets for PGx Guidelines experiments are publicly available in the PSet format via ORCESTRA platform:

https://www.orcestra.ca/pset/stats

Preprocess and load datasets

After downloading PSet objects, the molecular and pharmacological data can be extracted via R using codes provided in Preprocess data subdirectory.

To load all datasets and Area above dose-response curve (AAC) data, run LoadAllPSets.R.

To load log transformed and truncated IC50 values, run IC50Loading_logtruncated.R.

tissueType_encoding.csv file is one-hot coding of tissue types which is added to molecular profiles to adjust for tissue type.

Running R scripts generates the final datasets in .tsv format. Add them to a new subdirectory Data_All:

mkdir Data_All

By creating this subdirectory and adding all the data files to it, you will be able to re-run PGx Guidelines experiments. Alternatively, we have also provided these preprocessed files on Zenodo.

Experiments

Run univariable analysis

Each Rscript includes code to load required libraries and datasets.

Simply run the following for:

all [solid and non-solid] tissues:

Rscript biomarker_analysis_alltissues.R "$@"

after excluding non-solid tissues:

Rscript biomarker_analysis_solidonly.R "$@"

after excluding non-solid tissues and log transformed IC50 values:

Rscript biomarker_analysis_log.R "$@"

after excluding non-solid tissues and truncated

Rscript biomarker_analysis_truncated.R "$@"

after excluding non-solid tissues, truncated, and log transformed IC50 values:

Rscript biomarker_analysis_truncated_log.R "$@"

Within-domain

For this analysis, we have provided the Python scripts as follows:

Ridge Regression: Within-Ridge-aac.py and Within-Ridge-ic50.py

sbatch ridge-wjob-aac.bs
sbatch ridge-wjob-ic50.bs

Elastic Net: Within-EN-aac.py and Within-EN-ic50.py:

sbatch en-wjob-aac.bs
sbatch en-wjob-ic50.bs

Random Forest: Within-RF-aac.py and Within-RF-ic50.py.

sbatch rf-wjob-aac.bs
sbatch rf-wjob-ic50.bs

Cross-domain

For this analysis, we have provided the Jupyter notebooks to run Ridge Regression (Ridge.ipynb), Elastic Net (ElasticNet.ipynb), and Random Forest (RandomForest.ipynb). For Deep Neural Networks experiments, we have provided python scripts in DNN subdirectory to run them. First you should create directories to store logs, models, and results. You should also add your local path to these directories to PGxGRun.bs:

mkdir logs
mkdir models
mkdir results
sbatch PGxGRun.bs

We have also provided randomly generated hyperparameter settings in filelistF10Uniquev1.

We have provided the model objects for the best settings of DNN experiments on Zenodo.

CTRPv2 vs. GDSCv1

For this analysis, we have provided the Jupyter notebook GDSCv1.ipynb.

Impact of non-solid cell lines

For this analysis, we have provided the Jupyter notebook SolidandnonSolid.ipynb. For running the random subset experiment, run SNRidge-aac.py script.

python SNRidge-aac.py

Citation

    author = {Sharifi-Noghabi, Hossein and Jahangiri-Tazehkand, Soheil and Smirnov, Petr and Hon, Casey and Mammoliti, Anthony and Nair, Sisira Kadambat and Mer, Arvind Singh and Ester, Martin and Haibe-Kains, Benjamin},
    title = "{Drug sensitivity prediction from cell line-based pharmacogenomics data: guidelines for developing machine learning models}",
    journal = {Briefings in Bioinformatics},
    year = {2021},
    month = {08},
    issn = {1477-4054},
    doi = {10.1093/bib/bbab294},
    url = {https://doi.org/10.1093/bib/bbab294},
    note = {bbab294},
    eprint = {https://academic.oup.com/bib/advance-article-pdf/doi/10.1093/bib/bbab294/39679532/bbab294.pdf},
}

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
CTRPv2vsGDSCv1		CTRPv2vsGDSCv1
Cross_domain		Cross_domain
FeatureSelection		FeatureSelection
Figure		Figure
Preprocess data		Preprocess data
SolidvsNonsolid		SolidvsNonsolid
Within_domain		Within_domain
environment		environment
univariate_stability		univariate_stability
Dockerfile		Dockerfile
README.md		README.md
Supplementary tables.xlsx		Supplementary tables.xlsx
biomarker_analysis_alltissues.R		biomarker_analysis_alltissues.R
biomarker_analysis_log.R		biomarker_analysis_log.R
biomarker_analysis_solidonly.R		biomarker_analysis_solidonly.R
biomarker_analysis_truncated.R		biomarker_analysis_truncated.R
biomarker_analysis_truncated_log.R		biomarker_analysis_truncated_log.R
gene_mapping.R		gene_mapping.R
genes_annotations.xls		genes_annotations.xls

bhklab/PGx_Guidelines

Folders and files

Latest commit

History

Repository files navigation

Drug Sensitivity Prediction From Cell Line-Based Pharmacogenomics Data: Guidelines for Developing Machine Learning Models

Table of contents

Installation

Requirements

Conda environment

Datasets

Download datasets

Preprocess and load datasets

Experiments

Run univariable analysis

Within-domain

Cross-domain

CTRPv2 vs. GDSCv1

Impact of non-solid cell lines

Citation

About

Resources

Stars

Watchers

Forks

Languages