GitHub - SchlossLab/Topcuoglu_ML_mBio_2020: Best practices for applying machine learning to bacterial 16S rRNA gene sequencing data

A framework for effective application of machine learning to microbiome-based classification problems

Abstract

Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made towards developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods calls the validity of these models into question. Furthermore, many researchers appear to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen-relevant neoplasias (SRNs; n=490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, decision trees, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an AUROC of 0.695 [IQR 0.651-0.739] but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 [IQR 0.625-0.735], trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability.

Overview

project
|- README         		# the top level description of content (this doc)
|- CONTRIBUTING    		# instructions for how to contribute to your project
|- LICENSE         		# the license for this project
|
|- data/           		# raw and primary data, are not changed once created
| |- process/     		# .tsv and .csv files generated with main.R that runs the models
| |- baxter.0.03.subsample.shared      	# subsampled mothur generated file with OTUs from Marc Sze's analysis
| |- metadata.tsv     		        # metadata with clinical information from Marc Sze's analysis 		
|- code/          			# any programmatic code
| |- learning/    			# generalization performance of model
| |- testing/     			# building final model
|
|- results/        			# all output from workflows and analyses
| |- tables/      			# tables and .Rmd code of the tables to be rendered with kable in R
| |- figures/     			# graphs, likely designated for manuscript figures
|
|- submission/
| |- manuscript.Rmd 			# executable Rmarkdown for this study, if applicable
| |- manuscript.md 			# Markdown (GitHub) version of the *.Rmd file
| |- manuscript.tex 			# TeX version of *.Rmd file
| |- manuscript.pdf 			# PDF version of *.Rmd file
| |- header.tex 			# LaTeX header file to format pdf version of manuscript
| |- references.bib 			# BibTeX formatted references
|
|- Makefile	 # Reproduce the manuscript, figures and tables

How to use the outlined ML pipeline for your own project

Please go to https://github.com/SchlossLab/ML_pipeline_microbiome.
This repository exists to reproduce the manuscript; the link above leads to a user-friendly version of our pipeline.

How to regenerate this repository in R

Please take a look at the Makefile for more information about the workflow. For a more detailed account of what this ML pipeline achieves, please also read submission/manuscript.pdf.

Clone the GitHub repository and change into the project directory.

git clone https://github.com/SchlossLab/Topcuoglu_ML_mBio_2020.git
cd Topcuoglu_ML_mBio_2020

Our dependencies:
- R version 3.5.0
- The R packages that need to be installed in our environment: caret, rpart, xgboost, randomForest, kernlab, LiblineaR, pROC, tidyverse, cowplot, ggplot2, vegan, gtools, reshape2.
- Everything needs to be run from the project directory.
- We need to download 2 datasets (OTU abundances and colonoscopy diagnosis of 490 patients) from Sze MA, Schloss PD. 2018. Leveraging existing 16S rRNA gene surveys to identify reproducible biomarkers in individuals with colorectal tumors. mBio 9:e00630–18.
- We update the caret package with our modifications by running the command below. Before running it, take a look at the script and change the R packages directory to the one where caret is installed:
  
  Rscript code/learning/load_caret_models.R
These modifications are in data/caret_models/svmLinear3.R and data/caret_models/svm_Linear4.R
Follow the Makefile to generate the manuscript.
- The Makefile uses code/learning/main.R to run the pipeline, which sources 4 other scripts that are part of the pipeline.
  - To choose the model and model hyperparameters: code/learning/model_selection.R
  - To preprocess the dataset, split it 80-20, and train the model: code/learning/model_pipeline.R
  - To save the results of each model for each data split: code/learning/generateAUCs.R
  - To interpret the models: code/learning/permutation_importance.R
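The steps listed above (80-20 split, training, saving an AUC per data split, and permutation importance) can be illustrated with a minimal, self-contained sketch. This is not the repository's code: the actual pipeline is written in R on top of caret, while the sketch below uses only the Python standard library. The toy difference-of-class-means model, the stratified split, the number of splits, the synthetic data, and every function name are assumptions made for illustration.

```python
import random
import statistics


def auroc(labels, scores):
    """AUROC as the fraction of (case, control) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_split(labels, rng, train_frac=0.8):
    """Return (train_idx, test_idx), putting train_frac of each class in training."""
    train, test = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(round(train_frac * len(idx)))
        train += idx[:cut]
        test += idx[cut:]
    return train, test


def fit_centroid_model(X, y):
    """Toy linear model (assumption): weight = class-1 mean minus class-0 mean per feature."""
    def class_mean(cls, j):
        vals = [row[j] for row, lab in zip(X, y) if lab == cls]
        return sum(vals) / len(vals)
    return [class_mean(1, j) - class_mean(0, j) for j in range(len(X[0]))]


def predict(weights, X):
    """Linear score for each row; higher means more case-like."""
    return [sum(w * v for w, v in zip(weights, row)) for row in X]


def permutation_importance(weights, X, y, feature, rng):
    """Drop in held-out AUROC after shuffling a single feature column."""
    base = auroc(y, predict(weights, X))
    col = [row[feature] for row in X]
    rng.shuffle(col)
    X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, col)]
    return base - auroc(y, predict(weights, X_perm))


def run_splits(X, y, n_splits=20, seed=0):
    """Repeat the 80-20 split, recording test AUROC and per-feature importance each time."""
    rng = random.Random(seed)
    aucs, importances = [], []
    for _ in range(n_splits):
        train, test = stratified_split(y, rng)
        w = fit_centroid_model([X[i] for i in train], [y[i] for i in train])
        X_te, y_te = [X[i] for i in test], [y[i] for i in test]
        aucs.append(auroc(y_te, predict(w, X_te)))
        importances.append(
            [permutation_importance(w, X_te, y_te, j, rng) for j in range(len(X[0]))]
        )
    q1, _, q3 = statistics.quantiles(aucs, n=4)
    return {"median": statistics.median(aucs), "iqr": (q1, q3),
            "aucs": aucs, "importances": importances}


def make_synthetic(n=200, seed=1):
    """Synthetic data (assumption): feature 0 is informative, feature 1 is noise."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(1.5 * lab, 1.0), rng.gauss(0.0, 1.0)] for lab in y]
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic()
    result = run_splits(X, y)
    print("median AUROC:", result["median"], "IQR:", result["iqr"])
```

Repeating the split guards against a single lucky or unlucky partition, and summarizing the per-split AUROCs with a central value and an IQR mirrors how the abstract reports performance. Permutation importance is measured on held-out data only, so a large drop in AUROC indicates a feature the trained model actually relies on.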
