Prediction of antibiotic resistance in Escherichia coli from large-scale pan-genome data

The emergence of microbial antibiotic resistance is a global health threat. In clinical settings, the key to controlling the spread of resistant strains is accurate and rapid detection. As traditional culture-based methods are time consuming, genetic approaches have recently been developed for this task. Detection of antibiotic resistance typically relies on measuring a few known determinants previously identified from genome sequencing, and thus requires prior knowledge of the underlying biological mechanisms. To overcome this limitation, we employed machine learning models to predict resistance to 11 compounds across four classes of antibiotics from existing and novel whole genome sequences of 1936 E. coli strains. We considered a range of methods, and examined population structure, isolation year, gene content, and polymorphism information as predictors. Gradient boosted decision trees consistently outperformed alternative models, with an average accuracy of 0.91 on held-out data (range 0.81-0.97). While the best models most frequently employed gene content, an average accuracy score of 0.90 could be obtained using population structure information alone. Single nucleotide variation data were less useful, and significantly improved prediction for only two antibiotics, including ciprofloxacin. These results demonstrate that antibiotic resistance in E. coli can be accurately predicted from whole genome sequences without a priori knowledge of mechanisms, and that both genomic and epidemiological data can be informative. This paves the way to integrating machine learning approaches into diagnostic tools in the clinic.

This package includes the machine learning toolkit used for prediction. The models are regularized logistic regression, random forests, gradient boosted decision trees and deep learning.

Installation

There are three ways to run the tool:

The package may be downloaded and run as a binary file, ./PanPred.bin.
The tool is available on DockerHub and may be fetched and run using the following commands:

docker pull daneshmoradigaravand/panpred:latest
docker run -v $PWD:/data --rm -it daneshmoradigaravand/panpred:latest ./PanPred.py -h

The Python file may be executed directly, using the following command:

python3 PanPred.py

Manual

The tool is started with the binary command. Help is displayed with the -h option. The tool has two functions: preprocessing and modelling.

Usage: PanPred <command> [options]

Commands:
		    preprocess	    Preprocesses and creates the NGS input data for the machine learning tools
		    predict_LR, predict_RF, predict_GB, predict_DL	    Call a machine learning model (logistic regression, random forests, gradient boosted decision trees or deep learning) to predict resistance from the input
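
When the tool is driven from a script, the command line can be assembled programmatically. The following is a minimal sketch, not part of the package: build_command is a hypothetical helper, input.csv is a made-up file name, and the option values are placeholders. Only the command names and flags are taken from this README.

```python
import shlex

# Sub-commands listed in the usage text above.
COMMANDS = {"preprocess", "predict_LR", "predict_RF", "predict_GB", "predict_DL"}


def build_command(command, options, script="PanPred.py"):
    """Return an argv list such as ['python3', 'PanPred.py', 'preprocess', '-p', '.']."""
    if command not in COMMANDS:
        raise ValueError(f"unknown PanPred command: {command!r}")
    argv = ["python3", script, command]
    for flag, value in options.items():
        # Every value is passed as a string, as it would be typed in a shell.
        argv += [flag, str(value)]
    return argv


# Hypothetical file name and option values, for illustration only.
argv = build_command("predict_LR", {"-p": ".", "-i": "input.csv", "-g": 0.1, "-r": 0.2})
print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run(argv, check=True) from the directory that contains PanPred.py.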

Preprocessing

The preprocess step produces the input file for the modelling part.

Usage:
PanPred preprocess [options]

Options:
        -p STR   path to csv and Rtab files (Roary output files), default value: current path
        -c STR   one of create (create the input file), encode (label-encode the population structure), dedup (prepare the Roary output)
        -d STR   drug id number in the metadata file
        -i STR   input model ['GY', 'GYS', 'G','GS', 'S', 'SY']
        
        -m STR   name of the metadata file 
        -g STR   name of the accessory gene file
        -s STR   name of the population structure file
        
        -r STR   name of Rtab file from Roary output (.Rtab)
        -a STR   name of gene presence absence file from Roary output (.csv)

The letters G, Y and S in the input model stand for gene presence/absence pattern, year and population structure, respectively. The pan-genome input should be in the format of the Rtab file from Roary.
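
To make the input format concrete, the sketch below reads a Roary-style Rtab table (tab-separated, one row per gene, one column per isolate, 0/1 entries) and expands an input-model code into its feature groups. It is an illustration only and does not reproduce the package's own preprocessing: read_rtab and decode_input_model are hypothetical helpers, and the gene and isolate names are made up.

```python
import csv
import io

# Letters used by the -i option, as described above.
FEATURE_GROUPS = {"G": "gene presence/absence", "Y": "year", "S": "population structure"}


def decode_input_model(code):
    """Expand an input-model code such as 'GYS' into its feature groups."""
    unknown = set(code) - set(FEATURE_GROUPS)
    if not code or unknown:
        raise ValueError(f"invalid input model: {code!r}")
    return [FEATURE_GROUPS[letter] for letter in code]


def read_rtab(handle):
    """Read a Roary-style Rtab table into {isolate: {gene: 0 or 1}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    isolates = header[1:]  # the first column holds the gene name
    profiles = {isolate: {} for isolate in isolates}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, calls = row[0], row[1:]
        for isolate, call in zip(isolates, calls):
            profiles[isolate][gene] = int(call)
    return profiles


# Tiny made-up table: two genes, three isolates.
example = "Gene\tiso1\tiso2\tiso3\ngeneA\t1\t0\t1\ngeneB\t0\t0\t1\n"
profiles = read_rtab(io.StringIO(example))
print(profiles["iso3"])
print(decode_input_model("GY"))
```

Keying the table by isolate rather than by gene gives one feature row per strain, which is the orientation a classifier expects.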

Modelling

The modelling toolkit comprises four models, callable using the predict commands.

Logistic regression based prediction

python PanPred.py predict_LR
0.0
Calls a Machine Learning model Logistic regression
Usage:
PanPred predict [options]

Options:
        -p STR    path to input file and destination for output file 
        -i STR    input file name
        -g FLT    L2 penalty
        -r FLT    train/test ratio
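
The effect of the L2 penalty can be illustrated with a small, self-contained example. This is a minimal sketch of L2-regularized logistic regression fitted by batch gradient descent. It is not the package's implementation, and how the -g value maps onto a solver's penalty strength is an assumption. The data are made up.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_l2(X, y, l2=0.1, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the L2-penalised log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The penalty (l2 / 2) * ||w||^2 adds l2 * w to the gradient; the bias is not penalised.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Classify each row as 1 when the predicted probability is at least 0.5."""
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0 for xi in X]


# Made-up gene presence/absence rows; the first gene tracks the resistance label.
X = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]]
y = [1, 1, 0, 0, 1, 0]
w_weak, b_weak = fit_logistic_l2(X, y, l2=0.01)
w_strong, b_strong = fit_logistic_l2(X, y, l2=1.0)
print(predict(X, w_weak, b_weak))
print(abs(w_strong[0]) < abs(w_weak[0]))
```

A larger penalty shrinks the weights towards zero, which trades some fit on the training data for robustness when there are many more genes than isolates.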

Random forest based prediction

python PanPred.py predict_RF
None
Calls a Machine Learning model RandomForests
Usage:
PanPred predict [options]

Options:
        -p STR    path to input file and destination for output file 
        -i STR    input file name
        -r FLT    train/test ratio
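
The train/test ratio controls how the isolates are divided before fitting. The sketch below shows one common way to do a reproducible shuffled split. It is an illustration under assumptions: this README does not say whether -r denotes the training or the test share, so the example treats it as the held-out test fraction, and train_test_split here is a hypothetical helper rather than the package's code.

```python
import random


def take(indices, sequence):
    """Return the items of sequence at the given indices."""
    return [sequence[i] for i in indices]


def train_test_split(samples, labels, test_ratio=0.2, seed=0):
    """Shuffle indices reproducibly and hold out a fraction for testing."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # a fixed seed gives the same split on every run
    n_test = max(1, round(len(indices) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (take(train_idx, samples), take(test_idx, samples),
            take(train_idx, labels), take(test_idx, labels))


# Made-up isolate names and labels.
samples = [f"isolate_{i}" for i in range(10)]
labels = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = train_test_split(samples, labels, test_ratio=0.2)
print(len(X_train), len(X_test))
```

Splitting on shuffled indices keeps each isolate paired with its label and ensures that no isolate appears in both sets.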

Gradient boosted decision trees based prediction

python PanPred.py predict_GB
None
Calls a Machine Learning model GradientBoosting
Usage:
PanPred predict [options]

Options:
        -p STR    path to input file and destination for output file 
        -i STR    input file name
        -r FLT    train/test ratio

Deep Learning based prediction

python PanPred.py predict_DL
Calls a Machine Learning model Deep Learning
Usage:
PanPred predict [options]

Options:
        -p STR    path to input file and destination for output file 
        -i STR    input file name
        -r FLT    train/test ratio
        
        -d FLT    drop_out
        -n INT   number of nodes in the first layer
        -m INT   number of nodes in intermediate layers
        -l INT    number of layers

Note: for random forests and gradient boosted decision trees, the optimal parameters reported in the paper are used.

Supplemental files

The test_data directory contains input files and basic ML commands.

/test_data: input data used in the manuscript
/Rcode: R code for population structure matrix generator.
CARD_resistance: CARD resistance results
ResFinder_resistance: ResFinder resistance results
GB_tuning.csv: GB (Gradient boosted decision trees) hyperparameter tuning results
LG_tuning.csv: LG (Logistic Regression) hyperparameter tuning results
NN_tuning.csv: NN (Deep Learning) hyperparameter tuning results
RF_tuning.csv: RF (Random Forests) hyperparameter tuning results

External files are found

Contact

For queries, please contact Danesh Moradigaravand, Data-Driven Microbiology lab, Center for Computational Biology, University of Birmingham.
