Aerobic biodegradation prediction with XGBoost Classifier

A machine learning classification model for predicting aerobic biodegradation, built with XGBoost as the ML algorithm and MACCS fingerprints as the chemical representation. This repository contains the related datasets, code, and model files.

An online predictor was published on Aropha AI at https://www.ai.aropha.com/aerobic-biodegradation/classification/about.html

Basic information

Dataset

The classification model was built on more than 3,000 data points, with SMILES strings as the inputs and the class (0 or 1) as the output. Only ready biodegradation data with a test duration of 28 days and a test principle of closed bottle test, closed respirometer, or CO2 evolution were considered.

ML algorithms

A total of 14 ML algorithms were compared: K-nearest neighbors, linear support vector machine (SVM), radial basis function SVM (RBF SVM), Gaussian process, multi-layer perceptron neural network, decision tree, random forest, bagging, adaptive boosting, gradient boosting, XGBoost, extra trees, Gaussian naive Bayes, and quadratic discriminant analysis.

XGBoost performed best among them.

Chemical representation

MACCS fingerprints were used as the chemical representation.
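As an illustration of this representation, the sketch below converts a SMILES string into a MACCS fingerprint with RDKit. The helper name smiles_to_maccs is hypothetical and is not taken from this repository; the example notebook may prepare its inputs differently. RDKit is imported inside the function so that the snippet can still be loaded on machines where RDKit is not installed.

```python
from typing import List, Optional


def smiles_to_maccs(smiles: str) -> Optional[List[int]]:
    """Return the MACCS fingerprint of a SMILES string as a list of 0/1 values.

    Returns None when RDKit cannot parse the SMILES string.
    """
    # Imported lazily so this module loads even without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem import MACCSkeys

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fingerprint = MACCSkeys.GenMACCSKeys(mol)  # 167-bit vector in RDKit
    return [int(bit) for bit in fingerprint.ToBitString()]
```

With RDKit installed, smiles_to_maccs("CCO") should return a list of 167 zeros and ones, which can be stacked row by row into a NumPy array or a pandas DataFrame before prediction.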

Other notes

Data balancing was performed because the two classes were not well balanced. Bayesian optimization was used to tune the model hyperparameters. To determine the model's applicability domain, chemical similarity was calculated from the fingerprints using the Tanimoto index.
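The applicability-domain idea can be illustrated with a small, dependency-free sketch. It computes the Tanimoto index between binary fingerprints (given as lists of 0/1) and reports the highest similarity between a query compound and a training set. The function names, the toy six-bit fingerprints, and the 0.5 threshold are illustrative assumptions, not values taken from this repository. With RDKit fingerprint objects, DataStructs.TanimotoSimilarity serves the same purpose.

```python
from typing import List, Sequence


def tanimoto(fp_a: Sequence[int], fp_b: Sequence[int]) -> float:
    """Tanimoto index of two equal-length binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Two all-zero fingerprints share no bits; scoring them 0.0 is a
    # convention chosen for this sketch.
    return both / either if either else 0.0


def max_similarity(query: Sequence[int], training: List[Sequence[int]]) -> float:
    """Highest Tanimoto index between the query and any training fingerprint."""
    return max(tanimoto(query, fp) for fp in training)


def in_domain(query: Sequence[int], training: List[Sequence[int]],
              threshold: float = 0.5) -> bool:
    """True if the query is at least `threshold` similar to a training compound."""
    return max_similarity(query, training) >= threshold


# Toy fingerprints for demonstration only.
training_fps = [
    [1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
]
query_fp = [1, 1, 0, 0, 0, 0]

print(f"max similarity: {max_similarity(query_fp, training_fps):.3f}")
print(f"in domain: {in_domain(query_fp, training_fps)}")
```

A prediction for a query compound whose maximum similarity falls below the chosen threshold would be treated as outside the applicability domain and interpreted with caution.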

Explanation of each folder

example-smiles-files: Files containing example SMILES strings that you can directly use for prediction.
example: An example Jupyter Notebook that guides you through library import, data preparation, prediction, and result saving.
model-data: The original data we used for building the model.
models: The XGBoost model file, which you can use directly once loaded (model = pickle.load(open(model_path, 'rb'))).
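The one-line load shown for the models folder can be written a little more defensively with a context manager, so that the file is closed after reading. The sketch below is an assumption-laden illustration: load_model and demo_model.pkl are hypothetical names, and a plain dictionary is pickled to a temporary file as a stand-in so the snippet runs without the repository files. Loading the real model additionally requires xgboost to be installed, and pickle files should only be loaded from sources you trust.

```python
import pickle
import tempfile
from pathlib import Path


def load_model(path):
    """Unpickle a model file and close the file handle afterwards."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in object so the sketch runs without the repository's model file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_model.pkl"  # hypothetical file name
    with open(demo_path, "wb") as handle:
        pickle.dump({"kind": "stand-in model"}, handle)
    model = load_model(demo_path)

print(model)
```

To use the real model, pass the path of the file downloaded from the models folder to load_model instead of the temporary demo file.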

Use the online predictor on Aropha AI

Address: https://www.ai.aropha.com/aerobic-biodegradation/classification/about.html

Download files to run locally

In addition to the online predictor, we encourage you to run the model files locally on your own data for full control over the predictions.

Dependencies

RDKit: Draw molecules and convert SMILES strings to fingerprints.
NumPy: Create matrices and perform mathematical operations.
pandas: Data manipulation.
scikit-learn: Framework for building ML models.
XGBoost: Train and run the XGBoost model.
pickle: Load the model files (part of the Python standard library).

Install the dependencies

RDKit

Installing RDKit with pip used to be difficult, but it is now published on PyPI (formerly under the name rdkit-pypi) and can be installed with the following command:

pip install rdkit

Alternatively, install it with conda:

conda install -c conda-forge rdkit

Others

pip install numpy
pip install pandas
pip install -U scikit-learn
pip install xgboost
pickle is part of the Python standard library, so it needs no separate installation.
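After installation, a quick check such as the following can confirm that each dependency is importable. This is a minimal sketch using only the standard library; check_dependencies is a hypothetical helper name. Note that import names can differ from install names (scikit-learn is imported as sklearn).

```python
import importlib.util

# Import names, which may differ from the names used with pip or conda.
REQUIRED_MODULES = ["rdkit", "numpy", "pandas", "sklearn", "xgboost", "pickle"]


def check_dependencies(modules=REQUIRED_MODULES):
    """Map each module name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


status = check_dependencies()
for name, available in status.items():
    print(f"{name}: {'ok' if available else 'missing'}")
```

Any module reported as missing can then be installed with the corresponding command listed in this section.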

Download the model file and follow the Jupyter Notebook

Download the model file from the models folder and follow the Jupyter Notebook in the example folder to run the model on your own data.

Repository: KuanHuang/aerobic-biodeg-xgb-classifier