GitHub - pgniewko/solubility: My (small) research project in solubility of drug-like molecules

Solubility Challenge

Notice: This is research code that will not necessarily be maintained in the future. The code is under development so make sure you are using the most recent version. I welcome bug reports and PRs but make no guarantees about fixes or responses.

python make_challenge_prediction.py --model ensemble \
                                    --train_file ../data/training/solubility.uniq.no-in-100.smi \
                                    --test_file ../data/test/test_100.smi \
                                    --out_file ../data/results/ensemble.test_100.preds.dat

Check your challenge predictions against the values available in public sources:

python estimate_accuracy.py ../data/test/test_100.with.gse.smi ../data/results/ensemble.test_100.preds.dat ../data/test/test_100.in-train.smi
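
The comparison step can be sketched roughly as follows. This is not the repository's estimate_accuracy.py; it is a minimal illustration that assumes the predictions and reference values are already loaded into dictionaries keyed by SMILES, and that compounds also present in the training set should be left out of the score. The function name `score_predictions`, the 0.5 log-unit tolerance, the returned keys, and the toy numbers are all assumptions made for this example.

```python
import math


def score_predictions(reference, predicted, exclude=frozenset(), tol=0.5):
    """Compare predicted logS values against reference values.

    reference, predicted: dicts mapping a SMILES string to a logS value.
    exclude: SMILES to skip (assumed here: compounds also in the training set).
    tol: tolerance in log units for counting a prediction as a hit (assumption).
    """
    # Score only compounds present in both sets and not excluded.
    shared = [s for s in reference if s in predicted and s not in exclude]
    if not shared:
        raise ValueError("no compounds to compare")

    errors = [predicted[s] - reference[s] for s in shared]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    within = sum(abs(e) <= tol for e in errors) / len(errors)
    return {"n": len(shared), "rmse": rmse, "within_tol": within}


# Toy values for illustration only; they are not measurements.
reference = {"CCO": 1.10, "c1ccccc1": -1.64, "CCCCCCCC": -5.24}
predicted = {"CCO": 0.80, "c1ccccc1": -2.64, "CCCCCCCC": -5.04}
scores = score_predictions(reference, predicted, exclude={"CCCCCCCC"})
print(scores)
```

Reporting the fraction of predictions within a fixed tolerance alongside the RMSE is a design choice for this sketch: a single large outlier inflates the RMSE but barely moves the hit fraction.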

Datasets

Note: The training dataset (i.e. all unique SMILES extracted from the raw data) was only mildly curated: (1) compounds with MolW > 600 or MolW < 60 were filtered out; (2) where multiple measurements were available, compounds whose values differed by more than 1 log unit or had opposite signs (e.g. logS0=3 and logS0=-3) were excluded; (3) the OCHEM database was excluded completely because it contains too many dubious data points.
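
The three curation rules above could be expressed along these lines. This is a sketch under stated assumptions, not the code used to build the training file: it assumes molecular weights have already been computed (for example with a cheminformatics toolkit), that each record carries a source label starting with "OCHEM" for OCHEM data, and that the filters are applied in the order shown. The function name `curate`, the record layout, and the placeholder identifiers standing in for SMILES strings are hypothetical.

```python
def curate(records, min_mw=60.0, max_mw=600.0, max_spread=1.0):
    """Apply the curation rules to raw solubility records.

    records: iterable of (smiles, mol_weight, logS0, source) tuples.
    Returns a dict mapping each kept SMILES to its list of logS0 values.
    """
    grouped = {}
    for smiles, mol_weight, log_s, source in records:
        # Rule 3: drop every OCHEM record (source label is an assumption).
        if source.startswith("OCHEM"):
            continue
        # Rule 1: keep only compounds with 60 <= MolW <= 600.
        if not (min_mw <= mol_weight <= max_mw):
            continue
        grouped.setdefault(smiles, []).append(log_s)

    kept = {}
    for smiles, values in grouped.items():
        # Rule 2: reject compounds whose repeated measurements disagree.
        spread = max(values) - min(values)
        opposite_signs = min(values) < 0 < max(values)
        if spread > max_spread or opposite_signs:
            continue
        kept[smiles] = values
    return kept


# Toy records; identifiers are placeholders, values are not measurements.
records = [
    ("mol_A", 46.1, 1.1, "set1"),                    # too light
    ("mol_B", 180.2, -1.7, "set1"),
    ("mol_B", 180.2, -1.9, "set2"),                  # agrees with set1
    ("mol_C", 250.3, -2.0, "set1"),
    ("mol_C", 250.3, -3.5, "set2"),                  # differs by 1.5 log units
    ("mol_D", 300.0, 0.3, "set1"),
    ("mol_D", 300.0, -0.3, "set2"),                  # opposite signs
    ("mol_E", 200.0, -4.0, "OCHEM.WaterSolubility"),  # OCHEM excluded
]
kept = curate(records)
print(kept)
```

The sketch returns the surviving measurements unmerged, because the note does not say how agreeing duplicates are combined.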

Dataset	Do I trust it?	Comments
A.2019.ADMET_DMPK	(+)	Had to get SMILES from name (some failed)
AB.2001.EJPS	(+/-)	Units are not clear to me
ABB.2000.PR	(+/-)	Units are not clear to me
BOM.2017.JC	(+)
D.2008.JCIC	(+)
H.2000.test1	(+)	Downloaded from the website
H.2000.test2	(+)	Downloaded from the website
H.2000.train	(+)	Downloaded from the website
HXZ.2004.JCIC.data_set	(+)	Downloaded from the website
HXZ.2004.JCIC.test_set1	(+)	Downloaded from the website
LGG.2008.JCIM.32	(+)
LGG.2008.JCIM.100	(+)
LPB.2013.JCIC [all]	(-)	Can't understand the format of the data!
POG.2007.JCIM.test	(+)	Data obtained from authors
POG.2007.JCIM.train	(+)	Data obtained from authors
WKH.2007.JCIM.solubility	(+)	ADME website data
WXY.2009.JCIM	(+/-)	Data in SLN format. Set-003 broken.
OCHEM.WaterSolubility	(+/-)	Lots of repeats, some sign errors
PubChem	(+/-)	No logS0 data; measurements at pH=7.4

Papers

Can You Predict Solubilities of Thirty-Two Molecules Using a Database of One Hundred Reliable Measurements?
Antonio Llinàs, Robert C. Glen and Jonathan M. Goodman
J. Chem. Inf. Modeling 2008, 48, 1289-1303
[paper 1]
[paper 2]
[website]
Note 0: This is the reference for the original Solubility Challenge
Note 1: In the test set, SMILES strings for probenecid and pseudoephedrine were swapped. Use only the soldataswap.xls file.
Note 2: Solubility for 32 compounds taken from HEL.2009.JCIM.pdf
Note 3: Data was downloaded from the original website, but the numbers are dubious (IMO) - use CAREFULLY!
ESOL: Estimating Aqueous Solubility Directly from Molecular Structure
John S. Delaney
J. Chem. Inf. Comput. Sci. 2004, 44, 1000-1005
[paper]
Note: There are two files D.2008.JCIC.solubility.v[1-2].txt. These files are the same but come from two different sources: (i) Pat Walters Blog (ii) ChemDB
Estimation of Aqueous Solubility for a Diverse Set of Organic Compounds Based on Molecular Topology
Jarmo Huuskonen
J. Chem. Inf. Comput. Sci. 2000, 40, 773-777
[paper]
[website]
Note: Quite a few repeats from Delaney Set. Different measurements, though.
ADME evaluation in drug discovery. 4. Prediction of aqueous solubility based on atom contribution approach
Tingjun Hou, Ke Xia, Wei Zhang, Xiaojie Xu
Journal of Chemical Information and Computer Sciences, 2004, 44, 266-275
[paper]
[website]
Development of reliable aqueous solubility models and their application in drug-like analysis
Junmei Wang, George Krudy, Tingjun Hou, George Holland, Xiaojie Xu
Journal of Chemical Information and Modeling, 2007, 47, 1395-1404
[paper]
[website]
Note: In logS database, the aqueous solubility was expressed as logS, where S is the solubility at a temperature of 20-25°C in mol/L. These are two databases for our modeling. In reference [4], the data afforded by Tetko was used. This database includes 1290 organic compounds. The data set was converted from the SMILES flat file representation to the MACCS/sdf structured data file. In reference [5], some new molecules collected from literature were added. This database includes 1708 molecules.
Can human experts predict solubility better than computers?
Samuel Boobier, Anne Osbourn and John B. O. Mitchell
Journal of Cheminformatics, 2017, 9:63
[paper]
[website]
Note: Source code accompanies the paper.
pH-metric solubility. 3. Dissolution titration template method for solubility determination
Alex Avdeef, Cynthia M. Berger
European Journal of Pharmaceutical Sciences 14 (2001) 281–29
[paper]
pH-Metric Solubility. 2: Correlation Between the Acid-Base Titration and the Saturation Shake-Flask Solubility-pH Methods
Alex Avdeef, Cynthia M. Berger, and Charles Brownell
Pharmaceutical Research, Vol. 17, No. 1, 2000
[paper]
Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information
Iurii Sushko et al.
J Comput Aided Mol Des (2011) 25:533–554
[paper]
[server]
Solubility Challenge revisited after 10 years, with multi-lab shake-flask data, using tight (SD 0.17 log) and loose (SD 0.62 log) test sets
Antonio Llinas and Alex Avdeef
J. Chem. Inf. Model., 2019
[paper]
Note: The reference for the new challenge.
Random Forest Models To Predict Aqueous Solubility
David S. Palmer, Noel M. O’Boyle, Robert C. Glen, and John B. O. Mitchell
J. Chem. Inf. Model. 2007, 47 (1), 150-158
[paper]
Note: Data extracted from PDFs
Deep Architectures and Deep Learning in Chemoinformatics
Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi
J. Chem. Inf. Model. 2013, 53 (7), 1563-1575
[paper]
Note: Some of the files/data are duplicates
Is Experimental Data Quality the Limiting Factor in Predicting the Aqueous Solubility of Druglike Molecules?
David S. Palmer and John B. O. Mitchell
Mol. Pharmaceutics 2014, 11, 2962−2972
[paper]
Note: Good overview of the sources of the errors in solubility prediction.
Convolutional Networks on Graphs for Learning Molecular Fingerprints
David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams.
arXiv, 2015:
[paper]
[code]
Note1: The original code is written in Python 2. To make it work, use futurize to convert it to Python 3
Note2: install with python setup.py install
Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling
Vladimir Svetnik, Andy Liaw, Christopher Tong, J. Christopher Culberson, Robert P. Sheridan, and Bradley P. Feuston
J. Chem. Inf. Comput. Sci. 2003, 43, 1947-1958
[paper]
Binary Classification of Aqueous Solubility Using Support Vector Machines with Reduction and Recombination Feature Selection
Cheng, T., Li, Q., Wang, Y., and Bryant, S.H.
Journal of Chemical Information and Modeling, 2011, 51, 229-236
[paper]
Note: The measurements come from BioAssay AID:1996, and are done at pH=7.4. Not very useful for a prediction of logS0.
Aqueous Solubility Prediction Based on Weighted Atom Type Counts and Solvent Accessible Surface Areas
Junmei Wang, Tingjun Hou, and Xiaojie Xu
J. Chem. Inf. Model. 2009, 49, 571–581
[paper]
Note: (i) Data in SLN format. CIRpy is needed to convert it to SMILES. (ii) Set-003 looks suspicious, so I excluded it from the training data.
Multi-lab intrinsic solubility measurement reproducibility in CheqSol and shake-flask methods
Alex Avdeef
ADMET & DMPK
[paper]

License

The library is open-source for academic and educational users. If you use the library in any of your work, please cite: Pawel Gniewek, Solubility prediction of drug-like compounds, https://github.com/pgniewko/solubility.
