Releases: chemprop/chemprop
v2.0.0 Stable Release
This is the first stable release of Chemprop v2.0.0, with updates since the v2.0.0-rc.1 release candidate in early March.
The primary objectives of v2.0.0 are making Chemprop more usable from within Python scripts, more modular, easier to maintain and develop, more compute/memory efficient, and usable with PyTorch Lightning. Some features will not be migrated from v1 to v2 (e.g. web, sklearn). Some v1 features will be added in later versions of v2 (v2.1+) (e.g. uncertainty, interpret, atom- and bond-targets); see milestones here. The new version also has substantially faster featurization speeds and much higher unit test coverage, enables training on multiple GPUs, and works on Windows (in addition to Linux and Mac). Finally, the incorporation of a batch normalization layer is expected to result in smoother training and improved predictions. We encourage all Chemprop users to try using v2.0.0 to see how it can improve their workflows.
v2 documentation can be found here.
There are v2 tutorial notebooks in the examples/ directory.
A helpful transition guide from Chemprop v1 to v2 can be found here. This includes a side-by-side comparison of CLI argument options, a list of which arguments will be implemented in later versions of v2, and a list of changes to default hyperparameters.
Note that if you install from source, the primary branch of our repository has been renamed from master to main.
Due to limited development team bandwidth, Chemprop v1 will no longer be actively developed, so that we can focus our efforts on v2. Bug reports and questions about v1 are still welcome for the benefit of users who haven't yet made the switch to v2, but the development team will not fix bugs reported in v1.
Please let us know of any bugs you find, questions you have, or enhancements you want in Chemprop v2 by opening an issue.
Final Patch for Version 1
This is the final release of chemprop v1. All future development will be done on chemprop v2. The development team is still happy to answer questions about v1, but no new feature requests or PRs for v1 will be accepted. Users who identify bugs in v1 are still encouraged to open issues to report them - they will be tagged as v1-wontfix to signify that we won't be publishing fixes for them in official chemprop releases, but the bugs can still be open to community discussion.
We encourage all users to try migrating their workflows over to chemprop v2 (available now as a release candidate, stable version planned to be released within the next week) and let us know of any issues you encounter. All v1 releases will remain available on PyPI, and the v1 source code will remain available in this GitHub organization.
What's Changed
- fix the uncal_vars for atom/bond property prediction by @shihchengli in #712
- [v1]: Add Docker Image Building Action and Official Images to DockerHub by @JacksonBurns in #718
- remove macos and windows from v1 ci by @JacksonBurns in #720
- update docker build if to use correct upstream branch name by @JacksonBurns in #723
- fix the task names by @shihchengli in #725
- Fixed typo in README.md by @willspag in #745
New Contributors
Full Changelog: v1.7.0...v1.7.1
v2.0.0 Release Candidate
This is a release candidate for Chemprop v2.0.0, to be released in April 2024.
The primary objectives of v2.0.0 are making Chemprop more usable from within Python scripts, more modular, easier to maintain and develop, more compute/memory efficient, and usable with PyTorch Lightning. Some features will not be migrated from v1 to v2 (e.g. web, sklearn). Some v1 features will be added in later versions of v2 (v2.1+) (e.g. uncertainty, interpret, atom- and bond-targets); see milestones here. The new version also has substantially faster featurization speeds and much higher unit test coverage, enables training on multiple GPUs, and works on Windows (in addition to Linux and Mac). Finally, the incorporation of a batch normalization layer is expected to result in smoother training and improved predictions. The label as a “release candidate” reflects its availability to be downloaded via PyPI and that only minor changes are expected for the Python API before the final release. We expect most remaining changes before the release of v2.0.0 in April to be focused on additional improvements to the command line interface (CLI), which does not yet have feature parity with v1. We encourage all Chemprop users to try using v2.0.0-rc.1 to see how it can improve their workflows.
The v2 documentation can be found here.
There are tutorial notebooks for v2 in the examples/ directory.
A helpful transition guide from v1 to v2 can be found here. This includes a side-by-side comparison of CLI argument options, a list of which arguments will be implemented in later versions of v2, and a list of changes to default hyperparameters.
You can subscribe to our development status and notes for this version: #517.
Ongoing work for this version is available on the v2/dev branch.
Please let us know of any bugs you find by opening an issue.
Conformal Calibration
What's Changed
- new split per molecular weight by @soulios in #456
- Specify license for Chemprop logos by @mliu49 in #461
- Add todo.md by @davidegraff in #492
- Update authors list in license file and alphabetically sort by @cjmcgill in #532
- update authors in LICENSE and setup files for v1 by @kevingreenman in #533
- Fix Transpose bug in Inequality Regression by @cjmcgill in #308
- Add Dirichlet Evidential Uncertainty Quantification by @cjmcgill in #423
- New metrics by @soulios in #542
- Updating README with ADMET-AI details by @swansonk14 in #554
- Improve error message when gilbrat is needed. by @KnathanM in #569
- limit chempropv1 python version to 3.7, 3.8 only by @JacksonBurns in #618
- Add a CITATIONS.bib by @JacksonBurns in #627
- Limit Maximum Allowed flask Version in v1 by @JacksonBurns in #628
- move num_unc_tasks definition to ensure always defined by @kevingreenman in #632
- Switching np.mean to np.nanmean to handle NaN metrics by @swansonk14 in #453
- Fix the dtype for targets of different sizes by @shihchengli in #638
- Add setters for atom and bond constraints by @shihchengli in #637
- switch v1 readthedocs build from conda to mamba by @kevingreenman in #660
- Fix v1 docs theme by @kevingreenman in #669
- Conformal Calibration by @danielxu9393 in #304
- add note on feature releases and instructions for ssl+ddp by @JacksonBurns in #685
- remove unnecessary argument for reshape function by @shihchengli in #671
- Fix atom/bond property prediction with atom-mapped SMILES and target classification by @shihchengli in #673
- Pass num_workers to MoleculeDataLoader during interpretation by @kevingreenman in #691
- conformal quantile prediction bug fix by @shihchengli in #693
New Contributors
- @soulios made their first contribution in #456
- @danielxu9393 made their first contribution in #304
Full Changelog: v1.6.1...v1.7.0
Bug fix for reaction atom mapping
Bug fix
PR #383 unexpectedly broke the atom mapping for reaction mode. The issue is described in Issue #426 and fixed by PR #427.
What's Changed
- Fix versioning issues - metadata and dependencies by @kevingreenman in #420
- add job to tests action for PyPI package by @JacksonBurns in #422
- added chemprop manuscript to readme by @hesther in #425
- Keep Support for Python 3.7 and 3.8 when fixing gilbrat Issue by @JacksonBurns in #431
- Fix reaction atom mapping by @shihchengli in #427
Full Changelog: v1.6.0...v1.6.1
Atomic/bond targets prediction
Major New Features
- Atomic/bond targets prediction by @shihchengli in #280
What's Changed
- Replace multiclass mcc with 1-mcc for loss by @cjmcgill in #332
- Add chemprop logo by @shihchengli in #339
- Add CodeQL workflow for GitHub code scanning by @lgtm-com in #344
- Add to the description of evidential regularization by @cjmcgill in #353
- Remove deprecated numpy float types by @cjmcgill in #357
- Correct a bug in ENCE uncertainty evaluation by @cjmcgill in #360
- Hyperopt Parallel Race Conditions and Manual Trial Load by @cjmcgill in #307
- Simplified install with PyPI rdkit and git install in setup.py by @JacksonBurns in #364
- Allow providing both loaded features and a features generator by @shihchengli in #318
- For any multiclass task, make_predictions fails if option --individual_ensemble_predictions is on. by @piotr-semenov in #354
- Save loaded molecular features into .npy files by @shihchengli in #337
- Ignore invalid atom-mapped SMILES by @shihchengli in #367
- Molecule fingerprinting with invalid SMILES in list by @shihchengli in #351
- change calibration_features_path from str to List[str] by @ceroth in #358
- Change logo style by @shihchengli in #369
- Clamp evidential 'v' parameter by @kevingreenman in #371
- fix colab demo by @kevingreenman in #368
- Avoid OverflowError when setting field size to sys.maxsize by @shihchengli in #373
- Set atom and bond constraints when loading model by @shihchengli in #374
- Readme updates by @kevingreenman in #385
- Remove atom map numbers for scaffold splits by @shihchengli in #383
- update bug report template - ask for full stack trace by @kevingreenman in #401
- Fix t-SNE script by @kevingreenman in #403
- Fixing skipped lines in csv writing when using a windows computer by @cjmcgill in #406
Full Changelog: v1.5.2...v1.6.0
Flexible hyperparameter search, missing uncertainty target values, evaluation of different magnitude multitask targets, empty test set assignment, and DockerFile updates
Features
Flexible hyperparameter search space
The parameters to be included in hyperparameter optimization can now be selected using the argument --search_parameter_keywords {list-of-keywords}. The parameters supported are: activation, aggregation, aggregation_norm, batch_size, depth, dropout, ffn_hidden_size, ffn_num_layers, final_lr, hidden_size, init_lr, max_lr, warmup_epochs. Some special keywords are also included for groups of keywords or different search behavior: basic, learning_rate, all, linked_hidden_size.
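A keyword list can be checked before a long hyperparameter search is launched. The following is a minimal sketch, not part of Chemprop: the helper names check_search_keywords and build_keyword_arguments are hypothetical, the keyword names are copied from the list above, and the flag spelling --search_parameter_keywords is assumed.

```python
# Keyword names copied from the release note; the helpers are hypothetical
# and are not part of Chemprop.
SUPPORTED_PARAMETERS = {
    "activation", "aggregation", "aggregation_norm", "batch_size", "depth",
    "dropout", "ffn_hidden_size", "ffn_num_layers", "final_lr", "hidden_size",
    "init_lr", "max_lr", "warmup_epochs",
}
SPECIAL_KEYWORDS = {"basic", "learning_rate", "all", "linked_hidden_size"}


def check_search_keywords(keywords):
    """Return the keywords as a list, or raise ValueError naming unknown ones."""
    allowed = SUPPORTED_PARAMETERS | SPECIAL_KEYWORDS
    unknown = sorted(set(keywords) - allowed)
    if unknown:
        raise ValueError(f"Unsupported search keywords: {unknown}")
    return list(keywords)


def build_keyword_arguments(keywords):
    """Build the command-line fragment for a checked keyword list."""
    # Flag spelling is an assumption; confirm it against the installed version.
    return ["--search_parameter_keywords", *check_search_keywords(keywords)]


print(build_keyword_arguments(["depth", "dropout", "learning_rate"]))
```

A misspelled name such as hiden_size raises a ValueError here instead of surfacing later during the search.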
PR #299
Missing targets in uncertainty calibration datasets
Added capabilities to the uncertainty calibration and evaluation methods to allow them to handle missing target values in multitask jobs. This capability was already included in the normal training of models and is now also implemented in uncertainty calibration and evaluation.
PR #295
Issue #292
Multitask evaluation for tasks of different magnitudes
When an evaluation metric tends to scale with the magnitude of a task (e.g., rmse), averaging the metric between tasks has been replaced with a geometric mean function. This makes the average metric in multitask regression jobs less dominated by large-magnitude targets. The previous averaging was an issue for hyperparameter optimization and for the selection of the optimal epoch during model training, though the calculation of loss for gradient descent is on scaled targets and was already not scale dependent.
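The effect can be seen with two tasks of very different magnitude. The sketch below is illustrative only: the per-task rmse values are made up, and the functions are plain Python stand-ins rather than Chemprop's implementation.

```python
import math


def arithmetic_mean(values):
    """Plain average; large values dominate the result."""
    return sum(values) / len(values)


def geometric_mean(values):
    """Geometric mean of strictly positive values, computed in log space."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-task rmse values for two tasks of different magnitude.
before = [0.01, 100.0]
after = [0.005, 100.0]  # the small-magnitude task improves by a factor of two

print(arithmetic_mean(before), arithmetic_mean(after))
print(geometric_mean(before), geometric_mean(after))
```

Halving the error of the small-magnitude task barely moves the arithmetic mean, while the geometric mean drops by a factor of the square root of two, so an improvement on either task is visible in the averaged metric.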
PR #290
Empty test set allowed
An empty test split can now be used during training. This was previously possible only using the cv-no-test split method, but now it is available more widely when specifying split sizes, for example with --split_sizes 0.8 0.2 0.
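To make the meaning of a zero test fraction concrete, here is a minimal sketch of a random split driven by three fractions. The function split_by_sizes is hypothetical and is not Chemprop's splitting code; it only illustrates that a third fraction of 0 yields an empty test list.

```python
import random


def split_by_sizes(items, sizes=(0.8, 0.2, 0.0), seed=0):
    """Shuffle items and cut them into train/validation/test by fraction."""
    if len(sizes) != 3 or any(s < 0 for s in sizes):
        raise ValueError("sizes must be three non-negative fractions")
    if abs(sum(sizes) - 1.0) > 1e-9:
        raise ValueError("sizes must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train_end = int(sizes[0] * len(shuffled))
    # With a zero test fraction, everything after the training cut is validation.
    if sizes[2] == 0:
        val_end = len(shuffled)
    else:
        val_end = int((sizes[0] + sizes[1]) * len(shuffled))
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


train, val, test = split_by_sizes(range(10), sizes=(0.8, 0.2, 0.0))
print(len(train), len(val), len(test))
```

With ten items and fractions 0.8, 0.2, and 0, the split sizes are 8, 2, and 0.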
PR #284, #260 related
Issue #279
Updates to conda environment and docker file
Conda environment building will now prefer to use the pytorch channel over the conda-forge channel. The Dockerfile has been updated to use micromamba, allowing for faster environment solves than conda and removing a potential licensing issue.
PR #276
Bug Fixes
Fix MCC loss for multiclass jobs
Corrected a calculation problem in the loss function that was returning infinite loss inappropriately. Also adopted the convention of returning loss of zero when infinite loss is returned, as often happens in very unbalanced datasets. Added appropriate unit testing.
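The degenerate case can be illustrated with a binary simplification. This is a sketch under stated assumptions, not the code from the PR: it uses hard confusion counts rather than predicted probabilities, it takes the loss to be 1 - MCC, and the function name mcc_loss is hypothetical.

```python
import math


def mcc_loss(tp, tn, fp, fn):
    """Return 1 - MCC for binary confusion counts, or 0.0 when MCC is undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Degenerate case, for example when every prediction falls in one
        # class: the ratio is undefined, so follow the zero-loss convention.
        return 0.0
    mcc = (tp * tn - fp * fn) / denominator
    return 1.0 - mcc


print(mcc_loss(tp=5, tn=5, fp=0, fn=0))   # perfect predictions
print(mcc_loss(tp=0, tn=9, fp=0, fn=1))   # nothing predicted positive
print(mcc_loss(tp=0, tn=0, fp=5, fn=5))   # every prediction wrong
```

In a very unbalanced dataset a batch can easily contain no predicted positives, which zeroes one factor of the denominator; that is the situation the zero-loss convention guards against.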
PR #309
Issue #306
Correct code error in ence uncertainty evaluation
Corrects an error in the ence uncertainty evaluation method that made that method unusable. Bug was introduced during PR #305.
PR #302
Issue #301
Fixed link to MoleculeNet website
Corrected the link to the MoleculeNet benchmark dataset website in the readme, following MoleculeNet migrating to a new site location.
PR #296
Multitarget uncertainty calibration mve weighting method
Previously, this method only worked for single-task jobs. It has now been extended to work for multitask models as well.
PR #291
Remove unused version.py file
Version tracking in Chemprop no longer uses the version.py file and it was removed.
PR #283
Multiclass argument typo in readme
Corrected a typo where the number of classes used in multiclass classification should have been indicated as --multiclass_num_classes.
PR #281
Repair individual ensemble predictions
Refactoring of the prediction file during the addition of uncertainty functions disabled the option to return the individual predictions of each member of an ensemble of models. The option is now available again.
PR #274
Quick Fix to Uncertainty Evaluation
Bugfix
Inconsistent Path For Uncertainty Evaluation
Fixed a bug in uncertainty evaluation where the uncertainty evaluator was using the path name originally used to train a checkpoint. This made the uncertainty evaluator only work in the case that the test data and training data used in initial model training had the same path.
Uncertainty Functions, Reaction-Solvent Models, Loss Function Options, Keyed Splitting, and Chemprop Colab Demo
Features
Uncertainty Tools
Tools added for uncertainty quantification, calibration, and evaluation as part of the chemprop predict function. Uncertainty predictions are saved as part of the predictions file. Uncertainty functions and outputs are triggered using the argument --uncertainty_method {method}.
Uncertainty outputs can be calibrated using an outside dataset (evaluation set from training is often suitable) in order to have better uncertainty estimates on new predictions. Can be activated using --calibration_method {method} and --calibration_path {path-to-csv}. For the regression dataset type, a calibrated output can provide either a standard deviation or one-sided interval bound, as set with the options --regression_calibrator_metric {stdev-or-interval} and --calibration_interval_percentile {int}.
If the data file containing smiles for the test path also contains target values, the uncertainty performance can be evaluated using various metrics, activated with the option --evaluation_methods {list-of-methods}.
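The three groups of options (estimation, calibration, evaluation) can be combined in a single prediction command. The sketch below only assembles an argument list and does not run Chemprop. Several things in it are assumptions not stated in these notes: the chemprop_predict entry point and its --test_path, --checkpoint_dir, and --preds_path options, the method names mve, zscaling, and ence used as example values, the file names, and the rule that a calibration method needs a calibration path. The helper build_predict_command is hypothetical.

```python
import shlex


def build_predict_command(test_path, checkpoint_dir, preds_path,
                          uncertainty_method, calibration_method=None,
                          calibration_path=None, evaluation_methods=()):
    """Assemble a prediction argument list with uncertainty options."""
    command = [
        "chemprop_predict",  # assumed v1 entry point
        "--test_path", test_path,
        "--checkpoint_dir", checkpoint_dir,
        "--preds_path", preds_path,
        "--uncertainty_method", uncertainty_method,
    ]
    if calibration_method is not None:
        # Assumption: calibration always needs its own dataset.
        if calibration_path is None:
            raise ValueError("calibration needs a calibration dataset path")
        command += ["--calibration_method", calibration_method,
                    "--calibration_path", calibration_path]
    if evaluation_methods:
        command += ["--evaluation_methods", *evaluation_methods]
    return command


# All paths and method names below are placeholders.
command = build_predict_command(
    "test.csv", "checkpoints", "preds.csv",
    uncertainty_method="mve",
    calibration_method="zscaling",
    calibration_path="calibration.csv",
    evaluation_methods=["ence"],
)
print(shlex.join(command))
```

The evaluation option only makes sense when the test file also contains target values, as noted above.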
Internally, this PR creates several classes for carrying out prediction tasks: UncertaintyEstimator, UncertaintyPredictor, UncertaintyCalibrator, UncertaintyEvaluator. Loss functions have been added that have auxiliary uncertainty outputs, mve and evidential for regression.
PR #267
PR #269
Reaction-Solvent Option
Gives the option to train a chemprop model using one reaction and one molecule for each datapoint. Active when used with the option --reaction_solvent. Options for making the solvent mpnn use different parameters than that for the reaction are possible using --bias_solvent, --hidden_size_solvent {int}, and --depth_solvent {int}.
PR #246
Multimolecule Fingerprinting
Added some new changes for fingerprint functions with multiple molecules. Models trained with a "shared-mpn" between two molecules can return a MPN fingerprint with only one molecule provided. Also, when multiple molecule models are used for MPN fingerprint generation, the output will indicate which molecule each element belongs to.
PR #242
Issue #236
Colab Notebook Examples
Created a Jupyter notebook that runs examples of Chemprop jobs, specifically as the functions can be used in python. Good resource for new users, demonstrations, or tutorials. Linked to Google Colab so that it can be run remotely, not requiring any local install of Chemprop.
PR #239
PR #273
Loss Function Options
Previously, loss functions were selected automatically based on the dataset type being used in model training. Now the loss function can be selected with --loss_function {function}. Some new specialty loss functions have been added with this capability.
- Matthews Correlation Coefficient (mcc) is a loss function for classification and multiclass that considers True Positives, True Negatives, False Positives, False Negatives separately in the loss function, avoiding domination by one class and making it well suited to unbalanced training sets.
- Bounded Mean Squared Error (bounded_mse) is a regression loss function that allows for training targets expressed as inequalities, e.g. ">5.0". Intended for use with experimental data with delimited ranges.
- Mean Variance Estimation (mve) and evidential loss are regression loss functions that maximize the likelihood of the target on an estimated uncertainty distribution. When used as loss functions, the outputs of these functions can be used in uncertainty estimation.
Appropriate metrics have been added along with these loss functions.
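Two of these losses are easy to write out for a single datapoint. The sketch below is a plain-Python illustration under stated assumptions, not Chemprop's tensor implementation: the function names and the ">" / "<" bound markers are hypothetical, and mve is taken to be the negative log-likelihood of a normal distribution with a predicted mean and variance.

```python
import math


def bounded_squared_error(prediction, target, bound=None):
    """Squared error that vanishes when an inequality target is satisfied.

    bound is None for an exact target, ">" when the true value is only known
    to exceed target, and "<" when it is only known to fall below target.
    """
    if bound == ">" and prediction >= target:
        return 0.0
    if bound == "<" and prediction <= target:
        return 0.0
    return (prediction - target) ** 2


def mve_negative_log_likelihood(mean, variance, target):
    """Negative log-likelihood of target under a normal N(mean, variance)."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return 0.5 * math.log(2 * math.pi * variance) + (target - mean) ** 2 / (2 * variance)


# A target reported as ">5.0": predicting 6.0 is consistent, 4.0 is penalized.
print(bounded_squared_error(6.0, 5.0, bound=">"))
print(bounded_squared_error(4.0, 5.0, bound=">"))

# A confident wrong prediction costs more than an uncertain one.
print(mve_negative_log_likelihood(mean=0.0, variance=0.1, target=2.0))
print(mve_negative_log_likelihood(mean=0.0, variance=4.0, target=2.0))
```

Minimizing the second function rewards a model for reporting a larger variance where its mean is unreliable, which is why the variance output can double as an uncertainty estimate.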
PR #238
PR #267
Development Environment
GitHub Addons
Added a CONTRIBUTING.md file with guidelines for how users can contribute to Chemprop. New templates are now available for issue submission that distinguish between different issue types: bug report, feature request, and questions. New templates are also suggested for PRs. Templates are stored in the .github directory.
PR #241
Unit Testing
Part of an ongoing effort to include a more complete set of automated tests for Chemprop. Unit tests added for data utils, uncertainty-related loss functions, and the uncertainty evaluation metrics.
PR #232
PR #267
PR #269
Flake8 Formatting
Ongoing effort to standardize the formatting of incoming code. New PRs now request/require the new code to be flake8 compliant in formatting. The utils module and files significantly associated with the new uncertainty function are flake8 compliant.
PR #241
PR #258
PR #267
Update Versioning
Changed the way that version numbers are stored and updated throughout the code.
PR #247
Remove Assertion Errors
Removed many of the assertion errors throughout Chemprop and replaced them with more easily interpretable error types and messages.
PR #257
Bug Fixes
Hyperopt Version Fix
Changed the way that random seeds are passed into hyperopt during hyperparameter optimization to avoid an error where hyperopt stopped supporting a previously supported way of passing numpy seeds.
PR #245
Issue #243
Issue #254
Issue #264
Prediction function output options, multi-molecule splitting, and explicit H atoms in message passing
Features
Allow the inclusion of H atoms in message passing
Default model behavior is to treat H atoms implicitly with their neighbors. With the previously existing argument --explicit_h, explicit H atoms included in the SMILES string would be considered during message passing. This PR adds a new argument --adding_h, which makes all H atoms be treated explicitly during message passing.
PR #225 and #227
Allow splitting by different key molecules in multi-molecule models
The data-splitting methods scaffold_balanced and random_with_repeated_smiles can only consider one molecule per datapoint in adhering to the constraints of which data must share splits with each other. This PR creates an argument --split_key_molecule {int}, which is used to select which molecule in multi-molecule datasets will be used for the splitting determination.
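The idea of a key molecule can be sketched as grouping datapoints by one chosen column and assigning whole groups to splits. This is a simplified, hypothetical illustration, not Chemprop's splitting code: the function name, the greedy fill order, and the example SMILES pairs are all assumptions.

```python
import random
from collections import defaultdict


def split_by_key_molecule(datapoints, split_key_molecule=0,
                          sizes=(0.8, 0.1, 0.1), seed=0):
    """Split indices so datapoints sharing a key molecule share a split.

    Each datapoint is a sequence of SMILES strings, one per molecule.
    """
    groups = defaultdict(list)
    for index, smiles in enumerate(datapoints):
        groups[smiles[split_key_molecule]].append(index)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    train, val, test = [], [], []
    train_target = sizes[0] * len(datapoints)
    val_target = sizes[1] * len(datapoints)
    for key in keys:
        group = groups[key]
        # Whole groups go to the first split that still has room.
        if len(train) + len(group) <= train_target:
            train += group
        elif len(val) + len(group) <= val_target:
            val += group
        else:
            test += group
    return train, val, test


# Hypothetical two-molecule datapoints: (solute, solvent).
data = [("CCO", "O"), ("CCO", "CC#N"), ("CCN", "O"),
        ("CCC", "O"), ("CCC", "CC#N"), ("c1ccccc1", "O")]
print(split_by_key_molecule(data, split_key_molecule=0, sizes=(0.5, 0.25, 0.25)))
```

Changing split_key_molecule to 1 in this example would group by the second molecule instead, so all datapoints with the same solvent would land in the same split.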
PR #230
Select split fractions when separate test data is provided
Previously, the split fractions for training/validation were hardcoded as 80/20 when test data was provided via --separate_test_path. Split fractions can now be specified in this case using --split_sizes as normal.
PR #230
Additional output options for make_predictions function
This change affects usage of make_predictions as a python function, rather than in the whole Chemprop workflow. When used as a python function, make_predictions would return the predictions for a set of SMILES, but would skip the invalid SMILES without indicating which ones were skipped. Now this function has two new optional arguments: 1) return_invalid_smiles, which includes invalid SMILES in the output but with "Invalid SMILES" as the prediction value, and 2) return_index_dict, which returns predictions of the model in a dictionary keyed to the original data indices.
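The difference between the output modes can be shown with a toy stand-in. The function below is not Chemprop's make_predictions: predict_one and is_valid are hypothetical callables supplied by the caller, predictions are single values rather than per-task lists, and the behavior when both options are set together is an assumption.

```python
def toy_make_predictions(smiles, predict_one, is_valid,
                         return_invalid_smiles=False, return_index_dict=False):
    """Toy stand-in showing the two output options; not Chemprop's function."""
    by_index = {i: predict_one(s) for i, s in enumerate(smiles) if is_valid(s)}

    if return_index_dict:
        if return_invalid_smiles:
            # Assumed combined behavior: every index present, invalid ones flagged.
            return {i: by_index.get(i, "Invalid SMILES") for i in range(len(smiles))}
        return by_index
    if return_invalid_smiles:
        return [by_index.get(i, "Invalid SMILES") for i in range(len(smiles))]
    # Old behavior: invalid entries are dropped with no trace of their position.
    return [by_index[i] for i in sorted(by_index)]


smiles = ["CCO", "not_a_smiles", "CCC"]


def is_valid(s):
    return s != "not_a_smiles"  # placeholder validity check


print(toy_make_predictions(smiles, len, is_valid))
print(toy_make_predictions(smiles, len, is_valid, return_invalid_smiles=True))
print(toy_make_predictions(smiles, len, is_valid, return_index_dict=True))
```

With the default output, the caller cannot tell which input the second prediction belongs to; either option restores that alignment.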
PR #235
New utility functions for identifying invalid SMILES
New functions have been added to chemprop/data/utils.py to allow users to identify datapoints that have invalid SMILES. These functions are get_invalid_smiles_from_file and get_invalid_smiles_from_list.
PR #235
Bug Fixes
Simultaneous use of extra atom features and extra bond features
Bug prevented using extra atom features and extra bond features at the same time and has been resolved.
PR #215
Issue #213
Fixed install error with newer versions of pip
Newer versions of pip failed to install some chemprop dependencies properly. These dependencies (flake8, pytest, parameterized) were moved to an installation as part of the conda environment rather than by pip. Also, the environment build for testing was changed from conda to mamba for better install speed.
PR #215 and #216
Correction in tutorial file
Tutorial file changed to show the proper list of lists format for SMILES.
PR #218
Predicting for a multiclass model with an improper SMILES
When making a prediction for an improper SMILES in a multiclass model, an error would be triggered instead of returning a prediction of "Invalid SMILES". This has been corrected for this case and the parallel case of improper SMILES used with --individual_ensemble_predictions.
PR #229
Molecule fingerprints generated with extra atom features
Molecule fingerprints could not be predicted when extra atom features were provided as part of the model. This and the parallel issue with extra bond features have been addressed.
PR #234
Issue #233