
lcmd-epfl/benchmark-barrier-learning


Benchmarking of reaction representations

Installation

BERT+RXNFP environment

  • Due to the various dependencies of the different packages, it is not possible to have a single environment that satisfies them all. In particular, the rxnfp package relies on python 3.7 (the latest python version for which it works). It is therefore recommended to have a separate environment to run the BERT+RXNFP models (a quick installation check is sketched after the commands below):
conda create -n rxnfp python=3.7 -y
conda activate rxnfp
conda install -c rdkit rdkit=2020.03.3 -y
conda install -c tmap tmap -y
git clone https://github.com/rxn4chemistry/rxnfp
pip install -e rxnfp
  • Note that the above does not work on Mac computers with the M1 chip or later, since these do not support python 3.7. This environment has been tried and tested on several Linux platforms.
  • It is recommended to run the BERT+RXNFP models in a GPU-enabled environment.
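  • To check that the installation works, the fingerprint generator documented in the rxnfp README can be run on any reaction SMILES (the example reaction below is an arbitrary placeholder):
from rxnfp.transformer_fingerprints import (
    RXNBERTFingerprintGenerator, get_default_model_and_tokenizer
)

# Load the pretrained reaction BERT shipped with rxnfp and wrap it in a fingerprint generator
model, tokenizer = get_default_model_and_tokenizer()
rxnfp_generator = RXNBERTFingerprintGenerator(model, tokenizer)

# Any reaction SMILES of the form reactants>>products works here
example_rxn = "CC(=O)O.OCC>>CC(=O)OCC.O"
fp = rxnfp_generator.convert(example_rxn)
print(len(fp))  # 256-dimensional reaction fingerprint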

Chemprop environment

  • This can be installed using conda create -n <environment-name> --file requirements_chemprop.txt
  • Note that a modified version of chemprop is used (forked from 1.6.1), which allows hyperparameter optimisation to be run only on the first CV fold and reports timings for training and inference separately (a generic chemprop CGR run is sketched below for orientation)
  • It is recommended to run Chemprop in a GPU-enabled environment.
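  • For orientation, a plain CGR regression run with the upstream chemprop v1 Python API looks roughly as below; the target column name is a placeholder, and the actual runs in this repo go through src/cgr.py with the hyperparameters stored in data:
import chemprop

# Hypothetical argument list: a CSV with a reaction-SMILES column and a target column
arguments = [
    '--data_path', 'data/cyclo/full_dataset.csv',
    '--smiles_columns', 'rxn_smiles',
    '--target_columns', 'target',        # placeholder: the barrier column of the dataset
    '--dataset_type', 'regression',
    '--reaction',                        # build the condensed graph of reaction (CGR)
    '--save_dir', 'cgr_checkpoints',
]

args = chemprop.args.TrainArgs().parse_args(arguments)
mean_score, std_score = chemprop.train.cross_validate(
    args=args, train_func=chemprop.train.run_training
)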

Fingerprints environment

  • This can be installed using conda env create -n <environment-name> --file environment_fingerprints.yml
  • The key packages are qml (for SLATM and B2R2), drfp (for DRFP), and rdkit (for MFP; be careful with the version) for the fingerprints, plus scikit-learn and numpy for the models (a minimal sketch of how they fit together is given below)
  • These fingerprints do not benefit from a GPU; run on CPU only
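  • A minimal sketch of how these pieces fit together (not the repo's actual pipeline; the reaction SMILES, targets, and hyperparameters below are placeholders, the tuned values live in src/hypers.py):
import numpy as np
from drfp import DrfpEncoder                    # DRFP reaction fingerprints
from rdkit.Chem import AllChem, MolFromSmiles   # Morgan fingerprints (MFP)
from sklearn.kernel_ridge import KernelRidge

rxn_smiles = ["CC(=O)O.OCC>>CC(=O)OCC.O", "C=C.C=CC=C>>C1=CCCCC1"]  # placeholder reactions
barriers = np.array([10.0, 20.0])                                   # placeholder targets

# DRFP encodes the full reaction SMILES directly
X_drfp = np.array(DrfpEncoder.encode(rxn_smiles, n_folded_length=2048))

# A simple difference MFP: product fingerprint minus reactant fingerprint
def diff_mfp(rxn, n_bits=2048):
    reactants, products = rxn.split(">>")
    fps = [np.array(AllChem.GetMorganFingerprintAsBitVect(MolFromSmiles(side), 2, nBits=n_bits))
           for side in (reactants, products)]
    return fps[1] - fps[0]

X_mfp = np.array([diff_mfp(r) for r in rxn_smiles])

# Kernel ridge regression on the fingerprints (kernel and regularisation are placeholders)
model = KernelRidge(kernel="laplacian", alpha=1e-6, gamma=1e-3)
model.fit(X_drfp, barriers)
print(model.predict(X_drfp))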

EquiReact environment

Using the repo / reproducing the results

2D/3D fingerprints

  • The 2D representations here are DRFP and MFP
  • The 3D representations are SLATM and $B^2R^2_l$; all are used in the difference variant for reactions
  • To generate/load the representations and perform CV-fold cross-validated predictions with a train fraction of tr, use src/run_all_fingerprints.py
  • Here, the datasets are specified as additional arguments, -c for Cyclo-23-TS, -g for GDB7-22-TS and -p for Proparg-21-TS
  • For example, for a train fraction of 0.8 (with corresponding test and validation fractions of 0.1 each) and 10-fold CV, the command python src/run_all_fingerprints.py -c -g -p --CV=10 --train=0.8 will generate/load the representations and run the models or load the results (a sketch of such a split is given below)
  • The hyperparameters for each dataset can be found in src/hypers.py
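  • For reference, a 0.8/0.1/0.1 split with a fresh shuffle per CV fold can be set up as in the generic scikit-learn sketch below (not necessarily the exact splitting logic of src/run_all_fingerprints.py):
import numpy as np
from sklearn.model_selection import train_test_split

n_samples, n_cv, train_frac = 1000, 10, 0.8   # placeholder dataset size
indices = np.arange(n_samples)

for fold in range(n_cv):
    # 80% train; the remaining 20% is split evenly into validation and test
    idx_train, idx_rest = train_test_split(indices, train_size=train_frac, random_state=fold)
    idx_val, idx_test = train_test_split(idx_rest, test_size=0.5, random_state=fold)
    print(fold, len(idx_train), len(idx_val), len(idx_test))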

Language models

  • A pre-trained BERT model is fine-tuned on the appropriate dataset (cyclo, gdb, or proparg); the core idea is sketched at the end of this section
  • Here, the dataset is provided as an optional argument to src/train_bert.py
  • To train on and then predict on the cyclo dataset, for example, the command is python src/train_bert.py -t -p --dataset='cyclo' --test_size=0.2
  • The optional --CV flag determines whether to shuffle the train and predict datasets over the CV iterations
  • The prediction MAEs of the 10 CV iterations are saved to outs/cyclo_bert_pretrained/results.txt (and the equivalent paths for the other datasets)
  • The hyperparameters for each dataset can be found in src/hypers.py
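  • The fine-tuning itself happens inside src/train_bert.py; the core idea (a single-output regression head on top of a pretrained reaction BERT) is sketched generically below with Hugging Face transformers, where the checkpoint path, reaction SMILES, and target are placeholders:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder path to a pretrained reaction BERT checkpoint;
# num_labels=1 turns the classification head into a single regression output
model_name = "path/to/pretrained_rxn_bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

rxn = "CC(=O)O.OCC>>CC(=O)OCC.O"   # placeholder reaction SMILES
target = torch.tensor([[10.0]])    # placeholder barrier

inputs = tokenizer(rxn, return_tensors="pt", truncation=True)
out = model(**inputs)
loss = torch.nn.functional.mse_loss(out.logits, target)  # fine-tune by minimising the MSE
loss.backward()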

Graph based models

  • The CGR model was run using the provided atom mapping, automatic atom mapping from rxnmapper, and no atom mapping
  • The data are in the csv files data/gdb7-22-ts/ccsdtf12_dz.csv for GDB7-22-TS, data/cyclo/full_dataset.csv for Cyclo-23-TS and data/proparg/data.csv for Proparg-21-TS
  • SMILES with the provided ("true") atom mapping are in the rxn_smiles (Cyclo-23-TS, GDB7-22-TS) or rxn_smiles_mapped (Proparg-21-TS) columns of each csv file
  • To run rxnmapper, run src/mapper.py (the standard rxnmapper usage is sketched at the end of this section)
  • Atom mapped SMILES from rxnmapper are in the rxn_smiles_rxnmapper column of each csv file
  • The Python file specifying how each CGR model was run is src/cgr.py. Note that the hyperparameters for each dataset are in data (these are referenced by cgr.py)
  • The results of the CGR runs can be found in results
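  • src/mapper.py relies on rxnmapper; the standard rxnmapper usage looks like the sketch below (the reaction SMILES is a placeholder):
from rxnmapper import RXNMapper

rxn_mapper = RXNMapper()
rxns = ["CC(=O)O.OCC>>CC(=O)OCC.O"]   # placeholder reaction SMILES
results = rxn_mapper.get_attention_guided_atom_maps(rxns)
print(results[0]["mapped_rxn"], results[0]["confidence"])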

EquiReact

  • The submission files for each of the EquiReact jobs can be found in the equireact-subfiles directory. Note that these will need to be adapted somewhat to your cluster setup (there are hard-coded scratch paths, for example).
  • The results of running EquiReact can be found in the equireact-results directory

About

Benchmarking reaction representations for the learning of reaction barriers
