EvalRetro

Python 3.10 License: MIT

A repository for evaluating single-step retrosynthesis algorithms.

This code was tested on Linux (Ubuntu), Windows, and macOS.

Environment

Set up a new environment by running the following line in your terminal:

conda env create -n evalretro --file environment.yml 
pip install rxnfp --no-deps

For macOS, use the environment_mac.yml file instead:

conda env create -n evalretro --file environment_mac.yml
pip install rxnfp --no-deps

Testing your own algorithm

📚 Discover more about testing your own single-step algorithm!

To test your own retrosynthetic prediction on a test dataset (e.g. USPTO-50k), follow the steps below:

  1. Place the file containing the predictions per molecular target in the ./data/"key" directory ("key" as defined in the config file, see step 2).

    Please ensure your prediction file follows the layout shown in File Structure.

  2. Enter the configuration details in new_config.json inside the config directory, replacing the example entry.

    Please refer to Configuration Structure for the layout.

  3. To ensure that the file has the correct structure, run the following line of code:
    conda activate evalretro
    python data_import.py --config_name new_config.json 
    
  4. If no error is logged in step 3, the algorithm can be evaluated with:
    python main.py --k_back 10 --k_forward 2 --invsmiles 20 --fwd_model 'gcn' --config_name 'new_config.json' --quick_eval True  
    
    Within the script, the following arguments can be adjusted:
    • k_back: Evaluation includes k retrosynthesis predictions per target.
    • k_forward: Forward model includes k target predictions per reactant set.
    • fwd_model: Type of forward reaction prediction model. So far, only gcn is included.
    • config_name: Name of the config file to be used.
    • quick_eval: Boolean; prints the averaged evaluation metrics directly to the terminal.
    • data_path: Path to the folder that contains your file. Default: ./data.

For further help, see the Jupyter notebook provided in the examples directory.

File Structure

The file should follow one of the two formats below, with the first entry per target molecule being the ground-truth reaction, i.e. 1 ground-truth reaction + N predictions per target:

  1. Line-Separated file: N+1 reactions per molecular target are separated by an empty line (example: TiedT)
  2. Index-Separated file: N+1 reactions per molecular target are separated by different indices (example: G2Retro)

The headers within the file should contain the following columns: ["index", "target", "reactants"]
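As a concrete illustration, the following sketch writes a minimal Line-Separated prediction file. The layout is assumed from the description above (1 ground-truth reaction + N predictions per target, blocks separated by an empty line); the SMILES strings and file name are purely illustrative.

```python
import csv

# Illustrative Line-Separated prediction file: the first row per target is
# the ground truth, the following rows are predictions, and an empty line
# closes the block for this target.
rows = [
    {"index": 0, "target": "CCO", "reactants": "CC=O"},  # ground truth
    {"index": 0, "target": "CCO", "reactants": "CCBr"},  # prediction 1
    {"index": 0, "target": "CCO", "reactants": "CCCl"},  # prediction 2
]

with open("file_name.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["index", "target", "reactants"])
    writer.writeheader()
    writer.writerows(rows)
    f.write("\n")  # empty line separating this target from the next
```

An Index-Separated file would instead distinguish targets by their index column rather than by blank lines.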

Configuration File

The configuration for each benchmarked algorithm is stored in the config directory. Specifying the configuration correctly is important so that the code processes the data file as intended. The structure is in .json format and should contain:

"key": {
    "file": "file_name.csv",      # Name of the prediction file in the ./data/"key" directory
    "class": "LineSeparated",     # One of: ["LineSeparated", "IndexSeparated"]
    "skip": bool,                 # false if LineSeparated; true if IndexSeparated
    "delimiter": "comma",         # File delimiter. One of: ["comma", " "]
    "colnames": null,             # null unless the data file's header differs from ["idx", "target", "reactants"]
    "preprocess": bool            # false in most cases
}
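To catch schema mistakes before running data_import.py, an entry can be sanity-checked against the fields listed above. This is a hedged sketch: the key name "my_algo" and the helper check_entry are illustrative, not part of EvalRetro's API.

```python
import json

# Fields required by the config schema described above.
REQUIRED = {"file", "class", "skip", "delimiter", "colnames", "preprocess"}

def check_entry(entry):
    """Return a list of problems found in a single config entry."""
    problems = sorted(f"missing field: {k}" for k in REQUIRED - entry.keys())
    if entry.get("class") not in ("LineSeparated", "IndexSeparated"):
        problems.append("class must be LineSeparated or IndexSeparated")
    if entry.get("delimiter") not in ("comma", " "):
        problems.append('delimiter must be "comma" or " "')
    return problems

config = json.loads("""
{
  "my_algo": {
    "file": "file_name.csv",
    "class": "LineSeparated",
    "skip": false,
    "delimiter": "comma",
    "colnames": null,
    "preprocess": false
  }
}
""")
print(check_entry(config["my_algo"]))  # an empty list means the entry is well-formed
```

Note that, unlike the annotated template above, an actual new_config.json must be valid JSON (no comments, no trailing commas).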

Reproducibility

🔍 Step-by-step guide on how to reproduce results presented in the paper.
  1. Download all data files from the link below and place them inside the ./data directory.

    The data files for all benchmarked algorithms can be found at:
    https://doi.org/10.6084/m9.figshare.25325623.v1

  2. Run the following lines of code within your terminal:
    conda activate evalretro
    python data_import.py --config_name raw_data.json
    python main.py --k_back 10 --k_forward 2 --invsmiles 20 --fwd_model 'gcn' --config_name 'raw_data.json' --quick_eval False
    
  3. Run python plotting.py to generate figures and tables

Interpretability Study

🚀 Click here to find out more details about interpretability of ML-based retrosynthesis models.

The code related to the interpretability study is found in the interpretability folder.

Environment

The environment can be set up by running the following lines of code:

conda create -n rxn_exp python=3.10
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install pyg -c pyg
conda install scikit-learn -c conda-forge
conda install tqdm matplotlib pandas
pip install rdkit

Data Files

Download both data folders (data_interpretability) from the link below and place them into the ./interpret directory:
https://doi.org/10.6084/m9.figshare.25325644.v1

Reproducibility

Pre-trained models are provided via the data link above. However, models can be retrained by running:

conda activate rxn_exp
cd interpret
python train.py --model_type DMPNN

The model_type can be chosen from: DMPNN, EGAT and GCN.

To test the trained models (i.e. EGAT and DMPNN) and create the plots as in the paper, run:

conda activate rxn_exp
python inference.py

Note: The plots for the GNN models may differ slightly from those in the paper due to the stochastic nature of GNNExplainer.

(Figure: example of the interpretability case study.)
