Instructions

This document outlines the process of training SynNet from scratch, step by step.

⚠️ It is still a WIP.

You can use any set of reaction templates and building blocks, but we will illustrate the process with the Hartenfeller-Button reaction templates and Enamine building blocks.

Note: This project depends on a lot of exact filenames: one script saves to a file, and the next reads that file for further processing. It is not a perfect approach; we are open to feedback.

Let's start.

Step-by-Step

  1. Prepare reaction templates and building blocks.

    Extract SMILES from the .sdf file from enamine.net.

    python scripts/00-extract-smiles-from-sdf.py \
        --input-file="data/assets/building-blocks/enamine-us.sdf" \
        --output-file="data/assets/building-blocks/enamine-us-smiles.csv.gz"
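
    Conceptually, this step just reads molecules from the .sdf file and writes their SMILES to a compressed CSV. A minimal sketch of the idea, assuming RDKit and pandas (the column name "SMILES" is an assumption, not necessarily what the script writes):

    import pandas as pd
    from rdkit import Chem

    # Read all molecules from the SDF; skip entries RDKit cannot parse.
    supplier = Chem.SDMolSupplier("data/assets/building-blocks/enamine-us.sdf")
    smiles = [Chem.MolToSmiles(mol) for mol in supplier if mol is not None]

    # pandas infers gzip compression from the ".gz" suffix.
    pd.DataFrame({"SMILES": smiles}).to_csv(
        "data/assets/building-blocks/enamine-us-smiles.csv.gz", index=False
    )
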
  2. Filter building blocks.

    We preprocess the building blocks to identify applicable reactants for each reaction template. In other words, we filter out all building blocks that do not match any reaction template; there is no need to keep them, as they cannot act as reactants. In a first step, we match all building blocks against each reaction template. In a second step, we save all matched building blocks and a collection of Reactions with their available building blocks.

    python scripts/01-filter-building-blocks.py \
        --building-blocks-file "data/assets/building-blocks/enamine-us-smiles.csv.gz" \
        --rxn-templates-file "data/assets/reaction-templates/hb.txt" \
        --output-bblock-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
        --output-rxns-collection-file "data/pre-process/building-blocks-rxns/rxns-hb-enamine-us.json.gz" --verbose
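
    Conceptually, the matching boils down to checking each building block against the reactant patterns of every template. A rough sketch of the idea with RDKit (not the actual script; the toy SMILES are placeholders):

    from rdkit import Chem
    from rdkit.Chem import AllChem

    # Compile each template once; hb.txt holds one reaction SMARTS per line.
    with open("data/assets/reaction-templates/hb.txt") as f:
        rxns = [AllChem.ReactionFromSmarts(line.strip()) for line in f if line.strip()]

    def matches_any_template(smiles: str) -> bool:
        """True if the molecule fits a reactant slot of at least one template."""
        mol = Chem.MolFromSmiles(smiles)
        return mol is not None and any(rxn.IsMoleculeReactant(mol) for rxn in rxns)

    # Toy examples; the real list comes from the CSV extracted in step 1.
    bblocks = ["CCO", "c1ccccc1N"]
    matched = [s for s in bblocks if matches_any_template(s)]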

    💡 All following steps use this matched building blocks <-> reaction template data. You have to specify the correct files for every script so that it can load the right data. It can save some time to store these paths as environment variables.

  3. Pre-compute embeddings

    We use the embedding space of the building blocks a lot. Hence, we pre-compute and store the embedding of each building block.

    python scripts/02-compute-embeddings.py \
        --building-blocks-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
        --output-file "data/pre-process/embeddings/hb-enamine-embeddings.npy" \
        --featurization-fct "fp_256"
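
    The name fp_256 suggests a 256-bit fingerprint. A sketch of what the pre-computation might look like, assuming a Morgan fingerprint of radius 2 (both the fingerprint type and the radius are assumptions):

    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import AllChem

    def fp_256(smiles: str) -> np.ndarray:
        """256-bit Morgan fingerprint as a dense float vector."""
        mol = Chem.MolFromSmiles(smiles)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=256)
        return np.array(fp, dtype=np.float32)

    # Toy examples; the real list comes from the filtered building-block file.
    bblocks = ["CCO", "c1ccccc1N"]
    embeddings = np.stack([fp_256(s) for s in bblocks])
    np.save("data/pre-process/embeddings/hb-enamine-embeddings.npy", embeddings)
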
  4. Generate synthetic trees

    Here we generate the data used for training the networks. The data is generated by randomly selecting building blocks, reaction templates, and directives to grow a synthetic tree.

    # Generate synthetic trees
    python scripts/03-generate-syntrees.py \
        --building-blocks-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
        --rxn-templates-file "data/assets/reaction-templates/hb.txt" \
        --output-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
        --number-syntrees "600000"

    In a second step, we filter the synthetic trees to make the data pharmaceutically more interesting. That is, trees whose root molecule has a QED >= 0.5 are always kept, while trees with a lower QED are kept only with probability QED/0.5, i.e., discarded with probability 1 - QED/0.5 (a sketch of this rule follows the command below).

    # Filter
    python scripts/04-filter-syntrees.py \
        --input-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
        --output-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
        --verbose
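
    A sketch of this acceptance rule, using RDKit's QED implementation:

    import random

    from rdkit import Chem
    from rdkit.Chem import QED

    THRESHOLD = 0.5

    def keep_tree(root_smiles: str) -> bool:
        """Always keep high-QED roots; keep low-QED roots with probability QED/0.5."""
        qed = QED.qed(Chem.MolFromSmiles(root_smiles))
        return qed >= THRESHOLD or random.random() < qed / THRESHOLD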

    Each synthetic tree is serializable and so we save all trees in a compressed .json file.

  5. Split synthetic trees into train, valid, and test data

    We load the .json file with all synthetic trees and split it straightforwardly into three files: {train,test,valid}.json. The default split ratio is 6:2:2.

    python scripts/05-split-syntrees.py \
            --input-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
            --output-dir "data/pre-process/syntrees/" --verbose
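
    The split itself is a plain shuffle-and-slice; roughly:

    import random

    trees = list(range(10))  # stand-in for the list of loaded synthetic trees
    random.shuffle(trees)

    n = len(trees)
    train = trees[: int(0.6 * n)]
    valid = trees[int(0.6 * n) : int(0.8 * n)]
    test = trees[int(0.8 * n) :]
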
  6. Featurization

    We featurize each synthetic tree. That is, we break down each tree into its iteration steps ("Add", "Expand", "Extend", "End") and featurize each step. This results in a "state" vector and a corresponding "super step" vector. We call it a "super step" here, as it contains the featurized data for all networks.

    python scripts/06-featurize-syntrees.py \
        --input-dir "data/pre-process/syntrees/" \
        --output-dir "data/featurized/" --verbose

    This script will load the {train,valid,test} data, featurize it, and save it in

    • <output-dir>/{train,valid,test}_states.npz and
    • <output-dir>/{train,valid,test}_steps.npz.

    The encoders for the molecules must be provided in the script. A short text summary of the encoders will be saved as well.
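
    To sanity-check the output, the arrays can be loaded back. A sketch assuming plain NumPy .npz archives; if the script stores sparse matrices instead, scipy.sparse.load_npz would be needed:

    import numpy as np

    # Row i of the states archive and row i of the steps archive describe the same iteration.
    for path in ("data/featurized/train_states.npz", "data/featurized/train_steps.npz"):
        with np.load(path) as archive:
            print(path, {name: arr.shape for name, arr in archive.items()})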

  7. Split features

    Up to this point, we have worked with each (featurized) synthetic tree as a whole; now we split it up into "consumable" input/output data for each of the four networks. This includes picking the right featurized data from the "super step" vector of the previous step.

    python scripts/07-split-data-for-networks.py \
        --input-dir "data/featurized/"

    This will create 24 new files (3 splits, 4 networks, X + y). All new files will be saved in <input-dir>/Xy.
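
    Conceptually, each network receives the state as its input X and its own slice of the "super step" vector as its target y. The column layout below is purely illustrative; the real offsets live in the script:

    import numpy as np

    states = np.random.rand(100, 128)  # stand-in for the loaded state matrix
    steps = np.random.rand(100, 300)   # stand-in for the loaded "super step" matrix

    # Hypothetical column layout for the four networks (offsets invented for illustration).
    slices = {
        "act": slice(0, 4),      # e.g. one-hot action
        "rt1": slice(4, 132),    # e.g. first-reactant embedding
        "rxn": slice(132, 223),  # e.g. one-hot reaction template
        "rt2": slice(223, 300),  # e.g. second-reactant embedding
    }
    for name, cols in slices.items():
        X, y = states, steps[:, cols]  # consumable input/output pair for one network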

  8. Train the networks

    Finally, we can train each of the four networks in src/synnet/models/ separately. For example:

    python src/synnet/models/act.py
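
    The remaining three networks are trained analogously. Assuming their scripts follow the same naming scheme as act.py:

    python src/synnet/models/rt1.py
    python src/synnet/models/rxn.py
    python src/synnet/models/rt2.py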

After training a new model, you can use it to make predictions and construct synthetic trees for a given set of molecules.

You can also perform molecular optimization using a genetic algorithm.

Please refer to the README.md for inference instructions.

Auxiliary Scripts

Visualizing trees

To visualize trees, there is a hacky script that represents Synthetic Trees as mermaid diagrams.

To demo it:

python src/synnet/visualize/visualizer.py

Still to be implemented: i) target molecule, ii) "end" action

To render the markdown file, including the diagram, directly in VS Code, install the extension vscode-markdown-mermaid and use the built-in markdown preview.

Info: If the images of the molecules do not load, edit and save the markdown file anywhere, for example by adding and deleting a character with the preview open. Not sure why this happens.