PCMol

License: MIT

A multi-target model for de novo molecule generation. By using the internal protein representations of the AlphaFold[1] model, a single SMILES-based transformer can generate relevant molecules for thousands of protein targets (embeddings are available for 4,331 proteins).

The model was trained using bioactivity data from the Papyrus[2] dataset (661,613 unique protein-ligand pairs in total, 6,249,253 after augmentation).




Paper & Authors

The preprint is available on ChemRxiv:

https://chemrxiv.org/engage/chemrxiv/article-details/65d47632e9ebbb4db9c63988




Installation

1. Setup script (recommended)

The setup script will install the required dependencies and download the pretrained model.

git clone https://github.com/CDDLeiden/pcmol.git && cd pcmol
chmod +x setup.sh
bash setup.sh

2. Conda (alternative)

The conda route requires the user to download the pretrained model manually (link below).

# Setting up a fresh conda environment
git clone https://github.com/CDDLeiden/pcmol.git && cd pcmol
conda env create -f environment.yml && conda activate pcmol
python -m pip install -e .

Pretrained model

When not using the setup script, the pretrained model must be downloaded manually, using the download link or its mirror given in the repository README. It should then be placed in the .../pcmol/data/models folder.
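After a manual install, a quick sanity check can catch a missing package or an empty models folder before the first generation run. The helper below is only a sketch: `check_install` is a hypothetical name, not part of pcmol, and the models folder is passed in explicitly because its full path depends on where the repository was cloned.

```python
from importlib.util import find_spec
from pathlib import Path


def check_install(models_dir):
    """Return a list of problems found; an empty list means the basic checks passed.

    Hypothetical helper, not part of pcmol. `models_dir` is the local path of
    the pcmol/data/models folder in your clone.
    """
    problems = []
    # The package should be importable after `python -m pip install -e .`
    if find_spec("pcmol") is None:
        problems.append("package 'pcmol' is not importable in this environment")
    models_dir = Path(models_dir)
    if not models_dir.is_dir():
        problems.append(f"models folder not found: {models_dir}")
    elif not any(models_dir.iterdir()):
        problems.append(f"models folder is empty: {models_dir}")
    return problems
```

The check only confirms that something is present in the folder. It does not verify that the files are the expected pretrained weights.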


Generating molecules for a particular target

1. Using a script (conda route)

# Run the model on a single target using Accession ID (generates 10 SMILES strings)
conda activate pcmol
python pcmol/generate.py --target P29275

# If GPU is not available
python pcmol/generate.py --target P29275 --device cpu

If AlphaFold2 embeddings are available for the target, they are downloaded automatically and used as input to the model. The generated molecules are saved in the data/results folder.
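To run the script for several targets in a row, the command above can be wrapped in a small loop. The following is a minimal sketch, assuming it is started from the repository root with the pcmol environment active. `build_command` and `generate_for_targets` are hypothetical helper names, and only the `--target` and `--device` flags shown above are used.

```python
import subprocess
import sys


def build_command(target, device=None):
    """Build the generate.py command line for one accession ID."""
    cmd = [sys.executable, "pcmol/generate.py", "--target", target]
    if device is not None:
        # e.g. device="cpu" when no GPU is available
        cmd += ["--device", device]
    return cmd


def generate_for_targets(targets, device=None, dry_run=False):
    """Run generate.py once per target and return the commands that were built.

    With dry_run=True the commands are only constructed, not executed.
    """
    commands = [build_command(t, device) for t in targets]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if a run fails
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Extend this list with accession IDs taken from data/targets.txt
    generate_for_targets(["P29275"], device="cpu")
```

Each target is handled by a separate process, so the model is reloaded every time. For many targets, calling the generator from Python directly is likely to be faster.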

2. Calling the generator directly

To generate molecules for a particular target from Python, the Runner class can be used directly. The targetted_generation method returns a list of SMILES strings for a target protein specified by its Accession ID.

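from pcmol import Runner

# Load the pretrained "XL" model
model = Runner(model="XL")
# Generate 100 SMILES strings for the target with Accession ID P29275
smiles = model.targetted_generation(target="P29275", num_mols=100)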
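Since the generator returns a plain list of SMILES strings, the output can be post-processed with ordinary Python. The sketch below removes exact duplicate strings and writes one SMILES per line. `dedupe_smiles` and `save_smiles` are hypothetical helpers, not part of pcmol, and the output path in the usage comment is only an example.

```python
from pathlib import Path


def dedupe_smiles(smiles):
    """Drop empty entries and exact duplicate strings, keeping first-seen order."""
    seen = set()
    unique = []
    for s in smiles:
        s = s.strip()
        if s and s not in seen:
            seen.add(s)
            unique.append(s)
    return unique


def save_smiles(smiles, path):
    """Write the unique SMILES to `path`, one per line, and return how many were written."""
    unique = dedupe_smiles(smiles)
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(s + "\n" for s in unique), encoding="utf-8")
    return len(unique)


# Usage with the generator (requires pcmol and the pretrained model):
#   from pcmol import Runner
#   model = Runner(model="XL")
#   smiles = model.targetted_generation(target="P29275", num_mols=100)
#   save_smiles(smiles, "my_results/P29275.smi")
```

The comparison is purely string-based, so two different SMILES strings that encode the same molecule are both kept. Canonicalising with a cheminformatics toolkit first would be needed to remove those.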

List of supported protein targets

The model currently depends on the availability of AlphaFold2 embeddings for the target protein. The list of supported targets can be found in the data/targets.txt file.
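Before starting a run, it can be useful to confirm that an accession ID appears in the supported list. This is a minimal sketch that assumes data/targets.txt is a plain-text file with one accession ID at the start of each line. That layout is an assumption and is not documented here. `load_supported_targets` and `is_supported` are hypothetical helper names.

```python
from pathlib import Path


def load_supported_targets(path="data/targets.txt"):
    """Read accession IDs into a set.

    Assumes one ID per line; only the first whitespace-separated token
    of each non-empty line is used.
    """
    text = Path(path).read_text(encoding="utf-8")
    return {line.split()[0] for line in text.splitlines() if line.strip()}


def is_supported(accession, targets):
    """Return True if the accession ID is in the loaded target set."""
    return accession.strip() in targets


# Usage from the repository root:
#   targets = load_supported_targets()
#   if not is_supported("P29275", targets):
#       raise SystemExit("No AlphaFold2 embeddings listed for this target")
```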


References

[1]: Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589.

[2]: Béquignon, O. J., Bongers, B. J., Jespers, W., IJzerman, A. P., van der Water, B., & van Westen, G. J. (2023). Papyrus: a large-scale curated dataset aimed at bioactivity predictions. Journal of Cheminformatics, 15(1), 3.
