
ImageMol is a molecular image-based pre-training deep learning framework for computational drug discovery.


ImageMol: Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework


Abstract

The clinical efficacy and safety of a drug is determined by its molecular targets in the human proteome. However, proteome-wide evaluation of all compounds in human, or even animal models, is challenging. In this study, we present an unsupervised pre-training deep learning framework, termed ImageMol, from 10 million unlabeled drug-like molecules to predict molecular targets of candidate compounds. The ImageMol framework is designed to pretrain chemical representations from unlabeled molecular images based on local- and global-structural characteristics of molecules from pixels. We demonstrate high performance of ImageMol in evaluation of molecular properties (i.e., drug’s metabolism, brain penetration and toxicity) and molecular target profiles (i.e., human immunodeficiency virus) across 10 benchmark datasets. ImageMol shows high accuracy in identifying anti-SARS-CoV-2 molecules across 13 high-throughput experimental datasets from the National Center for Advancing Translational Sciences (NCATS) and we re-prioritized candidate clinical 3CL inhibitors for potential treatment of COVID-19. In summary, ImageMol is an active self-supervised image processing-based strategy that offers a powerful toolbox for computational drug discovery in a variety of human diseases, including COVID-19.

News!

[2022/09/17] Upload more benchmark datasets, including multi-label CYP450, kinases and KinomeScan.

[2022/07/28] Repository installation completed.

Install environment

1. GPU environment

CUDA 10.1

2. Create a new conda environment

conda create -n imagemol python=3.7.3

conda activate imagemol

3. Install the required packages

conda install -c rdkit rdkit

windows:

linux:

pip install torch-cluster torch-scatter torch-sparse torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.4.0%2Bcu101.html

pip install -r requirements.txt

source activate imagemol

Pretraining

1. Prepare the dataset

Download the pretraining data and put it into ./datasets/pretraining/data/

Preprocess the dataset:

python ./data_process/smiles2img_pretrain.py --dataroot ./datasets/pretraining/ --dataset data

Note: You can find the toy dataset in ./datasets/toy/pretraining/

2. Start pretraining

Usage:

usage: pretrain.py [-h] [--lr LR] [--wd WD] [--workers WORKERS]
                   [--val_workers VAL_WORKERS] [--epochs EPOCHS]
                   [--start_epoch START_EPOCH] [--batch BATCH]
                   [--momentum MOMENTUM] [--checkpoints CHECKPOINTS]
                   [--seed SEED] [--dataroot DATAROOT] [--dataset DATASET]
                   [--ckpt_dir CKPT_DIR] [--modelname {ResNet18}]
                   [--verbose] [--ngpu NGPU] [--gpu GPU] [--nc NC] [--ndf NDF]
                   [--imageSize IMAGESIZE] [--Jigsaw_lambda JIGSAW_LAMBDA]
                   [--cluster_lambda CLUSTER_LAMBDA]
                   [--constractive_lambda CONSTRACTIVE_LAMBDA]
                   [--matcher_lambda MATCHER_LAMBDA]
                   [--is_recover_training IS_RECOVER_TRAINING]
                   [--cl_mask_type {random_mask,rectangle_mask,mix_mask}]
                   [--cl_mask_shape_h CL_MASK_SHAPE_H]
                   [--cl_mask_shape_w CL_MASK_SHAPE_W]
                   [--cl_mask_ratio CL_MASK_RATIO]

Code to pretrain:

python pretrain.py --ckpt_dir ./ckpts/pretraining/ \
                   --checkpoints 1 \
                   --Jigsaw_lambda 1 \
                   --cluster_lambda 1 \
                   --constractive_lambda 1 \
                   --matcher_lambda 1 \
                   --is_recover_training 1 \
                   --batch 256 \
                   --dataroot ./datasets/pretraining/ \
                   --dataset data \
                   --gpu 0,1,2,3 \
                   --ngpu 4

For a quick test, you can pre-train ImageMol on the toy dataset using a single GPU:

python pretrain.py --ckpt_dir ./ckpts/pretraining-toy/ \
                   --checkpoints 1 \
                   --Jigsaw_lambda 1 \
                   --cluster_lambda 1 \
                   --constractive_lambda 1 \
                   --matcher_lambda 1 \
                   --is_recover_training 1 \
                   --batch 16 \
                   --dataroot ./datasets/toy/pretraining/ \
                   --dataset data \
                   --gpu 0 \
                   --ngpu 1
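
The four `*_lambda` flags in the commands above appear to weight the individual pretraining objectives. As a rough sketch, assuming the total loss is a plain weighted sum of the per-task losses (an assumption; this README does not spell out the formula, and `combine_losses` and the example loss values below are hypothetical):

```python
from typing import Dict


def combine_losses(losses: Dict[str, float], lambdas: Dict[str, float]) -> float:
    """Weighted sum of per-task losses; tasks without a weight contribute 0."""
    return sum(lambdas.get(name, 0.0) * value for name, value in losses.items())


# Hypothetical per-task loss values for a single batch (illustration only).
losses = {"jigsaw": 1.2, "cluster": 0.8, "contrastive": 0.5, "matcher": 0.3}

# Mirrors --Jigsaw_lambda 1 --cluster_lambda 1 --constractive_lambda 1 --matcher_lambda 1
lambdas = {"jigsaw": 1.0, "cluster": 1.0, "contrastive": 1.0, "matcher": 1.0}

total = combine_losses(losses, lambdas)
print(f"total loss: {total:.2f}")
```

Under this assumption, setting one of the weights to 0 would switch the corresponding objective off without changing the others.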

Finetuning

1. Download pre-trained ImageMol

You can download the pre-trained model and place it in the folder ckpts/

2. Finetune with pre-trained ImageMol

a) You can download the molecular property prediction datasets, CYP450 datasets, multi-label CYP450 dataset, SARS-CoV-2 datasets, kinases datasets and KinomeScan datasets and put them into datasets/finetuning/

b) The usage is as follows:

usage: finetune.py [-h] [--dataset DATASET] [--dataroot DATAROOT] [--gpu GPU]
                   [--workers WORKERS] [--lr LR] [--weight_decay WEIGHT_DECAY]
                   [--momentum MOMENTUM] [--seed SEED] [--runseed RUNSEED]
                   [--split {random,stratified,scaffold,random_scaffold,scaffold_balanced}]
                   [--epochs EPOCHS] [--start_epoch START_EPOCH]
                   [--batch BATCH] [--resume PATH] [--imageSize IMAGESIZE]
                   [--image_model IMAGE_MODEL] [--image_aug]
                   [--task_type {classification,regression}]
                   [--save_finetune_ckpt {0,1}] [--log_dir LOG_DIR]

c) You can run ImageMol with the following command:

python finetune.py --gpu ${gpu_no} \
                   --save_finetune_ckpt ${save_finetune_ckpt} \
                   --log_dir ${log_dir} \
                   --dataroot ${dataroot} \
                   --dataset ${dataset} \
                   --task_type ${task_type} \
                   --resume ${resume} \
                   --image_aug \
                   --lr ${lr} \
                   --batch ${batch} \
                   --epochs ${epoch}

For example:

python finetune.py --gpu 0 \
                   --save_finetune_ckpt 1 \
                   --log_dir ./logs/toxcast \
                   --dataroot ./datasets/finetuning/benchmarks \
                   --dataset toxcast \
                   --task_type classification \
                   --resume ./ckpts/ImageMol.pth.tar \
                   --image_aug \
                   --lr 0.5 \
                   --batch 64 \
                   --epochs 20

Note: You can tune more hyper-parameters during fine-tuning (see b) Usage).
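
The `--split` option in the usage above lists several strategies, including `scaffold`. Below is a minimal sketch of the general idea behind a scaffold split, assuming the commonly used scheme in which molecules sharing a scaffold always land in the same subset and larger scaffold groups fill the training set first. It is not taken from `finetune.py`: `scaffold_split`, the 80/10/10 fractions and the precomputed scaffold strings are illustrative assumptions (in practice the scaffolds would be computed with a cheminformatics toolkit such as RDKit).

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def scaffold_split(
    scaffolds: Sequence[str],
    frac_train: float = 0.8,
    frac_valid: float = 0.1,
) -> Tuple[List[int], List[int], List[int]]:
    """Split molecule indices so that no scaffold is shared across subsets."""
    # Group molecule indices by their scaffold string.
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties are broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = round(frac_train * n)
    valid_cutoff = round((frac_train + frac_valid) * n)

    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Ten hypothetical molecules described only by their scaffold strings.
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1", "c1ccoc1"]
train, valid, test = scaffold_split(scaffolds)
print(len(train), len(valid), len(test))
```

Because whole scaffold groups are assigned together, the test set contains only scaffolds unseen during training, which is generally a harder and more realistic setting than a random split.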

Finetuned models

To ensure the reproducibility of ImageMol, we provided finetuned models for eight datasets, including:

You can evaluate a finetuned model with the following command:

python evaluate.py --dataroot ${dataroot} \
                   --dataset ${dataset} \
                   --task_type ${task_type} \
                   --resume ${resume} \
                   --batch ${batch}

For example:

python evaluate.py --dataroot ./datasets/finetuning/benchmarks \
                   --dataset toxcast \
                   --task_type classification \
                   --resume ./toxcast.pth \
                   --batch 128
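
When evaluating several checkpoints in a row, it can be convenient to assemble the command above programmatically. The sketch below only builds the argument list from the flags shown above and prints it; `build_evaluate_cmd` is a hypothetical helper, and the check on `task_type` assumes `evaluate.py` accepts the same two choices listed in the `finetune.py` usage.

```python
import shlex
from typing import List


def build_evaluate_cmd(
    dataroot: str,
    dataset: str,
    task_type: str,
    resume: str,
    batch: int = 128,
) -> List[str]:
    """Return the argument list for one evaluate.py run (nothing is executed)."""
    # Assumption: same choices as --task_type in the finetune.py usage.
    if task_type not in ("classification", "regression"):
        raise ValueError(f"unsupported task_type: {task_type!r}")
    return [
        "python", "evaluate.py",
        "--dataroot", dataroot,
        "--dataset", dataset,
        "--task_type", task_type,
        "--resume", resume,
        "--batch", str(batch),
    ]


cmd = build_evaluate_cmd(
    dataroot="./datasets/finetuning/benchmarks",
    dataset="toxcast",
    task_type="classification",
    resume="./toxcast.pth",
)
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems if a path contains spaces.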

GradCAM Visualization

More GradCAM heatmaps can be found at this link: https://drive.google.com/file/d/1uu3Q6WLz8bJqcDaHEG84o3mFvemHoA2v/view?usp=sharing

To make high-confidence regions in the GradCAM heatmaps easier to see, we apply a confidence threshold that filters out lower-confidence regions. The filtered heatmaps can be found at this link: https://drive.google.com/file/d/1631kSSiM_FSRBBkfh7PwI5p3LGqYYpMc/view?usp=sharing

Run script

We also provide a script to generate GradCAM heatmaps:

usage: main.py [-h] [--image_model IMAGE_MODEL] --resume PATH --img_path
               IMG_PATH --gradcam_save_path GRADCAM_SAVE_PATH
               [--thresh THRESH]

You can run the following script:

python main.py --resume ${resume} \
               --img_path ${img_path} \
               --gradcam_save_path ${gradcam_save_path} \
               --thresh ${thresh}
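
The `--thresh` option relates to the confidence filtering described above. Here is a minimal sketch of such filtering, assuming the heatmap is a 2D grid of values in [0, 1] and that values below the threshold are simply zeroed out. `filter_heatmap` is a hypothetical helper for illustration, not the code used in `main.py`.

```python
from typing import List


def filter_heatmap(heatmap: List[List[float]], thresh: float) -> List[List[float]]:
    """Keep values at or above thresh; set everything else to 0.0."""
    return [[v if v >= thresh else 0.0 for v in row] for row in heatmap]


# Hypothetical 2x3 heatmap with activations in [0, 1].
heatmap = [
    [0.1, 0.6, 0.9],
    [0.4, 0.8, 0.2],
]
filtered = filter_heatmap(heatmap, thresh=0.5)
for row in filtered:
    print(row)
```

A higher threshold leaves fewer highlighted regions, which makes the strongest activations stand out.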

Process your own dataset

If you want to process your own dataset and obtain molecular images, use the following steps:

  1. Preprocessing SMILES: use the method preprocess_list(smiles) of this link to process your raw SMILES data;
  2. Transforming SMILES to images: convert the canonical SMILES to molecular images using dataloader.image_dataloader.Smiles2Img(smis, size=224, savePath=None)

Reference

If you use ImageMol in scholarly publications, presentations or to communicate with your satellite, please cite the following work that presents the algorithms used:

Zeng, X., Xiang, H., Yu, L. et al. Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework. Nat Mach Intell 4, 1004–1016 (2022). https://doi.org/10.1038/s42256-022-00557-6

@article{zeng2022accurate,
  title={Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework},
  author={Zeng, Xiangxiang and Xiang, Hongxin and Yu, Linhui and Wang, Jianmin and Li, Kenli and Nussinov, Ruth and Cheng, Feixiong},
  journal={Nature Machine Intelligence},
  volume={4},
  pages={1004--1016},
  year={2022},
  doi={10.1038/s42256-022-00557-6}
}

If you additionally want to cite this code package, please cite as follows:

@software{hongxinxiang_2022_7088986,
  author       = {HongxinXiang},
  title        = {HongxinXiang/ImageMol: v1.0.0},
  month        = sep,
  year         = 2022,
  publisher    = {Zenodo},
  version      = {v1.0.0},
  doi          = {10.5281/zenodo.7088986},
  url          = {https://doi.org/10.5281/zenodo.7088986}
}
