
QSARMPC_DTIMPC

Secure multiparty computation for privacy-preserving drug discovery.

Install

  1. Clone this repository.
git clone https://github.com/rongma6/QSARMPC_DTIMPC.git
cd QSARMPC_DTIMPC
  2. Download the PrivPy framework (Li and Xu, 2019) archive bazel-bin.tar.gz, place it under the QSARMPC_DTIMPC directory, and unpack it.
tar -xzvf bazel-bin.tar.gz
  3. Install the dependencies. The system is based on Python 3.6 and requires the following packages. On Ubuntu 18:
sudo apt-get install python3.6 python3-pip
# replace the default python with python 3.6
sudo rm -rf /usr/bin/python
sudo ln -s /usr/bin/python3.6 /usr/bin/python

sudo apt-get install libgmp3-dev libmpfr-dev libmpc-dev build-essential
sudo pip3 install numpy sklearn pandas absl-py pycryptodomex
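
To quickly check that the dependencies are importable (an optional sanity check we suggest here; note that pycryptodomex is imported as Cryptodome and absl-py as absl):

python -c "import numpy, sklearn, pandas, absl, Cryptodome; print('dependencies OK')"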

Datasets

For both the QSAR and DTI prediction tasks, the datasets for the MPC algorithms are in mydata. We provide both a full dataset for a typical experiment and a toy dataset for a quick start. Below we describe how the datasets were prepared, as a reference for generating new ones.

For QSAR prediction, the full dataset is generated by preprocessing the METAB dataset from the Kaggle competition (Ma et al., 2015). The original dataset can be downloaded from https://pubs.acs.org/doi/abs/10.1021/ci500747n (ci500747n_si_002.zip in the Supporting Information section). The preprocessing code is in prep/QSAR_prep.py.
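
As a rough illustration of the kind of preprocessing involved (the authoritative logic is in prep/QSAR_prep.py), the following hypothetical sketch loads a METAB-style CSV and saves feature and activity arrays. The assumed layout (a MOLECULE id column, an Act activity column, and descriptor columns) and the file name are illustrative and may not match the actual files:

import numpy as np
import pandas as pd

# Hypothetical sketch only; see prep/QSAR_prep.py for the real preprocessing.
df = pd.read_csv("METAB_training.csv")  # hypothetical file name
y = df["Act"].to_numpy(dtype=np.float64)  # bioactivity labels (assumed column)
X = df.drop(columns=["MOLECULE", "Act"]).to_numpy(dtype=np.float64)  # descriptors
np.save("y_train.npy", y)
np.save("X_train.npy", X)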

For DTI prediction, the full dataset is generated by preprocessing the dataset from DTINet (Luo et al., 2017). The preprocessing includes publicly computing protein features (prep/DTI_public.py), locally computing 1024-bit fingerprint vectors of drugs (prep/DTI_local.py), and randomly splitting the samples into training and test sets for evaluation (prep/DTI_valid.py). In more detail, mydata/DTI_full/data_luo/mat_drug_disease.txt is the same file as mat_drug_disease.txt in the DTINet dataset (Luo et al., 2017). mydata/DTI_full/data_prep/public_protein_feature_800.npy contains the public protein features with dimension 800, generated by

python prep/DTI_public.py 800 20 0.5

where 20 is the maximum number of iterations of the random walk with restart (RWR) and 0.5 is the restart probability (a plaintext sketch of the RWR update is given at the end of this section). mydata/DTI_full/data_prep/finger_rdkit_1024.npy contains the local 1024-bit fingerprint vectors of the drugs, generated by

python prep/DTI_local.py 1024
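
The exact fingerprinting settings live in prep/DTI_local.py; for reference, a 1024-bit bit-vector fingerprint per drug is commonly computed with RDKit along these lines (a sketch assuming Morgan fingerprints and placeholder SMILES, which may differ from the script's actual choices):

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

# Sketch: one 1024-bit fingerprint per drug; prep/DTI_local.py may use a
# different fingerprint type, radius, or input format.
smiles_list = ["CCO", "c1ccccc1O"]  # placeholder drug structures
fps = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
    fps.append(np.array([int(b) for b in bv.ToBitString()], dtype=np.int8))
np.save("finger_rdkit_1024.npy", np.stack(fps))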

Files in mydata/DTI_full/trial1/ are the training and test datasets, generated by

python prep/DTI_valid.py 1 0 dense

where one can change the random seed, the fold id, and the format of the training dataset. The suffix _train_dense denotes the training dataset, a dense matrix with ones for the positive samples and zeros elsewhere. Note that only the positive samples affect the training of DTIMPC, so set the format of the training dataset to dense. If other algorithms require negative samples, setting the format to sparse provides the same number of negative samples as positive samples. _test1basic corresponds to the test dataset with a 1:1 ratio of positive to negative samples; _test1basic and _test9extra together correspond to the test dataset with a 1:10 ratio; _test1basic and _testallextra together correspond to the test dataset with all samples. In all three settings, the test dataset is disjoint from the training dataset. Here we show the whole pipeline as if the data came from a single entity; in practice, the private data are owned by different clients.
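
Here is the plaintext sketch of the RWR update mentioned above, with the iteration cap and restart probability matching the prep/DTI_public.py example. The privacy-preserving version presumably performs an analogous update on private data; the matrix A below is a hypothetical similarity network, not the repository's data:

import numpy as np

def rwr(A, restart_prob=0.5, max_iter=20, tol=1e-6):
    # Column-normalize the similarity matrix into transition probabilities.
    W = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)
    n = A.shape[0]
    P0 = np.eye(n)  # one restart distribution per node
    P = P0.copy()
    for _ in range(max_iter):
        P_next = (1 - restart_prob) * (W @ P) + restart_prob * P0
        if np.abs(P_next - P).max() < tol:  # early stop on convergence
            return P_next
        P = P_next
    return P

A = np.random.rand(5, 5)
A = (A + A.T) / 2  # hypothetical symmetric similarity network
features = rwr(A)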

Set configuration files

The data directory and the hyperparameters are set in configuration files. For example, to set the data directory to mydata/DTI_toy/ and the maximum number of iterations of privacy-preserving RWR for drugs to 20:

data_dir mydata/DTI_toy/
maxiterd 20

The configuration files for QSARMPC and DTIMPC are conf/QSAR.conf and conf/DTI.conf, respectively. We provide an example configuration file for each dataset; you can copy them and use them directly. For example,

cp conf/QSAR_toy.conf conf/QSAR.conf
cp conf/DTI_toy.conf conf/DTI.conf

Alternatively, you can write your own configuration files.
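
For reference, files in this whitespace-separated key value format can be parsed in Python roughly as follows (an illustrative sketch, not the repository's actual parser):

def read_conf(path):
    # Parse "key value" lines; dicts preserve insertion order in Python 3.6+,
    # which matters for the QSAR task (see the tips below).
    conf = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            key, value = line.split(maxsplit=1)
            conf[key] = value
    return conf

conf = read_conf("conf/DTI.conf")
print(conf["data_dir"], conf["maxiterd"])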

IMPORTANT TIPS:

  1. Always set configuration files conf/QSAR.conf and conf/DTI.conf BEFORE you run the MPC algorithms.
  2. The configuration files must include ALL items shown in the example files. For the QSAR task, the items in conf/QSAR.conf must also appear IN ORDER.

Run

Open two terminal windows.

  1. Run ./bazel-bin/run in the first window.
  2. After about 10 seconds, run ./bazel-bin/client in the second window. Follow the prompts to run the corresponding model. For example,
What do you want to run? DTI or QSAR: DTI

IMPORTANT TIPS:

  1. To stop the processes, use Ctrl + C and kill any remaining processes.
  2. The experiments on the full datasets require a large amount of memory; we have tested on a machine with 96 GB of memory.
  3. The PrivPy framework requires an Ubuntu 18 environment.

Check the results

The results are printed during the run; you can also check them in the result directory afterwards.

For the DTI task, the predicted DTI scores will be in result/Re.txt, and the AUPR and AUROC for the three settings (i.e., 1:1 positive to negative samples, 1:10 positive to negative samples, and all samples) will be in result/metrics.txt.

For the QSAR task, the predicted bioactivities for the test data will be in result/ypred_result.txt and the squared Pearson correlation coefficient will be in result/r2_result.txt.
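
If you want to recompute the reported metrics independently, they follow standard definitions. A sketch with scikit-learn and NumPy, using placeholder arrays (in practice you would parse the prediction files and the corresponding test labels, whose exact layout is defined by the code):

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Placeholder arrays standing in for parsed predictions and labels.
y_true = np.array([1, 0, 1, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3])
aupr = average_precision_score(y_true, y_score)  # AUPR (DTI task)
auroc = roc_auc_score(y_true, y_score)           # AUROC (DTI task)

# Squared Pearson correlation coefficient (QSAR task).
y_meas = np.array([1.0, 0.9, 1.2, 0.8, 1.1])
y_pred = np.array([1.2, 0.8, 1.1, 0.9, 1.0])
r2 = np.corrcoef(y_meas, y_pred)[0, 1] ** 2

print(aupr, auroc, r2)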

Contacts

If you have any questions or comments, please feel free to email Rong Ma (ma-r17@mails.tsinghua.edu.cn) and/or Jianyang Zeng (zengjy321@tsinghua.edu.cn).