Secure multiparty computation for privacy-preserving drug discovery.
- Clone this repository.
git clone https://github.com/rongma6/QSARMPC_DTIMPC.git
cd QSARMPC_DTIMPC
- Download the PrivPy framework (Li and Xu, 2019) bazel-bin.tar.gz, put it under the QSARMPC_DTIMPC directory and unpack it.
tar -xzvf bazel-bin.tar.gz
- Install the dependencies. The system is built on Python 3.6 and requires the packages below. On Ubuntu 18:
sudo apt-get install python3.6 python3-pip
# replace the default python with python 3.6
sudo rm -rf /usr/bin/python
sudo ln -s /usr/bin/python3.6 /usr/bin/python
sudo apt-get install libgmp3-dev libmpfr-dev libmpc-dev build-essential
sudo pip3 install numpy scikit-learn pandas absl-py pycryptodomex
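After installation, it can help to confirm that the Python packages are visible to the interpreter that `python` now points to. The following is a minimal sketch, not part of the repository; the module list is an assumption derived from the pip packages above, using their import names.

```python
import importlib.util

# Import names for the pip packages listed above (an assumption of this sketch).
# scikit-learn is imported as "sklearn", absl-py as "absl",
# and pycryptodomex as "Cryptodome".
REQUIRED_MODULES = ["numpy", "sklearn", "pandas", "absl", "Cryptodome"]


def missing_modules(names):
    """Return the subset of `names` that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

The check only locates the modules without importing them, so it does not verify versions or native libraries such as GMP.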
For both QSAR and DTI prediction tasks, the datasets for the MPC algorithms are in mydata. We provide both a full dataset for a typical experiment and a toy dataset for a quick start. Below we describe how the datasets were prepared, as a reference for generating new ones.
For QSAR prediction, the full dataset is generated by preprocessing the METAB dataset from the Kaggle competition (Ma et al., 2015). The original datasets can be downloaded from https://pubs.acs.org/doi/abs/10.1021/ci500747n (ci500747n_si_002.zip in the Supporting Information section). The preprocessing code is in prep/QSAR_prep.py.
For DTI prediction, the full dataset is generated by preprocessing the dataset from DTINet (Luo et al., 2017). The preprocessing consists of publicly computing protein features (prep/DTI_public.py), locally computing 1024-bit fingerprint vectors of drugs (prep/DTI_local.py), and randomly splitting the samples into training and test sets for evaluation (prep/DTI_valid.py).
In more detail, mydata/DTI_full/data_luo/mat_drug_disease.txt is the same file as mat_drug_disease.txt in the DTINet dataset (Luo et al., 2017).
mydata/DTI_full/data_prep/public_protein_feature_800.npy contains the public protein features, of dimension 800, generated by
python prep/DTI_public.py 800 20 0.5
where 800 is the feature dimension, 20 is the maximum number of iterations of random walk with restart (RWR), and 0.5 is the RWR restart probability.
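To make the two RWR parameters concrete, here is a minimal sketch of a generic random walk with restart on a small weighted graph. It is an illustration of the standard update rule only, not the code in prep/DTI_public.py; the function name, the tolerance `tol`, and the list-of-lists input are assumptions, and the step that turns such diffusion states into the final protein features is not covered.

```python
def random_walk_with_restart(adjacency, restart_prob=0.5, max_iter=20, tol=1e-8):
    """Return one diffusion-state row per node of a weighted graph.

    adjacency    : square list of lists with non-negative edge weights
    restart_prob : probability of jumping back to the start node at each step
    max_iter     : upper bound on the number of update steps
    tol          : stop early once the L1 change falls below this (assumption)
    """
    n = len(adjacency)
    # Row-normalize the adjacency matrix into a transition matrix.
    # Nodes without edges keep an all-zero row, so their walked mass is lost.
    transition = []
    for row in adjacency:
        total = sum(row)
        transition.append([w / total if total > 0 else 0.0 for w in row])

    states = []
    for start in range(n):
        restart = [1.0 if j == start else 0.0 for j in range(n)]
        state = restart[:]
        for _ in range(max_iter):
            # Walk along an edge with probability (1 - r), restart with probability r.
            walked = [sum(state[i] * transition[i][j] for i in range(n))
                      for j in range(n)]
            new_state = [(1 - restart_prob) * walked[j] + restart_prob * restart[j]
                         for j in range(n)]
            change = sum(abs(a - b) for a, b in zip(new_state, state))
            state = new_state
            if change < tol:
                break
        states.append(state)
    return states
```

A larger restart probability keeps each state concentrated near its start node, while a larger iteration limit lets the state get closer to its stationary value.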
mydata/DTI_full/data_prep/finger_rdkit_1024.npy contains the local 1024-bit fingerprint vectors of drugs, generated by
python prep/DTI_local.py 1024
The files in mydata/DTI_full/trial1/ are the training and test datasets, generated by
python prep/DTI_valid.py 1 0 dense
where one can change the random seed, the fold id, and the format of the training dataset. _train_dense is the training dataset, a dense matrix with ones for the positive samples and zeros elsewhere. Only the positive samples affect the training process of DTIMPC, so set the format of the training dataset to dense when running DTIMPC. For other algorithms that require negative samples, setting the format to sparse provides as many negative samples as positive samples.
_test1basic is the test dataset with a 1:1 ratio of positive to negative samples; _test1basic and _test9extra together form the test dataset with a 1:10 ratio; _test1basic and _testallextra together form the test dataset with all samples. In all three settings, the test dataset shares no samples with the training dataset.
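The difference between the dense and sparse training formats can be illustrated with a short sketch that keeps every positive entry of a dense 0/1 matrix and draws an equal number of negatives at random. This is not the logic of prep/DTI_valid.py; the function name and the (row, column, label) triple layout are assumptions made for illustration.

```python
import random


def dense_to_sparse_samples(matrix, seed=0):
    """Turn a dense 0/1 interaction matrix into (row, col, label) samples.

    All positive entries are kept, and an equal number of zero entries is
    drawn at random as negatives (or all zeros, if fewer zeros than ones exist).
    """
    positives, zeros = [], []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            (positives if value == 1 else zeros).append((i, j))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    negatives = rng.sample(zeros, min(len(positives), len(zeros)))
    return ([(i, j, 1) for i, j in positives] +
            [(i, j, 0) for i, j in negatives])
```

Because drug-target matrices are mostly zeros, sampling negatives this way yields a balanced set that is far smaller than the dense matrix.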
Here, we show the whole pipeline as if all the data came from one entity. In practice, the private data are owned by different clients.
The data directory and the hyperparameters are set in configuration files. For example, to set the running data directory to mydata/DTI_toy/ and the maximum number of iterations in privacy-preserving RWR for drugs to 20:
data_dir mydata/DTI_toy/
maxiterd 20
The configuration files for QSARMPC and DTIMPC are conf/QSAR.conf and conf/DTI.conf, respectively. We provide an example configuration file for each dataset, which you can copy and use directly. For example,
cp conf/QSAR_toy.conf conf/QSAR.conf
cp conf/DTI_toy.conf conf/DTI.conf
Alternatively, you can edit the configuration files to suit your own setup.
IMPORTANT TIPS:
- Always set the configuration files conf/QSAR.conf and conf/DTI.conf BEFORE you run the MPC algorithms.
- The configuration files should include ALL items that appear in the example configuration files. For the QSAR task, conf/QSAR.conf should include all these items IN ORDER.
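One way to guard against a missing or misordered item is to compare the keys of an edited configuration file against those of the provided example. The sketch below assumes the whitespace-separated "key value" line format seen in the data_dir and maxiterd example; whether the real files allow comments or other syntax is not known, and both function names are hypothetical.

```python
def parse_conf(text):
    """Parse 'key value' lines into an ordered list of (key, value) pairs."""
    items = []
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if not parts:
            continue  # skip blank lines
        value = parts[1].strip() if len(parts) > 1 else ""
        items.append((parts[0], value))
    return items


def check_against_example(conf_text, example_text):
    """Return a list of problems; an empty list means the keys match in order."""
    keys = [k for k, _ in parse_conf(conf_text)]
    expected = [k for k, _ in parse_conf(example_text)]
    problems = ["missing item: " + k for k in expected if k not in keys]
    if not problems and keys != expected:
        problems.append("items differ from the example in order or count")
    return problems
```

For instance, reading conf/QSAR.conf and conf/QSAR_toy.conf as text and passing them to check_against_example would report any item that was dropped or moved while editing.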
Open two terminal windows.
- Run ./bazel-bin/run in the first window.
- After about 10 seconds, run ./bazel-bin/client in the second window. Follow the prompts to run the corresponding model. For example,
What do you want to run? DTI or QSAR: DTI
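For unattended runs, the two-window procedure could be approximated from a single script that starts the first program, waits, and then runs the second in the foreground. This is only a sketch under the assumption that both binaries behave the same when launched this way; the function name is hypothetical and the repository does not provide such a wrapper.

```python
import subprocess
import time


def run_server_then_client(server_cmd, client_cmd, delay_seconds=10.0):
    """Start the server, wait, run the client in the foreground, stop the server.

    Returns the client's exit code.
    """
    server = subprocess.Popen(server_cmd)
    try:
        time.sleep(delay_seconds)  # give the server time to start up
        # The client inherits this terminal, so its prompts can be answered as usual.
        return subprocess.run(client_cmd).returncode
    finally:
        server.terminate()
        try:
            server.wait(timeout=5)
        except subprocess.TimeoutExpired:
            server.kill()  # fall back to a hard kill if it ignores the signal
            server.wait()


# Hypothetical usage from the repository root:
# run_server_then_client(["./bazel-bin/run"], ["./bazel-bin/client"])
```

The sketch only stops the process it started itself, so any helper processes spawned by the framework may still need to be cleaned up manually.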
IMPORTANT TIPS:
- To stop the process, use the Ctrl + C and make kill commands.
- The experiments on the full datasets require a large amount of memory. We have tested on a machine with 96 GB of memory.
- The PrivPy framework requires the Ubuntu 18 environment.
The results are printed while the program runs. You can also check them in the result directory afterwards.
For the DTI task, the predicted DTI scores are written to result/Re.txt, and the AUPR and AUROC for the three settings (i.e., 1:1 positive to negative samples, 1:10 positive to negative samples, and all samples) are written to result/metrics.txt.
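As a reminder of what the two reported metrics measure, the following sketch computes AUROC and a step-wise AUPR (average precision) from labels and scores held in plain lists. How the repository computes AUPR is not stated, so average precision is used here as one common estimator and may differ slightly from a trapezoidal estimate; the layout of the result files is not assumed.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5. Assumes at least one positive and one negative label.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    Tied scores are ranked in input order. Assumes at least one positive label.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / hits
```

AUROC is insensitive to the ratio of positives to negatives, whereas AUPR generally drops as negatives are added, which is why the three test settings are reported separately.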
For the QSAR task, the predicted bioactivities for the test data are written to result/ypred_result.txt, and the squared Pearson correlation coefficient is written to result/r2_result.txt.
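The squared Pearson correlation coefficient can be reproduced from true and predicted values with a few lines. This is a minimal sketch on plain lists, not the repository's evaluation code; the function name is hypothetical and the input is assumed to be two equal-length, non-constant sequences.

```python
def squared_pearson(y_true, y_pred):
    """Return the square of the Pearson correlation between two sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    # The 1/n factors cancel, so unnormalized sums are sufficient.
    return cov * cov / (var_t * var_p)
```

Note that this quantity measures linear association only: predictions that are perfectly anti-correlated or off by a constant scale still score 1.0, unlike the coefficient of determination.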
If you have any questions or comments, please feel free to email Rong Ma (ma-r17@mails.tsinghua.edu.cn) and/or Jianyang Zeng (zengjy321@tsinghua.edu.cn).