Scalable and Provably Accurate Algorithms for Differentially Private Distributed Decision Tree Learning
K. W. Wang, T. Dick, and M.-F. Balcan. Scalable and provably accurate algorithms for differentially private distributed decision tree learning. In AAAI Workshop on Privacy-Preserving Artificial Intelligence, 2020
@inproceedings{Wang2020ScalableAP,
title={Scalable and provably accurate algorithms for differentially private distributed decision tree learning},
author={Kai Wen Wang and Travis Dick and Maria-Florina Balcan},
booktitle={AAAI Workshop on Privacy-Preserving Artificial Intelligence},
year={2020}
}
The DP-TopDown algorithm is implemented in C++, so we need GCC
and CMake
to
compile it.
First make sure GCC v7.4
is installed. Run
sudo apt update && sudo apt install -y build-essential
Then, install CMake
(at least version 3.12).
Download, unzip, and cd into the folder.
Then run
./bootstrap && make -j10
sudo make install
For reference:
Run the following commands (taken from the link).
Note: the make commands can take a while.
sudo apt-get install -y autoconf automake libtool curl make g++ unzip
git clone --branch v3.6.1 --recursive https://github.com/protocolbuffers/protobuf.git
cd protobuf
./autogen.sh
./configure
make -j4
make check -j4
sudo make install -j4
sudo ldconfig
To build, simply run the following command.
mkdir build && cd build
cmake .. && make -j10
If this is the first time building, the build can fail because protobuf headers are out of date. To update the headers, run
For C++:
protoc -I=./data/ --cpp_out=third_party/protobuf/ ./data/dataset.proto
For Python:
protoc -I=./data/ --python_out=python/ ./data/dataset.proto
You may need to install protobuf
for Python.
/cpp/single_run.cpp
is our main script.
To parallelize the process, we used AWS Batch.
It takes parameters via environment variables, as described below.
ENVVAR | DEFAULT VALUE |
---|---|
SEED | 42 |
TRAINING_FRACTION | 1 |
LEAF_PRIVACY_FRACTION | 0.5 |
DATASET | adult |
BUDGET_FN | decay |
We ran our experiments with a docker image in AWS Batch with Dockerfile in
aws/Dockerfile
which calls the script aws/run.sh
. In the script, the seed
environment variable is set to be AWS_BATCH_JOB_ARRAY_INDEX
, automatically
set by AWS Batch. In addition, we must specify an additional environment variable
OUT_DIRECTORY
to store the outputted .csv
files.
There is an optional parameter START_INDEX
which offsets the seed,
since AWS_BATCH_JOB_ARRAY_INDEX
ranges from 0 to ARRAY_SIZE
.
To run a bunch of experiments via a script, we use aws/submit_job.py
.
We can also run cpp/single_run.cpp
locally. An example test run would be
SEED=42 TRAINING_FRACTION=1 LEAF_PRIVACY_FRACTION=0.5 DATASET=adult BUDGET_FN=decay ./single_run
This run would sweep across all values of alpha and all private split algorithms.
Our paper includes experiments on the following datasets.
After preprocessing, our datasets are in data.zip
.
To train and test on a new dataset named X
, do the following steps:
- Create protobufs
X_train
andX_test
with save functions incpp/utils.h
orpython/utils.py
- Include new protobufs into
./data
- Define new splitting class in
/cpp/split.h
and run scripts
Format C++ files with
clang-format -i path_to_cpp_file
and format Python files with
black path_to_python_file