dblink-experiments

This repository contains scripts, data and documentation for reproducing experiments in the following paper:

Marchant, N. G., Kaplan, A., Elazar, D. N., Rubinstein, B. I. P. and Steorts, R. C. (2021). d-blink: Distributed End-to-End Bayesian Entity Resolution. Journal of Computational and Graphical Statistics, 30(2), 406–421. DOI: 10.1080/10618600.2020.1825451 arXiv: 1909.06039.

which accompanies our dblink Spark package.

The workflow for reproducing the experiments involves:

Obtaining the five data sets. See the section below for details.
Obtaining/building the dblink Spark package. Instructions are provided in the dblink repository.
Running the experiments on a Spark cluster. We tested dblink on a local server in pseudocluster mode (see local directory) and on a Spark cluster in AWS (see aws directory).
Generating the plots in the paper using the R scripts in the plots directory, once the experiments have finished.

Directory structure

aws: contains scripts/config files/results for experiments run on a YARN deployment of Spark using the Amazon Elastic MapReduce service.
data: contains two of the data sets used in the experiments.
local: contains scripts/config files/results for experiments run on a local server in pseudocluster mode.
plots: contains R scripts for generating the plots in the paper.

Output of dblink

Each dblink experiment produces the following files, which are used to populate the tables and generate the plots in the paper:

cluster-size-distribution.csv
diagnostics.csv
evaluation-results.txt
partition-sizes.csv
run.txt

Since running these experiments can be time-consuming (some take approximately 24 hours), we have included the results in the repository. See the local/results and aws/results directories.

For a full description of dblink output, see the documentation here.
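
When re-running experiments, it can help to confirm that every run produced the five files listed above. The following Python sketch does that check. It is not part of dblink or of this repository: the function names are hypothetical, and it assumes each experiment writes its files to its own sub-directory directly under a results directory, which may not match the actual layout.

```python
from pathlib import Path

# File names listed in this README as the output of each dblink experiment.
EXPECTED_OUTPUTS = (
    "cluster-size-distribution.csv",
    "diagnostics.csv",
    "evaluation-results.txt",
    "partition-sizes.csv",
    "run.txt",
)


def missing_outputs(experiment_dir):
    """Return the expected output files that are absent from experiment_dir."""
    experiment_dir = Path(experiment_dir)
    return [
        name for name in EXPECTED_OUTPUTS
        if not (experiment_dir / name).is_file()
    ]


def check_results(results_root):
    """Map each sub-directory of results_root to its list of missing files.

    Assumption: one sub-directory per experiment, directly under
    results_root. Adjust the traversal if the layout differs.
    """
    root = Path(results_root)
    return {
        child.name: missing_outputs(child)
        for child in sorted(root.iterdir())
        if child.is_dir()
    }
```

For example, `check_results("local/results")` would return a dictionary keyed by experiment directory name, with an empty list for every experiment whose output is complete.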

Data sets

We evaluated d-blink on five data sets in our paper. Unfortunately, we are unable to make all of the data sets publicly available due to usage restrictions. Below we describe how to access each data set. Feel free to contact us for further information.

ABSEmployee. A synthetic data set used internally for linkage experiments at the Australian Bureau of Statistics (ABS). It simulates an employment census and two supplementary surveys. The data is available for download here.
NCVR. Two snapshots from the North Carolina Voter Registration database taken two months apart. The snapshots are filtered to include only those voters whose details changed over the two-month period. The data set was generously provided by Peter Christen. We are unable to share it publicly.
NLTCS. A subset of the National Long-Term Care Survey comprising the 1982, 1989 and 1994 waves. We use the SEX, DOB, STATE and REGOFF attributes. The data set is available from NACDA after signing a data use agreement.
SHIW0810. A subset from the Bank of Italy's Survey on Household Income and Wealth comprising the 2008 and 2010 waves. Use of this data is subject to conditions described here. We have written a script which downloads and pre-processes the data, available here.
RLdata10000. A synthetic data set provided with the RecordLinkage R package. We do not have permission to redistribute this data set; however, we have written an R script that saves the data in CSV format.
