SIGMOD 2022 Programming Contest

Team "BringBackML" - National & Kapodistrian University Of Athens

This is our submitted solution for the SIGMOD 2022 Programming Contest.

Team Members

Advisors:

Yannis Foufoulas
Theofilos Mailis

Contest Results

11th place out of 55 teams
Total (average) Recall Score: 46.9% (1st place: 52.9%)
- 71% on D1 dataset
- 22.7% on D2 dataset
< Runtime to be found > (1st place: 1914 secs)

Task

The task is to perform blocking for Entity Resolution, i.e., quickly filter out non-matches (tuple pairs that are unlikely to represent the same real-world entity) in a limited time to generate a small candidate set that contains a limited number of tuple pairs for matching.
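To make the idea of blocking concrete, here is a minimal sketch of generic token blocking: records that share at least one title token become candidate pairs. This is only an illustration of the general technique, not the algorithm used in this repository; the function name `token_blocking`, the `max_block_size` parameter and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records, max_block_size=50):
    """Return candidate pairs of ids whose titles share at least one token.

    `records` maps instance_id -> title (hypothetical input shape).
    Blocks larger than `max_block_size` are skipped to keep the
    candidate set small.
    """
    blocks = defaultdict(set)
    for instance_id, title in records.items():
        for token in set(title.lower().split()):
            blocks[token].add(instance_id)

    candidates = set()
    for ids in blocks.values():
        if 2 <= len(ids) <= max_block_size:
            # sorted() guarantees each pair is emitted as (smaller, larger)
            candidates.update(combinations(sorted(ids), 2))
    return candidates


records = {
    1: "SanDisk Ultra 32GB",
    2: "sandisk ultra usb 32gb",
    3: "Kingston DataTraveler 64GB",
}
candidates = token_blocking(records)
```

Capping the block size is one simple way to trade recall for a smaller candidate set, which matters because the contest limits the number of output pairs.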

Participants are asked to solve the task on two product datasets. Each dataset is made of a list of instances (rows) and a list of properties describing them (columns). We will refer to each of these datasets as D_i.

For each dataset D_i, participants are provided with the following resources:

X_i : a subset of the instances in D_i
Y_i : matching pairs in X_i x X_i. (The pairs not in Y_i are non-matching pairs.)
Blocking Requirements: the size of the generated candidate set (i.e., the number of tuple pairs in the candidate set)

Note that matching pairs in Y_i are transitively closed (i.e., if A matches with B and B matches with C, then A matches with C). For a matching pair id₁ and id₂ with id₁ < id₂, Y_i only includes (id₁, id₂) and doesn't include (id₂, id₁).
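The two properties above (transitive closure and the id₁ < id₂ ordering) can be sketched with a small union-find. This is an illustrative helper, not code from this repository; `transitive_closure` and the sample pairs are hypothetical names and data.

```python
from itertools import combinations


def transitive_closure(pairs):
    """Return every pair implied by `pairs`, each as (smaller_id, larger_id)."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every matching pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group ids by their cluster representative.
    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), []).append(node)

    # Emit all pairs inside each cluster, smaller id first.
    closed = set()
    for members in clusters.values():
        closed.update(combinations(sorted(members), 2))
    return closed


example_pairs = [(1, 2), (2, 3), (7, 5)]
closure = transitive_closure(example_pairs)
```

Here (1, 2) and (2, 3) imply (1, 3), and (7, 5) is stored as (5, 7).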

The goal is to write a program that generates, for each X_i dataset, a candidate set of tuple pairs for matching X_i with X_i. The output must be stored in a CSV file containing the ids of tuple pairs in the candidate set. The CSV file must have two columns, "left_instance_id" and "right_instance_id", and the output file must be named "output.csv". The separator must be the comma. Note that we do not consider the trivial equi-joins (tuple pairs with left_instance_id = right_instance_id) as true matches. For a pair id₁ and id₂ (assume id₁ < id₂), we only include (id₁, id₂) and don't include (id₂, id₁) in "output.csv".
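The output rules above (two named columns, comma separator, no self-pairs, smaller id first) can be sketched as follows. This is a minimal illustration using the standard `csv` module rather than pandas; the helpers `normalize_pairs` and `write_candidates` are hypothetical and are not taken from the submitted code. An in-memory buffer stands in for the real "output.csv" file.

```python
import csv
import io


def normalize_pairs(pairs):
    """Drop self-pairs, order each pair as (smaller, larger), remove duplicates."""
    seen = set()
    result = []
    for a, b in pairs:
        if a == b:
            continue  # trivial equi-join, never a true match
        pair = (a, b) if a < b else (b, a)
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


def write_candidates(pairs, handle):
    """Write the header and the pairs as comma-separated rows."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["left_instance_id", "right_instance_id"])
    writer.writerows(pairs)


raw = [(4, 1), (1, 4), (2, 2), (1, 2)]
buffer = io.StringIO()
write_candidates(normalize_pairs(raw), buffer)
csv_text = buffer.getvalue()
```

To write a real file, the same function could be called with a handle opened via `open("output.csv", "w", newline="")`.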

Solutions are evaluated over the complete dataset D_i. Note that the instances in D_i (except the sample X_i) are not provided to the participants. More details are available in the Evaluation Process section.

Both X_i and Y_i are in CSV format.

Example of dataset X_i

instance_id	attr_name_1	attr_name_2	...	attr_name_k
00001	value_1	null	...	value_k
00002	null	value_2	...	value_k
...	...	...	...	...

Example of dataset Y_i

left_instance_id	right_instance_id
00001	00002
00001	00003
...	...

More details about the datasets can be found in the dedicated Datasets section.

Example of output.csv

left_instance_id	right_instance_id
00001	00002
00001	00004
...	...

Output.csv format: The evaluation process expects "output.csv" to have 3000000 tuple pairs. The first 1000000 tuple pairs are for dataset X₁ and the remaining pairs are for dataset X₂. As a result, "output.csv" is formatted accordingly. You can check out the provided baseline solution to see how to produce a valid "output.csv".
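A minimal sketch of this fixed-size layout is shown below. The X₂ share of 2,000,000 rows follows from the totals stated above; padding short lists with a (0, 0) filler pair is an assumption made for illustration, so check the baseline solution for the exact convention. The names `fit_to_size` and `build_submission` are hypothetical.

```python
X1_SIZE = 1_000_000
X2_SIZE = 2_000_000  # 3,000,000 total minus the X1 share


def fit_to_size(pairs, size, filler=(0, 0)):
    """Truncate or pad `pairs` so that exactly `size` rows remain.

    The (0, 0) filler is an assumed padding value, not confirmed by the text.
    """
    trimmed = list(pairs[:size])
    trimmed.extend([filler] * (size - len(trimmed)))
    return trimmed


def build_submission(x1_pairs, x2_pairs, x1_size=X1_SIZE, x2_size=X2_SIZE):
    """Concatenate the X1 rows followed by the X2 rows."""
    return fit_to_size(x1_pairs, x1_size) + fit_to_size(x2_pairs, x2_size)


# Tiny sizes keep the demo readable; real runs would use the defaults.
demo = build_submission([(1, 2)], [(3, 4), (5, 6), (7, 8)], x1_size=2, x2_size=2)
```

In the demo, the X₁ part is padded with one filler row and the X₂ part is truncated to two rows.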

Solution Requirements

Python 3.8 or newer
pandas
frozendict
ReproZip was used for packing the solution and executing the submitted solutions, but is not required.

Compatibility

Python Versions:
- Python 3.8.10
- PyPy 7.3.9 (Python 3.9.2)
OS:
- WSL Ubuntu 20.04

Repository Content

baseline directory:
- blocking.py: The provided baseline solution
datasets directory:
- X1.csv (X1 dataset) & Y1.csv (matching pairs for X1)
- X2.csv (X2 dataset) & Y2.csv (matching pairs for X2)
output_misc directory: To store secondary .csv files, used for analyzing the main output.csv file (see below)
src directory:
- Submitted files:
  - run.py: Starting point of the solution
  - x1_blocking.py: X1-specific solution logic, definitions & routines
  - x2_blocking.py: X2-specific solution logic, definitions & routines
  - utils.py: General definitions used by both solutions
- output.csv: Non-formatted output for the given X1 dataset
- Scripts for quick usage of ReproZip:
  - traceAndPack.sh: Run run.py and pack the execution in submission.rpz
  - cleanReprozip.sh: Clean all files and directories generated by ReproZip (including submission.rpz)
- Scripts for analyzing the solution performance & output.csv
  - compare.py: Find correct, missed & false positive pairs in output.csv and store them (with titles) in corresponding .csv files, in the output_misc directory. Also display the number of pairs in each category, as well as the Recall score.
  - Bash scripts for separating the .csv files generated by compare.py by brand, and storing the brand-specific .csv's in output_misc/false, output_misc/missed and output_misc/common.
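
The categories reported by compare.py (correct, missed and false positive pairs, plus Recall) can be sketched with plain set operations. This is an illustrative reimplementation under the assumption that both inputs are lists of (smaller_id, larger_id) tuples; it is not the actual contents of compare.py, and `compare_pairs` is a hypothetical name.

```python
def compare_pairs(candidates, truth):
    """Split candidate pairs into correct / missed / false positives and compute recall."""
    candidates, truth = set(candidates), set(truth)
    correct = candidates & truth          # true matches that were found
    missed = truth - candidates           # true matches that were not found
    false_positives = candidates - truth  # candidates that are not true matches
    recall = len(correct) / len(truth) if truth else 0.0
    return correct, missed, false_positives, recall


truth = [(1, 2), (1, 3), (2, 3), (5, 7)]
candidates = [(1, 2), (2, 3), (4, 6)]
correct, missed, false_positives, recall = compare_pairs(candidates, truth)
```

Recall is the fraction of true matching pairs present in the candidate set, which is the score reported in the Contest Results section.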

Execution

In src directory:

Simple Execution:
- Choose the desired Dataset to run the experiment on: In utils.py, set TARGET_DATASET accordingly.
- If you wish to format output.csv to have precisely 3,000,000 rows, set SUBMISSION_MODE to True. To skip the solution for a dataset, set IGNORE_DATASET to '1' or '2' ('' to not skip).
- Run the solution:
```
  python3 run.py
```
- To see the stats for the answer generated by the solution:
```
  python3 compare.py
```
Execute & Pack with ReproZip:
- Select the desired dataset & parameters as above.
- Run the solution & pack in submission.rpz:
```
  ./traceAndPack.sh
```
- To clean-up the generated files, including submission.rpz:
```
  ./cleanReprozip.sh
```

Algorithm Description

...
