Skip to content

theodoratrz/sigmodContest2022

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

85 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

SIGMOD 2022 Programming Contest

Team "BringBackML" - National & Kapodistrian University Of Athens


This is our submitted solution for the SIGMOD 2022 Programming Contest.

Team Members

Advisors:

  • Yannis Foufoulas
  • Theofilos Mailis

Contest Results

  • 11th place out of 55 teams
  • Total (average) Recall Score: 46.9% (1st place: 52.9%)
    • 71% on D1 dataset
    • 22.7% on D2 dataset
  • < Runtime to be found > (1st place: 1914 secs)

Task

The task is to perform blocking for Entity Resolution, i.e., quickly filter out non-matches (tuple pairs that are unlikely to represent the same real-world entity) in a limited time to generate a small candidate set that contains a limited number of tuple pairs for matching.

Participants are asked to solve the task on two product datasets. Each dataset is made of a list of instances (rows) and a list of properties describing them (columns). We will refer to each of these datasets as Di.

For each dataset Di, participants are provided with the following resources:

  • Xi : a subset of the instances in Di
  • Yi : matching pairs in Xi x Xi. (The pairs not in Yi are non-matching pairs.)
  • Blocking Requirements: the size of the generated candidate set (i.e., the number of tuple pairs in the candidate set)

Note that matching pairs in Yi are transitively closed (i.e., if A matches with B and B matches with C, then A matches with C). For a matching pair id1 and id2 with id1 < id2, Yi only includes (id1, id2) and doesn't include (id2, id1).

The goal is to write a program that generates, for each Xi dataset, a candidate set of tuple pairs for matching Xi with Xi. The output must be stored in a CSV file containing the ids of tuple pairs in the candidate set. The CSV file must have two columns: "left;_instance_id" and "right;_instance_id" and the output file must be named "output.csv;".; The separator must be the comma. Note that we do not consider the trivial equi-joins (tuple pairs with left_instance_id = right_instance_id) as true matches. For a pair id1 and id2 (assume id1 < id2), we only include (id1, id2) and don't include (id2, id1) in "output.csv".

Solutions are evaluated over the complete dataset Di. Note that the instances in Di (except the sample Xi) are not provided to the participants. More details are available in the Evaluation Process section.

Both Xi and Yi are in CSV format.

Example of dataset Xi <style> table { border-collapse:collapse } td, th { border:1px solid #ddd; padding:8px; } </style>

instance_idattr_name_1attr_name_2...attr_name_k
00001value_1null...value_k
00002nullvalue_2...value_k
...............

Example of dataset Yi

left_instance_idright_instance_id
0000100002
0000100003
......

More details about the datasets can be found in the dedicated Datasets section.

Example of output.csv

left_instance_idright_instance_id
0000100002
0000100004
......

Output.csv format: The evaluation process expects "output.csv" to have 3000000 tuple pairs. The first 1000000 tuple pairs are for dataset X1 and the remaining pairs are for datasets X2. As a result, "output.csv" is formatted accordingly. You can check out the provided baseline solution on how to produce a valid "ouput.csv".

Solution Requirements

  • Python 3.8 or newer
  • pandas
  • frozendict
  • ReproZip was used for packing the solution and executing the submitted solutions, but is not required.

Compatibility

  • Python Versions:
    • Python 3.8.10
    • PyPy 7.3.9 (Python 3.9.2)
  • OS:
    • WSL Ubuntu 20.04

Repository Content

  • baseline directory:
    • blocking.py: The provided baseline solution
  • datasets directory:
    • X1.csv (X1 dataset) & Y1.csv (matching pairs for X1)
    • X2.csv (X2 dataset) & Y2.csv (matching pairs for X2)
  • output_misc directory: To store secondary .csv files, used for analyzing the main output.csv file (see below)
  • src directory:
    • Submitted files:
      • run.py: Starting point of the solution
      • x1_blocking.py: X1-specific solution logic, definitions & routines
      • x2_blocking.py: X2-specific solution logic, definitions & routines
      • utils.py: General definitions used by both solutions
    • output.csv: Non-formatted output for the given X1 dataset
    • Scripts for quick usage of ReproZip:
      • traceAndPack.sh: Run run.py and pack the execution in submission.rpz
      • cleanReprozip.sh: Clean all files and directories generated by ReproZip (including submission.rpz)
    • Scripts for analyzing the solution performance & output.csv
      • compare.py: Find correct, missed & false positive pairs in output.csv and store them (with titles) in corresponding .csv files, in the output_misc directory. Also display the number of pairs in each category, as well as the Recall score.
      • Bash scripts for separating the .csv files generated by compare.py by brand, and storing the brand-specific .csv's in output_misc/false, output_misc/missed and output_misc/common.

Execution

In src directory:

  • Simple Execution:

    • Choose the desired Dataset to run the experiment on: In utils.py, set TARGET_DATASET accordingly.

    • If you wish to format output.csv to have precisely 3,000,000 rows, set SUBMISSION_MODE to True. To skip the solution for a dataset, set IGNORE_DATASET to '1' or '2' ('' to not skip).

    • Run the solution:

        python3 run.py
      
    • To see the stats for the answer generated by the solution:

        python3 compare.py
      
  • Execute & Pack with ReproZip

    • Select the desired dataset & parameters as above.

    • Run the solution & pack in submission.rpz:

        ./traceAndPack
      
    • To clean-up the generated files, including submission.rpz:

        ./cleanReprozip
      

Algorithm Description

...

About

Submitted solution for the ACM SIGMOD 2022 Programming Contest πŸ’»πŸ…

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published