ProxCluster

ProxCluster is a modular framework for Incremental Entity Resolution that uses concepts similar to the K-Means algorithm to cluster the duplicates it finds. It was developed as the undergraduate thesis for my Bachelor's degree in Computer Science at UFAPE.

👉Undergraduate thesis: https://github.com/Gust4voSales/proxcluster-deduplicator/blob/main/.github/ProxCluster-TCC.pdf

🔎 Overview

English legend

1: Block records
2: Return blocks
3: IF there are any clusters with the same blocking key value (BKV)
4: ELSE
5: Cluster incremental block
6: Return clusters
7: Statically cluster the block
8: Evaluate clusters
9: Return performance

ProxCluster is divided into 3 modules.

Blocking Module (SoundexBlocking and PhonexStaticBlocking) - Responsible for blocking the records (a very important step for performance)
Matching Module (ProxCluster) - Responsible for comparing, classifying and clustering the records
Evaluation Module (Evaluator) - Responsible for evaluating the clusters with a provided gold standard
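To make the idea of a blocking key value (BKV) concrete, here is a minimal sketch of Soundex-based blocking. It is not the repository's SoundexBlocking class: the function names `soundex` and `block_records` and the `name` field are assumptions made for illustration, and the code follows the standard American Soundex rules.

```python
from collections import defaultdict

# Standard American Soundex digit groups.
_CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        _CODES[letter] = str(digit)


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code of `name`."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        # H and W do not separate letters with the same code; vowels do.
        if c not in "HW":
            prev = digit
    return (code + "000")[:4]


def block_records(records, key_field):
    """Group records (dicts) by the Soundex code of `key_field` (hypothetical helper)."""
    blocks = defaultdict(list)
    for record in records:
        blocks[soundex(record[key_field])].append(record)
    return dict(blocks)


records = [{"name": "Robert"}, {"name": "Rupert"}, {"name": "Ashcraft"}]
blocks = block_records(records, "name")
# "Robert" and "Rupert" share the BKV "R163", so only they are compared with each other.
```

Blocking pays off because the matching step only compares records inside the same block instead of every pair in the dataset.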

Although these modules can be used separately, the Deduplicator class integrates all of them, providing an easy-to-use abstraction for deduplicating your records with an incremental strategy* included.

The sequence diagram above shows the steps taken by the Deduplicator, starting with blocking the records using the SoundexBlocking module. The returned blocks are traversed in a loop, and each block is clustered using the ProxCluster module. Note that, for Incremental Entity Resolution*, the Deduplicator is also responsible for selecting the previously resolved clusters that share the block's BKV. Finally, the performance of the generated clusters can be evaluated using the Evaluator class.
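The control flow described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the framework's actual API: `deduplicate`, `cluster_block` and `similar` are hypothetical names, the matcher is a simple string-similarity check from the standard library standing in for the ProxCluster matching module, and the blocking key is a first-letter stand-in for Soundex.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Toy matcher (assumption): case-insensitive string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def cluster_block(records, clusters=None, threshold=0.8):
    """Put each record in the first cluster whose first member it matches.

    `clusters` holds previously resolved clusters (incremental case);
    when it is None the block is clustered from scratch (static case).
    """
    clusters = [list(c) for c in (clusters or [])]  # copy, do not mutate input
    for record in records:
        for cluster in clusters:
            if similar(record, cluster[0], threshold):
                cluster.append(record)
                break
        else:
            clusters.append([record])  # no match: the record starts a new cluster
    return clusters


def deduplicate(records, blocking_key, resolved=None):
    """`resolved` maps a BKV to the clusters found in earlier runs."""
    resolved = dict(resolved or {})
    blocks = {}
    for record in records:  # steps 1-2: block records
        blocks.setdefault(blocking_key(record), []).append(record)
    for bkv, block in blocks.items():
        if bkv in resolved:  # steps 3 and 5: reuse clusters with the same BKV
            resolved[bkv] = cluster_block(block, resolved[bkv])
        else:  # steps 4 and 7: cluster the block statically
            resolved[bkv] = cluster_block(block)
    return resolved  # step 6: return clusters


def first_letter(name):
    """Hypothetical blocking key used only for this example."""
    return name[0].upper()


state = deduplicate(["Jon Smith", "John Smith", "Mary Jones"], first_letter)
# A second batch arrives later; only the new records are processed.
state = deduplicate(["Jonn Smith", "Maria Lopez"], first_letter, state)
```

In the second call the clusters from the first run are reused, so the old records are never compared with each other again.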

* In this work, Incremental Entity Resolution is a strategy that makes it possible to reuse the results (duplicates found in previous runs) when new records arrive, in contrast to the static approach, where everything (old and new records) would need to be reprocessed from the start.

🛠 Installation

Clone this repo by running the following in your terminal: git clone https://github.com/Gust4voSales/proxcluster-deduplicator
Install the dependencies with pip install -r requirements.txt
You're all set to start using the framework. Check out the experiments folder.

🧪 Experiments

These are the experiments from the thesis; you can use them as examples of how to use the framework.

The static_vs_incremental experiments are a good place to start: they compare the static approach with the incremental one on three different datasets.
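Comparing two approaches comes down to scoring their clusters against a gold standard. The sketch below shows one common way to do that, pairwise precision, recall and F1. It is an assumption that these metrics resemble what the Evaluator class reports; the function names and the record ids are hypothetical.

```python
from itertools import combinations


def duplicate_pairs(clusters):
    """Return the set of unordered record-id pairs placed in the same cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(predicted, gold):
    """Score predicted clusters against gold-standard clusters, pair by pair."""
    pred_pairs = duplicate_pairs(predicted)
    gold_pairs = duplicate_pairs(gold)
    true_pos = len(pred_pairs & gold_pairs)
    precision = true_pos / len(pred_pairs) if pred_pairs else 0.0
    recall = true_pos / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Record ids 1-4: the prediction merges 3 into the wrong cluster.
scores = pairwise_scores(predicted=[[1, 2, 3], [4]], gold=[[1, 2], [3, 4]])
```

Running the same scoring function on the clusters produced by a static run and by an incremental run gives directly comparable numbers.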
