Evnsn/awsome-entity-resolution

(work in progress) Awesome Entity Resolution

A collection of awesome resources regarding Entity Resolution.

Table of Contents

  • Books
  • Papers
  • Frameworks
  • Datasets
  • Projects
  • Miscellaneous

👉 What's Record Linkage?

Entity Resolution (ER) aims to identify different descriptions that refer to the same real-world object. Detecting such descriptions within a single database is referred to as deduplication, while record linkage refers to detecting them across two different databases.
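
The difference between the two tasks can be illustrated with a minimal sketch. The matcher below (exact equality after crude normalization) and the sample names are assumptions made for illustration only; real ER systems use far more tolerant similarity measures.

```python
from itertools import combinations, product


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (a deliberately crude matcher)."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def same_entity(a: str, b: str) -> bool:
    """Hypothetical match rule: two descriptions match if their normalized forms are equal."""
    return normalize(a) == normalize(b)


def deduplicate(records):
    """Deduplication: compare records WITHIN one collection."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if same_entity(a, b)
    ]


def link(left, right):
    """Record linkage: compare records ACROSS two collections."""
    return [
        (i, j)
        for (i, a), (j, b) in product(enumerate(left), enumerate(right))
        if same_entity(a, b)
    ]


# Made-up sample data.
customers = ["Jane Doe", "jane  doe.", "John Q. Public"]
subscribers = ["JOHN Q PUBLIC", "Mary Major"]

print(deduplicate(customers))        # [(0, 1)]
print(link(customers, subscribers))  # [(2, 0)]
```

Both functions compare every possible pair, which grows quadratically with the number of records; this is why the tools listed further down rely on blocking to reduce the number of comparisons.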

📚 Books

  1. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection by Peter Christen (2012)
  2. Data Quality and Record Linkage Techniques by Thomas N. Herzog, Fritz J. Scheuren & William E. Winkler (2007)

📃 Papers

Surveys

  • 2020 | An Overview of End-to-End Entity Resolution for Big Data | Vassilis Christophides, et al. | pdf
  • 2012 | A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication | Peter Christen | pdf
  • 2007 | Duplicate Record Detection: A Survey | Ahmed K. Elmagarmid, et al. | pdf

Entity Matching Management Systems

  • 2016 | Magellan: Toward Building Entity Matching Management Systems | Pradap Konda, et al. | pdf | git

Generic Entity Resolution Techniques

Indexing

Debugging of blocking

Pair comparison

...

Miscellaneous (impactful papers?)

  • 1969 | A Theory for Record Linkage | Fellegi, I.P., Sunter, A.B. | pdf

Classification

Supervised

...

Unsupervised

  • 2019 | Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records | T. Enamorado, et al. | pdf | git

Clustering /

  • 2020 | Entity Matching in the Wild: A Consistent and Versatile Framework to Unify Data in Industrial Applications | Yan Yan, et al. | pdf

🔨 Frameworks

Table 1 compiles the tools presented in (2020, V. Christophides), (2015, P. Konda) and J535D165/data-matching-software.

Table 1: Blocking: Attribute equivalence (AE), Blocking index (BI), Canopy clustering (CC), Canopy index (CI), Clustering (C), Expectation maximization (EM), Full index (FI), Hash-based (HB), Hybrid (H), Induction (I), Predicate-based (PB), Probabilistic (P), Relational clustering (RC), Rule-based (RB), Sorted neighborhood (SN), Sorting index (SoI), Stringmap index (StI), Suffixarray index (SuI). Matching: Agglomerative hierarchical clustering-based (AHC), Decision trees (DT), Farthest First (FF), Fellegi-Sunter (FS), k-Nearest-neighbour (KNN), Logistic regression (LR), Optimal threshold (OT), Support vector machine (SVM), TwoStep (TS).

| Tools | Blocking | Matching | Clustering | UI | Scaling | Language | OSS | Git/Install | Paper |
|---|---|---|---|---|---|---|---|---|---|
| Active Atlas | HB | DT | --- | GUI, CMD | ❌ | Java | ❌ | --- | --- |
| Atyimo | --- | --- | --- | --- | --- | Python | --- | git | --- |
| BigMatch | AE, RB | ❌ | --- | CMD | ✔️ | C | ❌ | --- | (2002, W. E. Yancey) |
| D-Dupe | AE | RC | --- | GUI, CMD | ❌ | C# | ❌ | --- | (2006, M. Bilgic) |
| Dedoop | AE, SN | DT, LR, SVM, etc. | --- | GUI | Hadoop | Java | ❌ | install | (2012, L. Kolb) |
| Dedupe | CC, PB | AHC | ❌ | API, CMD | ✔️ | Python | ✔️ | git | (2003, M. Bilenko), (2006, M. Bilenko) |
| DuDe | SN | RB | ❌ | CMD | ❌ | Java | ✔️ | install | (2010, U. Draisbach) |
| Duke | ✔️ | ✔️ | --- | CMD | ❌ | Java | --- | git | Blog: (2011, L. Marius) |
| FAMER | ❌ | ❌ | ✔️ | --- | Apache Flink | --- | --- | gitlab | (2018, A. Saeedi) |
| fastLink | ❌ | --- | --- | API | ✔️ | R | ✔️ | git | (2017, T. Enamorado) |
| Febrl | BI, CI, FI, SoI, StI, SuI, Q-gram | FS, OT, K-means, FF, SVM, TS | ❌ | GUI | ❔ | Python | ✔️ | install | (2013, P. Christen) |
| FRIL | AE, SN | EM | ❌ | GUI | ❔ | Java | ✔️ | install | (2008, P. Jurczyk) |
| JedAI | ✔️ | ✔️ | ✔️ | GUI | Apache Spark | Java | ✔️ | git | (2020, G. Papadakis) |
| KnoFuss | ✔️ | ✔️ | --- | --- | ❌ | Java | --- | --- | (2008, A. Nikolov) |
| LIMES | --- | --- | --- | GUI | ❌ | Java | ✔️ | git | (2011, A. C. N. Ngomo) |
| Magellan | ✔️ | ✔️ | ❌ | API, GUI | Apache Spark | Python | ✔️ | git | (2016, P. Konda) |
| MARLIN | CC | DT, SVM | --- | ❌ | --- | --- | --- | --- | (2004, M. Bilenko) |
| Merge Toolbox | AE, CC | P, EM | --- | GUI | ❌ | Java | ❌ | install | (2004, R. Schnell) |
| MinoanER | ✔️ | ✔️ | ❌ | GUI | Apache Spark | Java | ✔️ | --- | (2019, V. Efthymiou) |
| NADEEF | --- | RB | --- | GUI | ❌ | Java | ❌ | --- | (2013, M. Dallachiesa) |
| OYSTER | AE | RB | --- | CMD | ❌ | Java | ✔️ | install | (2011, E. D. Nelson) |
| PRIL | --- | --- | --- | GUI | --- | C# | --- | git | (2018, C. T. Rentsch) |
| pydedupe | AE | KNN, K-means, RB | --- | CMD | ❌ | Python | ✔️ | git | --- |
| Reclin2 | --- | --- | --- | API | --- | R | --- | git | --- |
| RELAIS | --- | --- | --- | GUI | --- | R/Java | --- | install | (2006, M. Fortini) |
| RLTK | --- | --- | --- | API | --- | Python | ✔️ | git | --- |
| Record Linkage (R) | AE | ML-based | --- | CMD | ❌ | R | ✔️ | cran | (2011, M. Sariyar) |
| Record Linkage (Python) | FI, BI, SN | DC, LR, SVM, K-means, EM | --- | API | ❌ | Python | ✔️ | git | 2015, inspired by FEBRL |
| SERIMI | ✔️ | ✔️ | --- | --- | --- | Ruby | --- | git | (2015, S. Araujo) |
| SERF | --- | R-swoosh | --- | CMD | ❌ | Java | ❌ | git | (2009, O. Benjelloun) |
| Splink | ✔️ | EM, etc? | ✔️ | API, GUI | Apache Spark | Python | ✔️ | git | 2019, same as fastLink |
| Silk | --- | RB | --- | GUI | Hadoop | Scala | ✔️ | git, install | (2009, J. Volz) |
| TAILOR | AE, SN | P, C, H, I | --- | GUI | ❌ | Java | ❌ | --- | (2002, M. G. Elfeky) |
| WHIRL | --- | --- | --- | CMD | ❌ | C++ | ❌ | install | (2000, W. W. Cohen) |
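
Two of the blocking schemes abbreviated in Table 1, attribute equivalence (AE) and sorted neighborhood (SN), can be sketched in a few lines. This is an illustrative sketch, not the implementation used by any tool in the table; the function names, the sample records, and the choice of blocking keys are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def attribute_equivalence_pairs(records, key):
    """AE blocking: only records with an identical blocking key become candidate pairs."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs


def sorted_neighborhood_pairs(records, key, window=2):
    """SN blocking: sort by key, then pair each record with the next window-1 records."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Made-up sample records.
records = [
    {"name": "smith", "zip": "1000"},
    {"name": "smyth", "zip": "1000"},
    {"name": "jones", "zip": "2000"},
    {"name": "smith", "zip": "2000"},
]

ae = attribute_equivalence_pairs(records, key=lambda r: r["zip"])
sn = sorted_neighborhood_pairs(records, key=lambda r: r["name"], window=2)

print(sorted(ae))  # [(0, 1), (2, 3)]
print(sorted(sn))  # [(0, 2), (0, 3), (1, 3)]
```

With four records there are six possible pairs; AE on the zip code keeps two of them and SN with a window of 2 keeps three. Each scheme misses pairs the other finds, which is one reason several tools in the table combine multiple blocking methods.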

Datasets

📌 Miscellaneous