Entity resolution (also known as data matching, data linkage, record linkage, and many other terms) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and databases). It is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, national identification number), which may be due to differences in record shape, storage location, or curator style or preference.
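As a rough illustration of linking records that share no common identifier, the sketch below matches names across two small sources by string similarity. It uses only Python's standard-library difflib; the sample data, the helper names (normalize, similarity, link_records), and the 0.85 threshold are assumptions made for this example and are not taken from any project listed here.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return difflib's similarity ratio in [0, 1] for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(left, right, threshold=0.85):
    """For each left record, keep its best-scoring right record if the score meets the threshold.

    Both inputs map a record id to a name. The threshold is an arbitrary
    illustrative choice, not a recommended value.
    """
    links = []
    for left_id, left_name in left.items():
        best_id, best_score = None, 0.0
        for right_id, right_name in right.items():
            score = similarity(left_name, right_name)
            if score > best_score:
                best_id, best_score = right_id, score
        if best_id is not None and best_score >= threshold:
            links.append((left_id, best_id, best_score))
    return links


# Hypothetical sample data: two sources with different ids and spellings.
source_a = {"a1": "Jonathan  Smith", "a2": "Maria Garcia"}
source_b = {"b7": "maria garcia", "b9": "Jonathon Smith", "b3": "Wei Chen"}

links = link_records(source_a, source_b)
for left_id, right_id, score in links:
    print(left_id, right_id, round(score, 3))
```

Comparing every pair is quadratic in the number of records, which is why several of the tools in this list add a blocking step to narrow the candidate pairs first.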
The SQL/Ibis powered sklearn of record linkage
An R package for blocking records for record linkage / data deduplication based on approximate nearest neighbours algorithms.
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
Backend (Docker & API) for matchID project
PySpark implementation of the Open Privacy Preserving Record Linkage (OPPRL) specification.
Supplementary code for "Class ratio and its implications for reproducibility and performance in record linkage" presented at The Pacific-Asia Conference on Knowledge Discovery and Data Mining 2024.
Interpretable metadata for the results of NHS England record linkage
Curated list of awesome software and resources for Senzing, The First Real-Time AI for Entity Resolution.
An exploration of generalizable approaches to unsupervised entity matching for use in linking tabular public energy data sources.
🔎 Finds fuzzy matches between datasets
🔎 Finds fuzzy matches between CSV files
Fast, accurate, open-source geocoding in Python
A convenient way to link, deduplicate, aggregate and cluster data(frames) in Python using deep learning
Example scripts for generating data with Gecko
Python library for the generation and mutation of realistic personal identification data at scale
LinkOrgs: An R package for linking records on organizations using half a billion open-collaborated records from LinkedIn
Record linkage - simple, flexible, efficient.
🕸️ Little helper for handling entity clusters
The StringMetrics project implements 7 string metric algorithms: Hamming, Dice, Jaro, Jaro-Winkler, Soundex, Levenshtein, and Damerau-Levenshtein. Metrics compare strings through the IMetric interface and return an approximate similarity score from 0 (no match) to 1 (exact match), useful in data cleansing, record linkage, NLP, fraud detection, etc.
(Archived) A Python library for record linkage and deduplication.
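Several of the projects above rely on string metrics that score similarity from 0 (no match) to 1 (exact match). The following is a minimal sketch of one such score built on Levenshtein edit distance. The normalization used here (one minus the distance divided by the longer string's length) is an assumption chosen for illustration; it is not the API or the exact formula of any listed project, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to use less memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Map edit distance to a score in [0, 1]; two empty strings count as an exact match."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(round(levenshtein_similarity("kitten", "sitting"), 3))
```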
Created by Halbert L. Dunn
Released 1946