dedupe.io with Prodi.gy

This is a custom recipe for linking records across multiple datasets using the Python dedupe library. For comparison, see https://github.com/dedupeio/dedupe-examples/tree/master/record_linkage_example for an example of linking records with dedupe's console labeler.

Usage

Install Prodi.gy

Once Prodigy is installed, you should be able to run the prodigy command from your terminal, either directly or via python -m prodigy.

Install Requirements

pip install -r requirements.txt

Run with example datasets

python -m prodigy records.link my_dataset --left data/raw_dedupe_abtbuy_abt.csv --right data/raw_dedupe_abtbuy_buy.csv --fields fields.json -F ./link_records.py
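Before starting a session, it can help to confirm that every field named in the fields file exists in both CSV headers. The following is a minimal sketch, not part of the recipe: the helper name missing_fields is hypothetical, and it assumes the fields file is a JSON list of objects that each carry a "field" key, which may not match the actual layout of fields.json.

```python
import csv
import io
import json


def missing_fields(fields_json: str, left_csv: str, right_csv: str) -> dict:
    """Return the field names that are absent from each CSV header."""
    # Assumed layout: a JSON list of objects that each carry a "field" key.
    wanted = [entry["field"] for entry in json.loads(fields_json)]

    def header(text: str) -> list:
        # The first CSV row is treated as the header; empty input gives [].
        return next(csv.reader(io.StringIO(text)), [])

    left_cols, right_cols = header(left_csv), header(right_csv)
    return {
        "left": [name for name in wanted if name not in left_cols],
        "right": [name for name in wanted if name not in right_cols],
    }


# Tiny in-memory stand-ins for the fields file and the two CSV files
# (illustrative values only, not taken from the example datasets).
example_fields = (
    '[{"field": "title", "type": "String"},'
    ' {"field": "price", "type": "String"}]'
)
example_left = "unique_id,title,price\n1,Sony Turntable,100\n"
example_right = "unique_id,title\n7,Sony Turntable PSLX350H\n"

print(missing_fields(example_fields, example_left, example_right))
# {'left': [], 'right': ['price']}
```

To check real files, read their contents with open(...).read() and pass the resulting strings in.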

Annotating

In the interface, a row is highlighted green if the field has an exact string match across both datasets; rows whose values differ are not highlighted green.
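The exact-string-match check behind the highlighting can be sketched as follows. This is an illustration under assumptions, not the recipe's code: the function name exact_match_rows and the sample records are hypothetical, and a field missing from either record is treated as a non-match.

```python
def exact_match_rows(left: dict, right: dict, fields: list) -> dict:
    """Map each field name to True when both records hold the same string."""
    result = {}
    for name in fields:
        if name in left and name in right:
            # Plain string equality: no trimming or case folding is applied.
            result[name] = left[name] == right[name]
        else:
            # Assumption: a field missing on either side never counts as a match.
            result[name] = False
    return result


# Hypothetical pair of records, one from each dataset.
left_record = {"title": "Sony Turntable", "price": "100"}
right_record = {"title": "Sony Turntable", "price": "99.99"}

print(exact_match_rows(left_record, right_record, ["title", "price"]))
# {'title': True, 'price': False}
```

Because the comparison is exact, values that differ only in case or whitespace would not be highlighted, which is why a visually similar pair can still show unhighlighted rows.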

If you think the records are duplicates, as they are in the image above, accept; otherwise, reject.

When you click the save button, your progress will be updated.

Progress reaches 100% once you have labeled at least 10 positive and 10 negative examples, the minimum the dedupe library recommends.
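One way a progress figure could be derived from the 10 positive and 10 negative target is sketched below. The function name annotation_progress and the formula are assumptions for illustration; the recipe may compute progress differently.

```python
def annotation_progress(accepted: int, rejected: int, minimum: int = 10) -> float:
    """Return progress in [0.0, 1.0] toward `minimum` examples of each label."""
    # Cap each label at the minimum so extra accepts cannot make up
    # for missing rejects, and vice versa.
    capped = min(accepted, minimum) + min(rejected, minimum)
    return capped / (2 * minimum)


print(annotation_progress(10, 3))   # 0.65
print(annotation_progress(25, 10))  # 1.0
```

Capping each label separately means progress only reaches 100% when both targets are met, which matches the requirement stated above.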

Model training

Once you end the annotation session, a model is batch trained and evaluated on the rest of your dataset. The records the model thinks should be conflated are written to a file named data_matching_output.csv, and a copy of the dedupe model settings is saved to data_matching_learned_settings.
