incremental matching #1183
Replies: 5 comments
-
@fgregg .. is there an example anywhere of performing incremental matching with dedupe? Is it feasible to do this?
-
once you have a good set of matches, you can turn those into gazetteer matches. the gazetteer class has methods for adding new records to the gazetteer.
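A toy, stdlib-only sketch of that index/search/add shape (this is NOT the real dedupe API — dedupe's Gazetteer learns blocking and scoring from training data; a plain string-similarity ratio stands in here just to show the flow of indexing canonical records, searching new ones, and growing the index incrementally):

```python
from difflib import SequenceMatcher

class ToyGazetteer:
    """Toy stand-in for a gazetteer: canonical records in, matches out."""

    def __init__(self):
        self.canon = {}  # canon_id -> canonical record (a string here)

    def index(self, records):
        """Add canonical records; call again later to grow the index."""
        self.canon.update(records)

    def search(self, record, threshold=0.8):
        """Return (canon_id, score) for the best match, or None."""
        if not self.canon:
            return None
        cid, score = max(
            ((cid, SequenceMatcher(None, record, text).ratio())
             for cid, text in self.canon.items()),
            key=lambda pair: pair[1])
        return (cid, score) if score >= threshold else None

gaz = ToyGazetteer()
gaz.index({1: "acme corp", 2: "globex inc"})
hit = gaz.search("acme corp.")   # near-duplicate of canonical record 1
gaz.index({3: "initech llc"})    # incrementally add a new canonical record
```

The key property is that `index` can be called repeatedly, so confirmed matches (or genuinely new records) can be folded back into the canonical set without rebuilding everything.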
-
Excited to hear it's feasible.. Sorry for not seeing it clearly.. but is https://dedupeio.github.io/dedupe-examples/docs/gazetteer_example.html the example, where canon_file is the pre-matched data and messy_file is the new records? I'll dig into that if so.. thanks
-
if I'm reading that example right.. the canon_file is loaded entirely into the .index method of the gazetteer, and then I can cycle through the messy_file records asking the gazetteer to .search for possible matches, and I get back the matches as a tuple with the messy id, the canon id, and a score. Seems like a 'settings' file can provide other details (I've yet to see what is in these settings files, so I assume it has the training data and various parameters.. not sure what else). If I'm down the wrong path.. please fire over even just pseudo-code steps for how to run an incremental match. Thanks @fgregg !
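The loop described above — index the canonical file, cycle through messy records, collect (messy id, canon id, score) tuples — can be sketched with a toy stdlib scorer (the real dedupe `.search` call, its return shape, and the settings-file handling differ, so treat every name here as a placeholder):

```python
from difflib import SequenceMatcher

def best_match(messy_record, canon, threshold=0.8):
    """Score one messy record against every canonical record."""
    scored = ((cid, SequenceMatcher(None, messy_record, text).ratio())
              for cid, text in canon.items())
    cid, score = max(scored, key=lambda pair: pair[1])
    return (cid, score) if score >= threshold else (None, score)

# canon: the pre-matched data; messy: the new records
canon = {"c1": "jane doe", "c2": "john smith"}
messy = {"m1": "jon smith", "m2": "zzz unrelated"}

results = []
for mid, rec in messy.items():
    cid, score = best_match(rec, canon)
    if cid is not None:
        results.append((mid, cid, score))  # (messy id, canon id, score)
# "jon smith" matches canonical "john smith"; "zzz unrelated" matches nothing
```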
-
@fgregg ... is there an example of using the gazetteer in a way that avoids loading the entire canon_file into memory?
-
Say I've run dedupe on millions of records, and now have the entity map sorted out, and it all looks good: all my matched records are grouped by their canon id (I use "canon id" as in the example https://dedupeio.github.io/dedupe-examples/docs/mysql_example.html).
I now have 1 new record that comes along, and I just want to add it to the best group possible (or confirm it has no good match). Is there a way to do that without rerunning matching on everything?
What is the best way to do this, with performance in mind?
I don't want to re-process the millions of records.. I trust that they are not going to change. I just want to 'add' this new record to one of the canon ids.
The blocking map table from the previous run with millions of records is also still available, in case there is a way to make use of it.
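One stdlib-only sketch of how a leftover blocking map could narrow the work for a single new record (the table layout, column names, and the surname-based block key below are all made up for illustration; dedupe's real blocking map is built from learned predicates, but the shape is similar: block key -> record id):

```python
import sqlite3

# Hypothetical blocking_map(block_key, canon_id) left over from the big
# run: only canonical records sharing a block key with the new record
# need a detailed comparison, not all the millions of records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blocking_map (block_key TEXT, canon_id INTEGER)")
conn.executemany("INSERT INTO blocking_map VALUES (?, ?)",
                 [("smith", 1), ("smith", 2), ("jones", 3)])

def block_keys(record):
    # stand-in for dedupe's learned predicates: here, just the surname
    return [record.split()[-1]]

new_record = "john smith"
keys = block_keys(new_record)
placeholders = ",".join("?" * len(keys))
candidates = {cid for (cid,) in conn.execute(
    f"SELECT DISTINCT canon_id FROM blocking_map"
    f" WHERE block_key IN ({placeholders})", keys)}
# candidates now holds only the canon ids worth scoring in detail
```

After this, only the candidate records need to be pulled from the database and scored, so the cost per new record scales with the block size rather than the full canonical table.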