incremental matching #1183
Replies: 5 comments
-
@fgregg .. is there an example anywhere of performing incremental matching with dedupe? Is it feasible to do this?
-
once you have a good set of matches, you can turn those into gazetteer matches. the gazetteer class has methods for adding new records to the gazetteer.
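A toy, stdlib-only sketch of that index/search/add shape (this is NOT the real dedupe API — dedupe's Gazetteer learns blocking and scoring from training data; a plain string-similarity ratio stands in here just to show the flow of indexing canonical records, searching new ones, and growing the index incrementally):

```python
from difflib import SequenceMatcher

class ToyGazetteer:
    """Toy stand-in for a gazetteer: canonical records in, matches out."""

    def __init__(self):
        self.canon = {}  # canon_id -> canonical record (a string here)

    def index(self, records):
        """Add canonical records; call again later to grow the index."""
        self.canon.update(records)

    def search(self, record, threshold=0.8):
        """Return (canon_id, score) for the best match, or None."""
        if not self.canon:
            return None
        cid, score = max(
            ((cid, SequenceMatcher(None, record, text).ratio())
             for cid, text in self.canon.items()),
            key=lambda pair: pair[1])
        return (cid, score) if score >= threshold else None

gaz = ToyGazetteer()
gaz.index({1: "acme corp", 2: "globex inc"})
hit = gaz.search("acme corp.")   # near-duplicate of canonical record 1
gaz.index({3: "initech llc"})    # incrementally add a new canonical record
```

The key property is that `index` can be called repeatedly, so confirmed matches (or genuinely new records) can be folded back into the canonical set without rebuilding everything.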
-
Excited to hear it's feasible.. Sorry for not seeing it clearly.. but is https://dedupeio.github.io/dedupe-examples/docs/gazetteer_example.html the example, where canon_file is the pre-matched data and messy_file is the new records? I'll dig into that if so.. thanks
-
if I'm reading that example right.. the canon_file is loaded entirely into the .index method of the gazetteer, and then I can cycle through the messy_file records asking the gazetteer to .search for possible matches, and I get back the matches as a tuple with the messy id, the canon id, and a score. Seems like a 'settings' file can provide other details (I've yet to see what is in these settings files, so I assume it has the training data and various parameters.. not sure what else). If I'm down the wrong path.. please fire over even just pseudo-code steps for how to run an incremental match. Thanks @fgregg !
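The loop described above — index the canonical file, cycle through messy records, collect (messy id, canon id, score) tuples — can be sketched with a toy stdlib scorer (the real dedupe `.search` call, its return shape, and the settings-file handling differ, so treat every name here as a placeholder):

```python
from difflib import SequenceMatcher

def best_match(messy_record, canon, threshold=0.8):
    """Score one messy record against every canonical record."""
    scored = ((cid, SequenceMatcher(None, messy_record, text).ratio())
              for cid, text in canon.items())
    cid, score = max(scored, key=lambda pair: pair[1])
    return (cid, score) if score >= threshold else (None, score)

# canon: the pre-matched data; messy: the new records
canon = {"c1": "jane doe", "c2": "john smith"}
messy = {"m1": "jon smith", "m2": "zzz unrelated"}

results = []
for mid, rec in messy.items():
    cid, score = best_match(rec, canon)
    if cid is not None:
        results.append((mid, cid, score))  # (messy id, canon id, score)
# "jon smith" matches canonical "john smith"; "zzz unrelated" matches nothing
```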
-
@fgregg ... is there an example of using the gazetteer in a way that avoids loading the entire canon_file into memory?
-
Say I've run dedupe on millions of records, and now have the entity map sorted out, and it all looks good: all my matched records are grouped by their canon id (I use "canon id" as in the example https://dedupeio.github.io/dedupe-examples/docs/mysql_example.html).
I now have 1 new record that comes along, and I just want to add it to the best group possible (or confirm it has no good match). Is there a way to do that without rerunning matching on everything?
What is the best way to do this, with performance in mind?
I don't want to re-process the millions of records.. I trust that they are not going to change. I just want to 'add' this new record to one of the canon ids.
The blocking map table from the previous run with millions of records is also still available, in case there is a way to make use of it.
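One stdlib-only sketch of how a leftover blocking map could narrow the work for a single new record (the table layout, column names, and the surname-based block key below are all made up for illustration; dedupe's real blocking map is built from learned predicates, but the shape is similar: block key -> record id):

```python
import sqlite3

# Hypothetical blocking_map(block_key, canon_id) left over from the big
# run: only canonical records sharing a block key with the new record
# need a detailed comparison, not all the millions of records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blocking_map (block_key TEXT, canon_id INTEGER)")
conn.executemany("INSERT INTO blocking_map VALUES (?, ?)",
                 [("smith", 1), ("smith", 2), ("jones", 3)])

def block_keys(record):
    # stand-in for dedupe's learned predicates: here, just the surname
    return [record.split()[-1]]

new_record = "john smith"
keys = block_keys(new_record)
placeholders = ",".join("?" * len(keys))
candidates = {cid for (cid,) in conn.execute(
    f"SELECT DISTINCT canon_id FROM blocking_map"
    f" WHERE block_key IN ({placeholders})", keys)}
# candidates now holds only the canon ids worth scoring in detail
```

After this, only the candidate records need to be pulled from the database and scored, so the cost per new record scales with the block size rather than the full canonical table.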