Cluster ID is different for exact records #88
The current code misses any record that has exact duplicates but no near duplicates, because such records never appear in the `clustered_dupes` returned from `deduper.match()`.
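To illustrate the gap described above, here is a minimal sketch of one way to make exact duplicates share a cluster ID. It is not the fix proposed in this thread: the helper name `assign_cluster_ids`, the shape of `records` (a dict of record ID to field dict), and the assumption that `clustered_dupes` is an iterable of `(record_ids, scores)` pairs are all illustrative, and the output format of `deduper.match()` may differ in your version of the library.

```python
from collections import defaultdict


def assign_cluster_ids(records, clustered_dupes):
    """Map every record ID to a cluster ID (hypothetical helper).

    records: dict of record_id -> dict of field name -> hashable value.
    clustered_dupes: iterable of (record_ids, scores) pairs; this shape
        is an assumption about what the matcher returns.
    """
    # Group record IDs whose fields are all identical (exact duplicates).
    exact_groups = defaultdict(list)
    for record_id, fields in records.items():
        key = tuple(sorted(fields.items()))
        exact_groups[key].append(record_id)

    # Start from the clusters the matcher did find.
    cluster_of = {}
    next_id = 0
    for record_ids, _scores in clustered_dupes:
        for record_id in record_ids:
            cluster_of[record_id] = next_id
        next_id += 1

    # Give each exact-duplicate group one shared cluster ID: reuse the ID
    # of any member that is already clustered, otherwise allocate a new one.
    # Members that already have an ID keep it; conflicting clusters are
    # not merged in this sketch.
    for ids in exact_groups.values():
        known = [cluster_of[i] for i in ids if i in cluster_of]
        if known:
            cluster_id = known[0]
        else:
            cluster_id = next_id
            next_id += 1
        for i in ids:
            cluster_of.setdefault(i, cluster_id)
    return cluster_of


# Made-up data: records 0 and 1 are identical but were never matched;
# records 2 and 3 are near duplicates the matcher found; 4 is identical to 2.
records = {
    0: {"name": "A", "title": "x"},
    1: {"name": "A", "title": "x"},
    2: {"name": "B", "title": "y"},
    3: {"name": "B", "title": "z"},
    4: {"name": "B", "title": "y"},
}
clustered_dupes = [((2, 3), (0.9, 0.9))]
cluster_ids = assign_cluster_ids(records, clustered_dupes)
```

With this approach, records that are identical to each other end up in the same cluster even when the matcher returned nothing for them, and a record with no duplicates at all still gets its own single-row cluster.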
Wow! This is such an important issue. I'm very grateful to @dorg-ekrolewicz for calling it out, and especially to @batesmotel34 for offering a fix. I ran this on a file with a little over 12K rows, and the number of rows in what I'm calling "single-row clusters" is ~10000 without the fix and only ~1500 with it. That's about 8500 false negatives that the fix converted into true-duplicate clusters.
In the results shown above, the algorithm does a great job of assigning Cluster ID = 0 to a contact with various title changes, but for some reason it assigns different cluster IDs to identical rows ("Christine Wack" has multiple cluster IDs). Christine's case seems to be the trivial one, so why would we get different cluster IDs there? (The same goes for Tom Baty.)
Any advice or help on where to look is much appreciated.