Cluster ID is different for exact records #88

Open
dorg-ekrolewicz opened this issue Sep 13, 2018 · 2 comments
Comments

dorg-ekrolewicz commented Sep 13, 2018

[Screenshot: dedupe output in which identical rows for "Christine Wack" and "Tom Baty" receive different Cluster IDs]

In the results shown above, the algorithm does a great job of assigning Cluster ID = 0 to a contact with various title changes, but for some reason it assigns different cluster IDs to identical rows ("Christine Wack" has multiple cluster IDs, and the same goes for "Tom Baty"). Christine's case seems to be the trivial one, so why would we get different cluster IDs?

Any advice/help on where to look is much appreciated.


batesmotel34 commented Jun 3, 2019

The current code in csvdedupe.py misses any record that has exact duplicates but no near duplicates: such a record never appears in the clustered_dupes returned from deduper.match().
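
To see why, here is a simplified sketch of the exact-duplicate collapsing step (the function name and return shape are approximations, not the library's exact code):

```python
# Simplified sketch (assumed name and shape) of how csvdedupe collapses
# exact duplicates before calling deduper.match().
def exact_matches(data_d, match_fields):
    unique_d = {}   # parent row id -> record; only these reach deduper.match()
    parents = {}    # parent row id -> row ids of its exact duplicates
    seen = {}       # record fingerprint -> parent row id
    for row_id, record in data_d.items():
        fingerprint = tuple(record[field] for field in match_fields)
        if fingerprint not in seen:
            seen[fingerprint] = row_id
            unique_d[row_id] = record
            parents[row_id] = []
        else:
            parents[seen[fingerprint]].append(row_id)
    return unique_d, parents
```

Only unique_d is passed to deduper.match(), so a parent whose only matches are its exact children never shows up in clustered_dupes, and the expansion loop in the original code below never re-attaches those children.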

Original code:

```python
clustered_dupes = deduper.match(unique_d, threshold)

expanded_clustered_dupes = []
for cluster, scores in clustered_dupes:
    new_cluster = list(cluster)
    new_scores = list(scores)
    for row_id, score in zip(cluster, scores):
        children = parents.get(row_id, [])
        new_cluster.extend(children)
        new_scores.extend([score] * len(children))
    expanded_clustered_dupes.append((new_cluster, new_scores))

clustered_dupes = expanded_clustered_dupes
```

Code with a fix that works locally for the exact duplicates not caught above:

```python
clustered_dupes = deduper.match(unique_d, threshold)

expanded_clustered_dupes = []

# Track every row id that already belongs to a cluster returned by match()
rows_used = set()
for cluster, scores in clustered_dupes:
    new_cluster = list(cluster)
    new_scores = list(scores)
    for row_id, score in zip(cluster, scores):
        children = parents.get(row_id, [])
        new_cluster.extend(children)
        new_scores.extend([score] * len(children))
    expanded_clustered_dupes.append((new_cluster, new_scores))
    rows_used.update(new_cluster)

# Add any parents that match() did not cluster but that do have exact
# duplicates; otherwise they are omitted and counted as non-duplicates.
for row, exact_dups in parents.items():
    if row not in rows_used and exact_dups:
        new_cluster = [row] + list(exact_dups)
        # Exact duplicates are certain matches, so give them full confidence
        new_scores = [1.0] * len(new_cluster)
        expanded_clustered_dupes.append((new_cluster, new_scores))

clustered_dupes = expanded_clustered_dupes
```
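
To sanity-check the expansion-plus-fallback logic, here is a minimal toy run with hypothetical row ids and a faked match() result (not real dedupe output):

```python
# Toy demonstration of the expansion + fallback logic above.
# parents: parent row id -> exact-duplicate row ids (hypothetical data)
parents = {0: [1], 2: [], 3: [4, 5]}

# Pretend deduper.match() clustered parent 0 with parent 2 but never saw 3.
clustered_dupes = [((0, 2), (0.9, 0.9))]

expanded, rows_used = [], set()
for cluster, scores in clustered_dupes:
    new_cluster, new_scores = list(cluster), list(scores)
    for row_id, score in zip(cluster, scores):
        children = parents.get(row_id, [])
        new_cluster.extend(children)
        new_scores.extend([score] * len(children))
    expanded.append((new_cluster, new_scores))
    rows_used.update(new_cluster)

# Fallback: parent 3 was never clustered, but it has exact duplicates.
for row, exact_dups in parents.items():
    if row not in rows_used and exact_dups:
        expanded.append(([row] + exact_dups, [1.0] * (1 + len(exact_dups))))

print(expanded)  # [([0, 2, 1], [0.9, 0.9, 0.9]), ([3, 4, 5], [1.0, 1.0, 1.0])]
```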


zacharysyoung commented Nov 24, 2021

Wow! This is such an important issue. I'm very grateful to @dorg-ekrolewicz for calling it out, and especially to @batesmotel34 for offering a fix.

I ran this on a file with a little over 12K rows: the count of what I'm calling "single-row clusters" was ~10,000 without the fix and only ~1,500 with it...

That's ~8,500 false negatives the fix converted into true duplicate clusters.
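
For anyone who wants to reproduce that measurement, here is a minimal sketch of counting single-row clusters, assuming the deduplicated CSV has a Cluster ID column (as in the screenshot above) and a hypothetical output.csv path:

```python
# Hypothetical check: count clusters that contain exactly one row.
# Assumes a "Cluster ID" column in the csvdedupe output CSV.
import csv
from collections import Counter

def count_single_row_clusters(path):
    with open(path, newline="") as f:
        sizes = Counter(row["Cluster ID"] for row in csv.DictReader(f))
    return sum(1 for n in sizes.values() if n == 1)

print(count_single_row_clusters("output.csv"))  # hypothetical file name
```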

wiktorek140 added a commit to wiktorek140/csvdedupe that referenced this issue Mar 11, 2022