Cluster ID is different for exact records #88

Open
dorg-ekrolewicz opened this issue Sep 13, 2018 · 2 comments
Comments

dorg-ekrolewicz commented Sep 13, 2018

[Screenshot: dedupe output in which identical rows for "Christine Wack" and "Tom Baty" receive different Cluster IDs]

In the results shown above, the algorithm does a great job of assigning Cluster ID = 0 to a contact with various title changes, but for some reason it assigns different cluster IDs to identical rows ("Christine Wack" has multiple cluster IDs, and the same goes for "Tom Baty"). Christine's case seems to be the trivial one, so why would we get different cluster IDs?

Any advice/help on where to look is much appreciated.


batesmotel34 commented Jun 3, 2019

The current code in csvdedupe.py misses any record that has exact duplicates but no near duplicates: such a record never appears in the clustered_dupes returned from deduper.match().
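
To see why, here is a simplified sketch of the exact-duplicate collapsing step (the function name and return shape are approximations, not the library's exact code):

```python
# Simplified sketch (assumed name and shape) of how csvdedupe collapses
# exact duplicates before calling deduper.match().
def exact_matches(data_d, match_fields):
    unique_d = {}   # parent row id -> record; only these reach deduper.match()
    parents = {}    # parent row id -> row ids of its exact duplicates
    seen = {}       # record fingerprint -> parent row id
    for row_id, record in data_d.items():
        fingerprint = tuple(record[field] for field in match_fields)
        if fingerprint not in seen:
            seen[fingerprint] = row_id
            unique_d[row_id] = record
            parents[row_id] = []
        else:
            parents[seen[fingerprint]].append(row_id)
    return unique_d, parents
```

Only unique_d is passed to deduper.match(), so a parent whose only matches are its exact children never shows up in clustered_dupes, and the expansion loop in the original code below never re-attaches those children.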

Original code:

```python
clustered_dupes = deduper.match(unique_d, threshold)

expanded_clustered_dupes = []
for cluster, scores in clustered_dupes:
    new_cluster = list(cluster)
    new_scores = list(scores)
    for row_id, score in zip(cluster, scores):
        children = parents.get(row_id, [])
        new_cluster.extend(children)
        new_scores.extend([score] * len(children))
    expanded_clustered_dupes.append((new_cluster, new_scores))

clustered_dupes = expanded_clustered_dupes
```

Code with a fix that works locally for the exact duplicates not caught above:

```python
clustered_dupes = deduper.match(unique_d, threshold)

expanded_clustered_dupes = []

# Track every row id that already belongs to a cluster returned by match()
rows_used = set()
for cluster, scores in clustered_dupes:
    new_cluster = list(cluster)
    new_scores = list(scores)
    for row_id, score in zip(cluster, scores):
        children = parents.get(row_id, [])
        new_cluster.extend(children)
        new_scores.extend([score] * len(children))
    expanded_clustered_dupes.append((new_cluster, new_scores))
    rows_used.update(new_cluster)

# Add any parents that match() did not cluster but that do have exact
# duplicates; otherwise they are omitted and counted as non-duplicates.
for row, exact_dups in parents.items():
    if row not in rows_used and exact_dups:
        new_cluster = [row] + list(exact_dups)
        # Exact duplicates are certain matches, so give them full confidence
        new_scores = [1.0] * len(new_cluster)
        expanded_clustered_dupes.append((new_cluster, new_scores))

clustered_dupes = expanded_clustered_dupes
```
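
To sanity-check the expansion-plus-fallback logic, here is a minimal toy run with hypothetical row ids and a faked match() result (not real dedupe output):

```python
# Toy demonstration of the expansion + fallback logic above.
# parents: parent row id -> exact-duplicate row ids (hypothetical data)
parents = {0: [1], 2: [], 3: [4, 5]}

# Pretend deduper.match() clustered parent 0 with parent 2 but never saw 3.
clustered_dupes = [((0, 2), (0.9, 0.9))]

expanded, rows_used = [], set()
for cluster, scores in clustered_dupes:
    new_cluster, new_scores = list(cluster), list(scores)
    for row_id, score in zip(cluster, scores):
        children = parents.get(row_id, [])
        new_cluster.extend(children)
        new_scores.extend([score] * len(children))
    expanded.append((new_cluster, new_scores))
    rows_used.update(new_cluster)

# Fallback: parent 3 was never clustered, but it has exact duplicates.
for row, exact_dups in parents.items():
    if row not in rows_used and exact_dups:
        expanded.append(([row] + exact_dups, [1.0] * (1 + len(exact_dups))))

print(expanded)  # [([0, 2, 1], [0.9, 0.9, 0.9]), ([3, 4, 5], [1.0, 1.0, 1.0])]
```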


zacharysyoung commented Nov 24, 2021

Wow! This is such an important issue. I'm very grateful to @dorg-ekrolewicz for calling it out, and especially to @batesmotel34 for offering a fix.

I ran this on a file with a little over 12K rows: the count of what I'm calling "single-row clusters" was ~10,000 without the fix and only ~1,500 with it...

That's ~8,500 false negatives the fix converted into true duplicate clusters.
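
For anyone who wants to reproduce that measurement, here is a minimal sketch of counting single-row clusters, assuming the deduplicated CSV has a Cluster ID column (as in the screenshot above) and a hypothetical output.csv path:

```python
# Hypothetical check: count clusters that contain exactly one row.
# Assumes a "Cluster ID" column in the csvdedupe output CSV.
import csv
from collections import Counter

def count_single_row_clusters(path):
    with open(path, newline="") as f:
        sizes = Counter(row["Cluster ID"] for row in csv.DictReader(f))
    return sum(1 for n in sizes.values() if n == 1)

print(count_single_row_clusters("output.csv"))  # hypothetical file name
```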

wiktorek140 added a commit to wiktorek140/csvdedupe that referenced this issue Mar 11, 2022