ECMClassifier returns almost all candidate pairs #193

Open
Evnsn opened this issue May 24, 2023 · 2 comments

Evnsn commented May 24, 2023

import recordlinkage
from recordlinkage.index import Block
from recordlinkage.compare import String
from recordlinkage.datasets import load_febrl3

df, true_links = load_febrl3(True)

# Generate candidate pairs
indexer = recordlinkage.Index([
    Block("date_of_birth")
])

candidate_pairs = indexer.index(df)

print(len(candidate_pairs)) # Returns 5966

# Generate comparison vectors
comparer = recordlinkage.Compare([
    String("given_name", "given_name", method="jarowinkler", label="given_name"),
    String("surname", "surname", method="jarowinkler", label="surname"),
    String("soc_sec_id", "soc_sec_id", method="jarowinkler", label="soc_sec_id"),
    String("address_1", "address_1", method="jarowinkler", label="address_1"),
])

comparison_vector = comparer.compute(candidate_pairs, df)

# Match entities
ecm = recordlinkage.ECMClassifier(binarize=0.1)

pred_links = ecm.fit_predict(comparison_vector)

print(len(pred_links)) # Returns 5836

I attempted to replicate my problem in the code snippet above. There are 5966 candidate pairs and my ECM classifier returns 5836 of them as matches.

Problem: I want to use ECMClassifier for entity matching. However, when I apply it to my own dataset, all the candidate pairs are identified as matches, which is unfortunate.

Is there some parameter I can set to tweak the threshold for match vs non-match, or am I missing something else here?

konsbn commented Jun 15, 2023

I think the binarize threshold is too low: at 0.1 you are converting almost every feature in the comparison vectors to 1, so nearly every pair looks like a match. Try increasing the binarize threshold.
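
To illustrate the point about the threshold, here is a minimal pure-Python sketch of what binarizing does to a comparison vector. It assumes that scores strictly greater than the threshold become 1 and everything else becomes 0 (the convention scikit-learn's `binarize` uses); the `binarize` helper below and the similarity values are hypothetical, made up for illustration and not taken from the febrl3 data.

```python
def binarize(vectors, threshold):
    """Map each score to 1 if it is strictly greater than threshold, else 0."""
    return [[1 if score > threshold else 0 for score in row] for row in vectors]


# Hypothetical Jaro-Winkler scores for
# (given_name, surname, soc_sec_id, address_1). Illustrative values only.
similarity_vectors = [
    [1.00, 0.95, 1.00, 0.90],  # looks like a true match
    [0.45, 0.50, 0.55, 0.40],  # unrelated records that still share characters
    [0.00, 0.35, 0.48, 0.52],  # unrelated records, one field with no overlap
]

for threshold in (0.1, 0.85):
    print(threshold, binarize(similarity_vectors, threshold))
```

With these made-up values, a threshold of 0.1 turns the two unrelated rows into almost all ones, so they become nearly indistinguishable from the matching row, while 0.85 keeps them apart. Whether the same holds for a real dataset depends on how its similarity scores are distributed.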

Evnsn commented Jun 27, 2023

Thank you for the suggestion. Unfortunately, it does not seem to make any significant difference. I tried both lowering and increasing the threshold.
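
One way to quantify whether a threshold change makes a difference is to score the predicted pairs against the known true links, such as the `true_links` returned by `load_febrl3(True)` in the snippet from the first comment. The sketch below is a plain-Python illustration: `precision_recall` is a hypothetical helper, the record identifiers are invented, and it assumes each link is a pair of record identifiers whose order does not matter.

```python
def precision_recall(true_pairs, predicted_pairs):
    """Return (precision, recall) for predicted pairs against true pairs.

    Pairs are treated as unordered, so ("a", "b") equals ("b", "a").
    """
    true_set = {frozenset(pair) for pair in true_pairs}
    pred_set = {frozenset(pair) for pair in predicted_pairs}
    true_positives = len(true_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    return precision, recall


# Hypothetical record identifiers, for illustration only.
true_pairs = [("a1", "a2"), ("b1", "b2")]
predicted_pairs = [("a2", "a1"), ("b1", "b2"), ("a1", "b1"), ("a2", "b2")]

precision, recall = precision_recall(true_pairs, predicted_pairs)
print(precision, recall)
```

A classifier that labels nearly every candidate pair as a match would be expected to show high recall and low precision under this measure. Iterating over a pandas `MultiIndex` yields tuples, so a two-level index of links can be passed to the helper after converting it with `list(...)`.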
