Blocking as a feature for scoring #1103

Open
fgregg opened this issue Sep 24, 2022 · 1 comment

Comments

@fgregg
Contributor

fgregg commented Sep 24, 2022

Right now, blocking and scoring are two distinct phases.

All the information about how two records came to be blocked together is unused by the scorer. This is a bit silly, as the fact that two records are blocked together by multiple predicates could be a pretty good indicator of co-reference.

I'm not really clear on the best way to take advantage of blocking information in scoring, though.

A few ideas:

  1. Ensemble model: treat each blocking predicate as a classifier, and put them in an ensemble with the scorer.
  2. Blocking as a feature: add dummy features indicating which predicate rules cover a pair. These features get fed into the scorer.

In both cases, I'm not quite sure how to set up the training.
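To make idea 2 concrete, here is a minimal sketch of what the dummy features could look like. It is not dedupe's API: `first_three_chars`, `whole_field`, `coverage_features`, and `pair_features` are hypothetical names, and the predicates are toy stand-ins that map a record to a set of block keys. It only shows how the feature vector would be built and leaves the training question open.

```python
from typing import Callable, Dict, List, Sequence, Set

Record = Dict[str, str]
# Hypothetical predicate type: maps a record to the set of block keys it falls in.
Predicate = Callable[[Record], Set[str]]


def first_three_chars(field: str) -> Predicate:
    """Toy predicate: block on the first three characters of a field."""
    def predicate(record: Record) -> Set[str]:
        value = record.get(field, "").strip().lower()
        return {value[:3]} if value else set()
    return predicate


def whole_field(field: str) -> Predicate:
    """Toy predicate: block on the full, normalized field value."""
    def predicate(record: Record) -> Set[str]:
        value = record.get(field, "").strip().lower()
        return {value} if value else set()
    return predicate


def coverage_features(
    a: Record, b: Record, predicates: Sequence[Predicate]
) -> List[float]:
    """One dummy feature per predicate: 1.0 if both records share a block key."""
    return [1.0 if p(a) & p(b) else 0.0 for p in predicates]


def pair_features(
    a: Record,
    b: Record,
    distances: Sequence[float],
    predicates: Sequence[Predicate],
) -> List[float]:
    """Append the coverage dummies to the scorer's existing distance features."""
    return list(distances) + coverage_features(a, b, predicates)


# Example pair and predicate set (made-up data for illustration).
a = {"name": "Robert Smith", "city": "Chicago"}
b = {"name": "Rob Smith", "city": "chicago"}
predicates = [first_three_chars("name"), whole_field("name"), whole_field("city")]

features = pair_features(a, b, [0.25], predicates)
print(features)  # [0.25, 1.0, 0.0, 1.0]
```

The resulting vector could be passed to whatever classifier the scorer already uses. One thing to watch for: every candidate pair that reaches the scorer is covered by at least one predicate, so the dummies only carry information about which predicates fired and how many.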

@NickCrews
Contributor

Splink uses something very similar to method 2. See https://youtu.be/msz3T741KQI?t=2035 for a nice explanation of how they think about the different "types" of comparisons that can happen. I thought the whole video had some other great thoughts and visualizations too.
