[FEAT] Internally estimate probabilities for blocking-rule-related comparisons to improve EM #2067

Open
samkodes opened this issue Mar 16, 2024 · 0 comments
Labels
enhancement New feature or request

samkodes commented Mar 16, 2024

Is your proposal related to a problem?

Currently, the EMTrainingSession class has the default behaviour of disabling all comparisons used in blocking.
I believe that this is often unnecessary and discards information that may improve EM fitting. Specifically, this is the case whenever the blocking rule does not entail a specific match level.

For example, suppose we have a birthdate field and create one training block on year of birth. Within this block it is still informative to distinguish between an exact birthdate match and an inexact match. While we would not want to save the estimated m-probabilities for the birthdate field to our model (because they are conditional on the blocking rule), estimating birthdate m-probabilities in this training session may still improve the EM estimates for other parameters by affecting the overall match probability estimates. (Note that even if the assumption of field independence conditional on match status holds, there is typically still unconditional dependence between fields, which is the problem here.)

Similarly, we may block on a name's initial or part of a postal code, and find value in distinguishing exact name or exact postal code matches during EM.
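To make the birthdate example concrete, here is a minimal sketch of how an extra comparison changes the match probability of a pair inside a year-of-birth block. It uses the standard Fellegi-Sunter odds update; every number (the prior, the surname m/u values, and the birthdate m/u values conditional on the block) is an illustrative assumption, not an estimate from real data or from Splink.

```python
def posterior_match_probability(prior, bayes_factors):
    """Combine a prior match probability with per-comparison Bayes factors (m / u)."""
    odds = prior / (1.0 - prior)
    for bayes_factor in bayes_factors:
        odds *= bayes_factor
    return odds / (1.0 + odds)


# Assumed values, for illustration only.
prior_in_block = 0.1                 # share of true matches among pairs in the block
surname_exact_bf = 0.9 / 0.05        # m / u for an exact surname match

# Birthdate m and u *conditional on sharing a birth year* (assumed).
m_dob_exact = 0.95
u_dob_exact = 1.0 / 365.0
dob_exact_bf = m_dob_exact / u_dob_exact
dob_inexact_bf = (1.0 - m_dob_exact) / (1.0 - u_dob_exact)

# Birthdate comparison deactivated: only surname informs the pair.
p_surname_only = posterior_match_probability(prior_in_block, [surname_exact_bf])

# Birthdate comparison kept active inside the block.
p_exact = posterior_match_probability(prior_in_block, [surname_exact_bf, dob_exact_bf])
p_inexact = posterior_match_probability(prior_in_block, [surname_exact_bf, dob_inexact_bf])

print(p_surname_only, p_exact, p_inexact)
```

With the comparison deactivated, both kinds of pair receive the same match probability. With it active, they are pushed in opposite directions, which in turn changes the weights that EM uses when re-estimating the other comparisons.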

While the current implementation allows the user to specify comparisons to turn off (via the "comparisons_to_deactivate" parameter), passing an empty list is interpreted as disabling all comparisons used in the blocking rule (because 'not []' evaluates to True in the relevant section of EMTrainingSession.__init__()). Moreover, even if we could include all comparisons this way, we would not want comparisons related to the blocking rule to be saved back to the model.
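The truthiness pitfall can be reproduced in isolation. The function below is a hypothetical stand-in for the check described above, not the actual Splink source; its name and arguments are assumptions made for illustration.

```python
def resolve_deactivated_current(comparisons_to_deactivate, blocking_comparisons):
    """Hypothetical stand-in for the truthiness-based check described in this issue."""
    if not comparisons_to_deactivate:
        # Reached for None AND for [], so an explicit empty list is
        # indistinguishable from "argument not supplied".
        return list(blocking_comparisons)
    return list(comparisons_to_deactivate)


blocking = ["birthdate"]
default_result = resolve_deactivated_current(None, blocking)
empty_list_result = resolve_deactivated_current([], blocking)

# Both calls deactivate the blocking-rule comparison.
print(default_result, empty_list_result)
```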

Describe the solution you'd like

The parameter "comparisons_to_deactivate" should distinguish between the default value of None and a user-supplied value of an empty list ([]). The latter should mean "do not deactivate any comparisons", whereas "None" should mean "deactivate all comparisons related to blocking rules."
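A minimal sketch of the proposed semantics, using an explicit `is None` test. The function name and arguments are hypothetical and are not taken from the Splink codebase.

```python
def resolve_deactivated_proposed(comparisons_to_deactivate, blocking_comparisons):
    """Hypothetical sketch: None means 'use the default', [] means 'deactivate nothing'."""
    if comparisons_to_deactivate is None:
        # Default: deactivate every comparison related to the blocking rule.
        return list(blocking_comparisons)
    # Any user-supplied list, including an empty one, is taken literally.
    return list(comparisons_to_deactivate)


blocking = ["birthdate"]
default_result = resolve_deactivated_proposed(None, blocking)
empty_list_result = resolve_deactivated_proposed([], blocking)
explicit_result = resolve_deactivated_proposed(["postcode"], blocking)

print(default_result, empty_list_result, explicit_result)
```

Note that this changes the meaning of an explicit `[]` for any caller who currently passes one, which is the backwards-compatibility concern.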

Separate logic will be required to avoid subsequently merging estimates for columns used in the blocking rules into the main linker's model. Unless there is a need to specify these manually, enforcing the default behaviour of not merging any comparison with a column used in the blocking rules makes sense.

Internally, the code will need to distinguish between comparisons suppressed for this training session and comparisons whose estimates we do not want to save to the global model.
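One way to keep the two notions separate is to track them as two sets. The sketch below is an assumption about how this could be organised; `SessionPlan` and `plan_session` are hypothetical names and do not exist in Splink.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionPlan:
    estimated_in_session: frozenset  # comparisons that take part in EM
    merged_into_model: frozenset     # comparisons whose estimates are saved back


def plan_session(all_comparisons, blocking_comparisons, comparisons_to_deactivate=None):
    """Hypothetical helper separating 'deactivated' from 'not saved to the model'."""
    blocking = frozenset(blocking_comparisons)
    if comparisons_to_deactivate is None:
        deactivated = blocking  # default behaviour
    else:
        deactivated = frozenset(comparisons_to_deactivate)

    estimated = frozenset(all_comparisons) - deactivated
    # Estimates for blocking-related comparisons are conditional on the
    # blocking rule, so they are never merged into the main model.
    merged = estimated - blocking
    return SessionPlan(estimated, merged)


plan = plan_session(
    all_comparisons=["name", "birthdate", "postcode"],
    blocking_comparisons=["birthdate"],
    comparisons_to_deactivate=[],
)
print(sorted(plan.estimated_in_session), sorted(plan.merged_into_model))
```

Here birthdate is estimated during the session but excluded from the set that is merged back.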

It will also be necessary to force u-estimation on for comparisons related to the blocking rule, since their u-probabilities will in general be affected by the blocking rule.
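A toy illustration of why the u-probabilities shift: among non-matching pairs, the share that agree exactly on birthdate is much higher once pairs are restricted to the same birth year. The records below are made up for illustration, and the function is a hypothetical brute-force count rather than how Splink estimates u.

```python
from itertools import combinations

# (entity_id, birthdate) pairs; every record is a distinct entity,
# so every pair of records is a true non-match. Made-up data.
records = [
    (1, "1980-01-01"),
    (2, "1980-01-01"),
    (3, "1980-05-05"),
    (4, "1990-02-02"),
    (5, "1990-03-03"),
    (6, "1975-07-07"),
]


def u_exact_birthdate(records, same_year_only):
    """Share of non-matching pairs that agree exactly on birthdate."""
    agree = 0
    total = 0
    for (id_a, dob_a), (id_b, dob_b) in combinations(records, 2):
        if id_a == id_b:
            continue  # skip true matches
        if same_year_only and dob_a[:4] != dob_b[:4]:
            continue  # pair would not be generated by the year-of-birth block
        total += 1
        agree += dob_a == dob_b
    return agree / total


u_all_pairs = u_exact_birthdate(records, same_year_only=False)
u_within_block = u_exact_birthdate(records, same_year_only=True)
print(u_all_pairs, u_within_block)
```

A u value estimated over all pairs would therefore understate how often non-matches agree on birthdate inside the block, which is why it would need to be re-estimated for the session.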

Preserving backwards-compatibility for the behaviour of "comparisons_to_deactivate" will require a bit of thought, if that is a priority.

Describe alternatives you've considered

Additional context

(This problem came up while exploring a test implementation of #2030; variables I wanted to use to label cases using a semi-supervised approach were stripped out because they were used for blocking.)

samkodes added the enhancement label Mar 16, 2024
samkodes changed the title from "[FEAT] Internally estimate probabilities for blocking-rule related comparisons to improve EM" to "[FEAT] Internally estimate probabilities for blocking-rule-related comparisons to improve EM" Mar 16, 2024