Can't sensibly supply comparison_levels_to_reverse_blocking_rule in Splink 4 #2016

Open
ADBond opened this issue Mar 1, 2024 · 1 comment

ADBond commented Mar 1, 2024

In linker.estimate_parameters_using_expectation_maximisation there is an option to manually supply comparison_levels_to_reverse_blocking_rule, which takes ComparisonLevel objects. However, in Splink 4 most users won't deal with these objects directly; instead they use ComparisonLevelCreator objects, which build them behind the scenes.

Right now, a user would have to do something like this:

...
linker = Linker(df, settings, db_api)
linker.estimate_parameters_using_expectation_maximisation(
    "l.postcode = r.postcode",
    comparison_levels_to_reverse_blocking_rule=[linker._settings_obj.comparisons[0].comparison_levels[2], ...]
)

My proposal is to give each ComparisonLevel a (unique) name that we can use to refer to it; this would be created systematically if not user-supplied. Each comparison level would then have a fully qualified unique name in the format "{comparison_name}.{comparison_level_name}".
We already have output_column_name for Comparison, which works this way, but I wonder whether we should also add a name there for consistency (though maybe that just complicates things, idk).

With this the above snippet would be something like:

...
linker = Linker(df, settings, db_api)
linker.estimate_parameters_using_expectation_maximisation(
    "l.postcode = r.postcode",
    comparison_levels_to_reverse_blocking_rule=["location.exact_match", ...]
)

This would also mean we can use this to get levels/comparisons directly as we may sometimes wish to do, without needing to go via gamma-values (and remember the numbering scheme).
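As a rough illustration of the naming scheme proposed above, here is a minimal, self-contained sketch. It does not use Splink's real classes: `Level`, `Comparison`, `qualified_level_names` and the fallback name `level_{index}` are all hypothetical stand-ins, chosen only to show how systematically generated names could map back to level objects.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Level:
    """Hypothetical stand-in for a comparison level (not Splink's class)."""
    description: str
    name: Optional[str] = None  # user-supplied name, if any


@dataclass
class Comparison:
    """Hypothetical stand-in for a comparison (not Splink's class)."""
    name: str
    levels: List[Level]


def qualified_level_names(comparisons: List[Comparison]) -> Dict[str, Level]:
    """Map '{comparison_name}.{comparison_level_name}' to level objects."""
    lookup: Dict[str, Level] = {}
    for comparison in comparisons:
        for index, level in enumerate(comparison.levels):
            # Assumed fallback: generate a name from the level's position.
            level_name = level.name or f"level_{index}"
            key = f"{comparison.name}.{level_name}"
            if key in lookup:
                raise ValueError(f"duplicate comparison level name: {key}")
            lookup[key] = level
    return lookup


location = Comparison(
    "location",
    [Level("null"), Level("exact match", name="exact_match"), Level("else")],
)
lookup = qualified_level_names([location])
exact = lookup["location.exact_match"]  # no gamma-value numbering needed
```

Raising on duplicates is one way to enforce the "fully unique" requirement; a real implementation might instead deduplicate names automatically.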

RobinL commented Mar 4, 2024

Whilst there are some edge cases in which this setting may be useful, I think we might be able to remove it.

When Splink 3 was first written, it was assumed that the user wanted to train lambda (probability_two_random_records_match) using EM. We therefore needed to implement both an upward adjustment to probability_two_random_records_match for training, and then reverse this back out to estimate lambda. We no longer advise this and instead suggest the use of linker.estimate_probability_two_random_records_match.

In terms of the high level purpose:

  • We have a global probability_two_random_records_match
  • When EM training, we need a probability_two_random_records_match specific to the blocking rule, which is much higher than the global probability_two_random_records_match
  • We allow probability_two_random_records_match to vary during EM training but then throw away the final value.
  • But it's desirable for the starting value for probability_two_random_records_match to be close to the true value, so some adjustment is merited
  • Assuming conditional independence we can work out the upward adjustment from the global probability_two_random_records_match by looking at the u parameter on exact match
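The upward adjustment described in the bullets above can be sketched with Bayes' rule in odds form. This is an illustration of the idea, not Splink's actual code: the function name and the example m and u values are assumptions, and the m probability is included for completeness (when m is close to 1 the adjustment is driven almost entirely by u).

```python
from typing import Iterable, Tuple


def blocked_match_probability(
    global_probability: float,
    exact_match_levels: Iterable[Tuple[float, float]],
) -> float:
    """Estimate P(match) among pairs that satisfy the blocking rule.

    exact_match_levels holds one (m, u) pair per column the blocking rule
    requires to match exactly. Multiplying the Bayes factors together is
    where the conditional independence assumption enters.
    """
    odds = global_probability / (1 - global_probability)
    for m, u in exact_match_levels:
        odds *= m / u  # Bayes factor for an exact match on this column
    return odds / (1 + odds)


# Illustrative numbers only: global probability 0.001, and a postcode
# exact-match level with m = 0.9 and u = 0.01.
adjusted = blocked_match_probability(0.001, [(0.9, 0.01)])
```

With these made-up inputs the starting value rises from 0.001 to roughly 0.083, which shows why the blocking-rule-specific probability is much higher than the global one.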

But I'm starting to wonder whether we can get to this in a better and simpler way: namely, simply computing the reduction in the number of comparisons that results from the blocking rule, i.e. how many comparisons there are with no blocking rule vs how many there are under the EM training blocking rule.

This methodology would also get around the fact that the current approach assumes conditional independence, e.g. it looks separately for an exact match on first name and surname and multiplies them, but in reality these are correlated.
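A minimal sketch of the count-based alternative might look like the following. The function name is hypothetical, and it makes one loud assumption: that every true match satisfies the blocking rule, so the result is an upper bound on the starting value rather than an exact figure. No m or u parameters are involved, so no conditional independence assumption is needed.

```python
def starting_probability_from_counts(
    global_probability: float,
    total_comparisons: int,
    blocked_comparisons: int,
) -> float:
    """Scale the global match probability by the reduction in comparisons.

    Assumes all true matches survive the blocking rule, so the expected
    number of matches is unchanged while the number of pairs shrinks.
    """
    if not 0 < blocked_comparisons <= total_comparisons:
        raise ValueError("blocked_comparisons must be in (0, total_comparisons]")
    expected_matches = global_probability * total_comparisons
    # Cap at 1.0 in case the blocking rule is tighter than the match rate.
    return min(1.0, expected_matches / blocked_comparisons)


# Illustrative numbers only: 1,000,000 unblocked comparisons reduced to
# 10,000 by the EM training blocking rule.
start = starting_probability_from_counts(0.001, 1_000_000, 10_000)
```

Because the final value of probability_two_random_records_match is thrown away after EM training, an approximate starting point like this may be good enough.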
