
(WIP) Blocking api redesign #2136

Closed
wants to merge 22 commits into splink4_dev from blocking_api_redesign

Conversation

@RobinL (Member) commented Apr 7, 2024

Summary

This PR redesigns the user-facing API for the analysis of blocking rules.

  • It allows the analysis of blocking rules to be conducted without the user needing to have constructed a settings object.
  • It consolidates the existing five functions into a small number of user-facing functions without reducing functionality

Existing functions

At the moment we have the following functions for analysing blocking rules:

  1. linker.count_num_comparisons_from_blocking_rule
  2. linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions
  3. linker.cumulative_comparisons_from_blocking_rules_records
  4. linker.cumulative_num_comparisons_from_blocking_rules_chart
  5. linker.count_num_comparisons_from_blocking_rules_for_prediction

Functions 1 and 2 have the same objective, but they differ in the definition and speed of the computation and in the result; see here.

Functions 3, 4 and 5 all perform the same underlying analysis; they vary only in how the results are presented.

Proposal

  • Remove functions from linker and make them available from splink.blocking_analysis
  • Combine 1 and 2 into a single function count_comparisons_from_blocking_rule
  • Combine 3, 4 and 5 into two functions:
    • cumulative_comparisons_from_blocking_rules_chart
    • cumulative_comparisons_from_blocking_rules_records(include_total: bool)
```python
from splink.blocking_analysis import (
    count_comparisons_from_blocking_rule,
    cumulative_comparisons_from_blocking_rules_chart,
    cumulative_comparisons_from_blocking_rules_records,
)

# method can be 'fast_estimate' or 'slow_precise'
count_comparisons_from_blocking_rule(br, database_api, method="fast_estimate", max_threshold=1e8)

cumulative_comparisons_from_blocking_rules_chart(threshold_for_exact_analysis=1e8)
cumulative_comparisons_from_blocking_rules_records(threshold_for_exact_analysis=1e8)
```

Discussion

There are two areas of complexity with analysis of blocking rules:

  • Interactions between blocking rules: when we have multiple blocking rules, some will generate the same comparisons. As a result, when we analyse multiple blocking rules, we're usually interested in the cumulative number of comparisons after deduplication.
  • We have a 'fast/estimate' and a 'slow/precise' method for counting the number of comparisons generated by a single blocking rule. Only the 'slow/precise' one can detect duplicates across blocking rules, so only it can be used to compute cumulative comparisons.

For more details on the 'fast/estimate' technique, see here.

One reason the 'fast/estimate' method is important is that a key objective of blocking rule analysis is to reject blocking rules that generate too many comparisons to be computationally feasible (e.g. block_on("first_name") could generate ~1 trillion comparisons). But if you use the 'slow/precise' methodology to detect this, the computation will never end. So even when the 'slow/precise' method is requested, it's generally desirable to first 'test' everything with the 'fast/estimate' method, to ensure the user isn't asking for a computation that will never complete.
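To make that workflow concrete, here is a minimal sketch of the guard pattern, using the API proposed above. The shape of the returned result (a dict containing a count) is an assumption for illustration only:

```python
from splink.blocking_analysis import count_comparisons_from_blocking_rule

MAX_COMPARISONS = 1e8

# Cheap pass first: reject obviously-too-loose rules before any exact count
estimate = count_comparisons_from_blocking_rule(
    br, database_api, method="fast_estimate", max_threshold=MAX_COMPARISONS
)
if estimate["count"] > MAX_COMPARISONS:  # result shape is illustrative
    raise ValueError(
        "Blocking rule would generate too many comparisons; "
        "tighten it before running the 'slow_precise' analysis"
    )

# Only now pay for the exact count, which can also deduplicate across rules
exact = count_comparisons_from_blocking_rule(
    br, database_api, method="slow_precise", max_threshold=MAX_COMPARISONS
)
```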

Initial work

I've made an initial attempt that looks like this, but concluded that, before this can go further, we need to allow creating __splink__df_concat without the need for a linker:

```python
def count_comparisons_from_blocking_rules(
    table_or_tables,
    *,
    blocking_rule_or_rules: blocking_rule_or_rules_type,
    link_type: link_type_type,
    database_api: DatabaseAPI,
    method: str = "comparisons_generated",
):
    ...
```
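For illustration, a call to this sketched function might look like the following; the DuckDBAPI import path and the blocking rule syntax are assumptions based on the splink4_dev branch at the time:

```python
import pandas as pd
from splink import DuckDBAPI  # import path assumed

df = pd.DataFrame(
    {
        "unique_id": [1, 2, 3],
        "first_name": ["amy", "amy", "bob"],
    }
)

counts = count_comparisons_from_blocking_rules(
    df,
    blocking_rule_or_rules='l."first_name" = r."first_name"',
    link_type="dedupe_only",
    database_api=DuckDBAPI(),
)
```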

Complexity

In order to compute the cumulative/marginal number of comparisons created by a set of blocking rules, we need to use a linker.

This is because of the complexity of the logic, particularly around exploding blocking rules.

Each blocking rule needs to be 'aware' of the comparisons generated by the previous rules, and in the case of exploding blocking rules, this involves creating a lookup table of the ID-ID pairs that have already been created.
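As a simplified illustration of the deduplication idea (not Splink's actual internals): for ordinary blocking rules, pairs generated by earlier rules can be excluded with AND NOT conditions on the join; exploding rules are harder because their pairs must be materialised into a lookup table first:

```python
# Simplified sketch: count *marginal* comparisons per rule by excluding pairs
# already generated by earlier rules. Table and column names are illustrative.
blocking_rules = [
    'l."first_name" = r."first_name"',
    'l."surname" = r."surname"',
]

previous_rules: list[str] = []
for rule in blocking_rules:
    exclusions = " AND ".join(f"NOT ({p})" for p in previous_rules)
    on_clause = f"{rule} AND {exclusions}" if previous_rules else rule
    sql = f"""
    SELECT count(*) AS marginal_comparisons
    FROM input_table AS l
    INNER JOIN input_table AS r
      ON {on_clause}
    WHERE l."unique_id" < r."unique_id"
    """
    print(sql)
    previous_rules.append(rule)
```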

As a result, we end up in a less-than-ideal situation: when writing a count_comparisons_to_be_scored_from_blocking_rules function that exists independently of the main linker, we end up needing to create a linker under the hood so the SQL is correctly generated.
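A hedged sketch of that 'linker under the hood' pattern is below; the Linker constructor signature and the settings keys are assumptions based on the splink4_dev branch, not confirmed by this PR:

```python
from splink import Linker  # import path assumed

def count_comparisons_to_be_scored_from_blocking_rules(
    table, blocking_rules, link_type, database_api
):
    # Build a throwaway settings dict purely so a Linker can be constructed;
    # the user never sees it. Whether an empty comparisons list passes
    # settings validation is untested here.
    settings = {
        "link_type": link_type,
        "blocking_rules_to_generate_predictions": blocking_rules,
        "comparisons": [],
    }
    # Constructor signature assumed; the linker exists only to drive the
    # existing (linker-dependent) SQL generation for blocking
    linker = Linker(table, settings, database_api)
    return linker.cumulative_comparisons_from_blocking_rules_records(blocking_rules)
```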

TODO/Check

- [ ] Tidy up logic for making BlockingRule/BlockingRuleCreator uniform; add type hints
- [ ] Move internal logic to splink.internals
- [ ] Check both the pure function and its use within estimate_probability_two_random_records_match are consistent and correct in terms of the count of marginal comparisons, by comparing against Splink 3
- [ ] Check it works with Spark and exploding blocking rules
- [ ] Check it works with DuckDB irrespective of the type of input data (pandas df, duckdb tablename, etc.)

  • Type hint block_using_rules_sqls correctly and check its arguments make sense
  • Check deterministic_link works correctly with two-dataset link_only
  • Check predict works correctly with two-dataset link_only
  • Check estimate_probability_two_random_records_match works

@RobinL changed the base branch from master to splink4_dev on Apr 7, 2024
@RobinL force-pushed the blocking_api_redesign branch 2 times, most recently from a59041e to dc0bf04, on May 1, 2024
@RobinL (Member Author) commented May 9, 2024

Todo/issues:

  • The blocking rules that go into cumulative_comparisons_to_be_scored_from_blocking_rules_from_linker need to be 'aware' of the other blocking rules, because this determines the match key and the logic for eliminating pairs created by other blocking rules. This is really why we need the linker: exploding rules etc.

  • This needs to work with linker.estimate_probability_two_random_records_match

Separately, we probably want a PR that makes all input data into Splink a SplinkDataFrame, because it's causing problems here

Fundamentally, the issue is that exclude_pairs_generated_by_this_rule_sql on an exploding blocking rule needs a linker, which means that in general the function exclude_pairs_generated_by_this_rule_sql needs a linker, which means we can't do the analysis without a linker.

Unless we want to rewrite how blocking works so that it doesn't need a linker, we have to use a linker.

The current implementation, which uses:

```python
    if blocking_rules:
        brs_as_objs = settings_obj._brs_as_objs(blocking_rules)
    else:
        brs_as_objs = linker._settings_obj._blocking_rules_to_generate_predictions
```

is correct, because settings_obj._brs_as_objs correctly registers the previous blocking rules

@RobinL (Member Author) commented May 14, 2024

Closing in favour of #2180

@RobinL closed this May 14, 2024