
(WIP) Blocking api redesign #2136

Closed
wants to merge 22 commits into splink4_dev from blocking_api_redesign

Conversation

@RobinL (Member) commented Apr 7, 2024

Summary

This PR redesigns the user-facing API for the analysis of blocking rules.

  • It allows the analysis of blocking rules to be conducted without the user needing to have constructed a settings object.
  • It consolidates the existing five functions into a small number of user-facing functions without reducing functionality

Existing functions

At the moment we have the following functions for analysing blocking rules:

  1. linker.count_num_comparisons_from_blocking_rule
  2. linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions
  3. linker.cumulative_comparisons_from_blocking_rules_records
  4. linker.cumulative_num_comparisons_from_blocking_rules_chart
  5. linker.count_num_comparisons_from_blocking_rules_for_prediction

Functions 1 and 2 have the same objective, but they differ in the definition and speed of the computation and in the result; see here.

Functions 3, 4 and 5 all perform the same underlying analysis; they vary only in how the results are presented.

Proposal

  • Remove functions from linker and make them available from splink.blocking_analysis
  • Combine 1 and 2 into a single function count_comparisons_from_blocking_rule
  • Combine 3, 4 and 5 into two functions:
    • cumulative_comparisons_from_blocking_rules_chart
    • cumulative_comparisons_from_blocking_rules_records(include_total: bool)
```python
from splink.blocking_analysis import (
    count_comparisons_from_blocking_rule,
    cumulative_comparisons_from_blocking_rules_chart,
    cumulative_comparisons_from_blocking_rules_records,
)

# method can be 'fast_estimate' or 'slow_precise'
count_comparisons_from_blocking_rule(br, database_api, method="fast_estimate", max_threshold=1e8)

cumulative_comparisons_from_blocking_rules_chart(threshold_for_exact_analysis=1e8)
cumulative_comparisons_from_blocking_rules_records(threshold_for_exact_analysis=1e8)
```

Discussion

There are two areas of complexity with analysis of blocking rules:

  • Interactions between blocking rules: when we have multiple blocking rules, some will generate the same comparisons. As a result, when we analyse multiple blocking rules, we're usually interested in the cumulative number of comparisons after deduplication.
  • We have a 'fast/estimate' and a 'slow/precise' method for counting the number of comparisons generated by a single blocking rule. Only the 'slow/precise' one can detect duplicates across blocking rules, so only it can be used to compute cumulative comparisons.

For more details on the 'fast/estimate' technique, see here.

One reason the 'fast/estimate' method is important is that a key objective of blocking rule analysis is to reject blocking rules that generate too many comparisons to be computationally feasible (e.g. block_on("first_name") could generate ~1 trillion comparisons). But if you use the 'slow/precise' methodology to detect this, the computation will never end. So even when the 'slow/precise' method is requested, it's generally desirable to first 'test' everything with the 'fast/estimate' method, to ensure the user isn't asking for a computation that will never complete.
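To make that workflow concrete, here is a minimal sketch of the guard pattern, using the API proposed above. The shape of the returned result (a dict containing a count) is an assumption for illustration only:

```python
from splink.blocking_analysis import count_comparisons_from_blocking_rule

MAX_COMPARISONS = 1e8

# Cheap pass first: reject obviously-too-loose rules before any exact count
estimate = count_comparisons_from_blocking_rule(
    br, database_api, method="fast_estimate", max_threshold=MAX_COMPARISONS
)
if estimate["count"] > MAX_COMPARISONS:  # result shape is illustrative
    raise ValueError(
        "Blocking rule would generate too many comparisons; "
        "tighten it before running the 'slow_precise' analysis"
    )

# Only now pay for the exact count, which can also deduplicate across rules
exact = count_comparisons_from_blocking_rule(
    br, database_api, method="slow_precise", max_threshold=MAX_COMPARISONS
)
```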

Initial work

I've made an initial attempt that looks like this, but concluded that, before this can go further, we need to allow creating __splink__df_concat without the need for a linker:

```python
def count_comparisons_from_blocking_rules(
    table_or_tables,
    *,
    blocking_rule_or_rules: blocking_rule_or_rules_type,
    link_type: link_type_type,
    database_api: DatabaseAPI,
    method: str = "comparisons_generated",
):
    ...
```
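For illustration, a call to this sketched function might look like the following; the DuckDBAPI import path and the blocking rule syntax are assumptions based on the splink4_dev branch at the time:

```python
import pandas as pd
from splink import DuckDBAPI  # import path assumed

df = pd.DataFrame(
    {
        "unique_id": [1, 2, 3],
        "first_name": ["amy", "amy", "bob"],
    }
)

counts = count_comparisons_from_blocking_rules(
    df,
    blocking_rule_or_rules='l."first_name" = r."first_name"',
    link_type="dedupe_only",
    database_api=DuckDBAPI(),
)
```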

Complexity

In order to compute the cumulative/marginal number of comparisons created by a set of blocking rules, we need to use a linker.

This is because of the complexity of the logic, particularly around exploding blocking rules.

Each blocking rule needs to be 'aware' of the comparisons generated by the previous rules, and in the case of exploding blocking rules, this involves creating a lookup table of the ID-ID pairs that have already been created.
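As a simplified illustration of the deduplication idea (not Splink's actual internals): for ordinary blocking rules, pairs generated by earlier rules can be excluded with AND NOT conditions on the join; exploding rules are harder because their pairs must be materialised into a lookup table first:

```python
# Simplified sketch: count *marginal* comparisons per rule by excluding pairs
# already generated by earlier rules. Table and column names are illustrative.
blocking_rules = [
    'l."first_name" = r."first_name"',
    'l."surname" = r."surname"',
]

previous_rules: list[str] = []
for rule in blocking_rules:
    exclusions = " AND ".join(f"NOT ({p})" for p in previous_rules)
    on_clause = f"{rule} AND {exclusions}" if previous_rules else rule
    sql = f"""
    SELECT count(*) AS marginal_comparisons
    FROM input_table AS l
    INNER JOIN input_table AS r
      ON {on_clause}
    WHERE l."unique_id" < r."unique_id"
    """
    print(sql)
    previous_rules.append(rule)
```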

As a result, we end up in a less-than-ideal situation: when writing a count_comparisons_to_be_scored_from_blocking_rules function that exists independently of the main linker, we end up needing to create a linker under the hood so the SQL is correctly generated.
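A hedged sketch of that 'linker under the hood' pattern is below; the Linker constructor signature and the settings keys are assumptions based on the splink4_dev branch, not confirmed by this PR:

```python
from splink import Linker  # import path assumed

def count_comparisons_to_be_scored_from_blocking_rules(
    table, blocking_rules, link_type, database_api
):
    # Build a throwaway settings dict purely so a Linker can be constructed;
    # the user never sees it. Whether an empty comparisons list passes
    # settings validation is untested here.
    settings = {
        "link_type": link_type,
        "blocking_rules_to_generate_predictions": blocking_rules,
        "comparisons": [],
    }
    # Constructor signature assumed; the linker exists only to drive the
    # existing (linker-dependent) SQL generation for blocking
    linker = Linker(table, settings, database_api)
    return linker.cumulative_comparisons_from_blocking_rules_records(blocking_rules)
```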

TODO/Check

- [ ] Tidy up logic for making BlockingRule/BlockingRuleCreator uniform; add type hints
- [ ] Move internal logic to splink.internals
- [ ] Check both the pure function and its use within estimate_probability_two_random_records_match are consistent and correct in terms of the count of marginal comparisons, by comparing against Splink 3
- [ ] Check it works with Spark and exploding blocking rules
- [ ] Check it works with DuckDB irrespective of the type of input data (pandas df, duckdb tablename, etc.)

  • Type hint block_using_rules_sqls correctly and check its arguments make sense
  • Check deterministic_link works correctly with two-dataset link_only
  • Check predict works correctly with two-dataset link_only
  • Check estimate_probability_two_random_records_match works

@RobinL changed the base branch from master to splink4_dev on Apr 7, 2024
@RobinL force-pushed the blocking_api_redesign branch 2 times, most recently from a59041e to dc0bf04, on May 1, 2024
@RobinL (Member Author) commented May 9, 2024

Todo/issues:

  • The blocking rules that go into cumulative_comparisons_to_be_scored_from_blocking_rules_from_linker need to be 'aware' of the other blocking rules, because this determines the match key and the logic for eliminating pairs created by other blocking rules. This is really why we need the linker: exploding rules etc.

  • This needs to work with linker.estimate_probability_two_random_records_match

Separately, we probably want a PR that makes all input data into Splink a SplinkDataFrame, because it's causing problems here

Fundamentally, the issue is that exclude_pairs_generated_by_this_rule_sql on an exploding blocking rule needs a linker, which means that in general the function exclude_pairs_generated_by_this_rule_sql needs a linker, which means we can't do the analysis without a linker.

Unless we want to rewrite how blocking works so that it doesn't need a linker, we have to use a linker.

The current implementation, which uses:

```python
    if blocking_rules:
        brs_as_objs = settings_obj._brs_as_objs(blocking_rules)
    else:
        brs_as_objs = linker._settings_obj._blocking_rules_to_generate_predictions
```

is correct, because settings_obj._brs_as_objs correctly registers the previous blocking rules

@RobinL (Member Author) commented May 14, 2024

Closing in favour of #2180

@RobinL closed this May 14, 2024