[FEAT] Exact match level required for TF adjustment - can this be avoided? #2006

Open
samkodes opened this issue Feb 28, 2024 · 13 comments
Labels
enhancement New feature or request

Comments

@samkodes
Contributor

samkodes commented Feb 28, 2024

Is your proposal related to a problem?

I have an application where I wish to link dataset A, representing noisy transactions, to dataset B, a master entity list. For dataset B (the master list) I am using arrays to represent multi-valued fields (e.g. known alternate office phone numbers); however dataset A (the transactions) has only single values in the corresponding fields (arrays of length 1). Because office phone numbers can be shared between multiple people, I would like to use term frequency adjustments. I can use dataset B, my master entity list, to calculate a user-supplied TF table for the relevant fields (exploding the multi-valued fields for an accurate count), which will then be applied to the singleton arrays in dataset A. In my application any match between a phone number in dataset A and any of the entries in the corresponding array field in dataset B counts as a match; so, to compare arrays, I am using array intersection functions with any non-zero intersection as my single non-null, non-"else", comparison level.

The problem is that in order to use TF adjustments, this code looks for u-probabilities corresponding to an "exact match" level in the comparison. An "exact match" seems to be identified via this code, which seems to look for exact field comparisons in the SQL. Since I am using array intersections, I do not have an "exact match" level, and an error is thrown - "An exact match level is required to make a term frequency adjustment on a comparison level that is not an exact match."

One alternative is to add an additional exact match level to the model. The exact match will only happen when my multi-valued field in my master list (dataset B) happens to have a single value. However, this produces 2 match levels in my model, with distinct parameter estimates, when I only want one.

After a lot of thinking, I believe I understand why this u-probability is needed. In the term frequency adjustment math here, the reference u-probability is just a presentation convenience - the u-probabilities cancel out, but they are included explicitly so that a term frequency adjustment can be represented as an addition to the weight that would occur without any term frequency adjustment. In the calling code here, the exact-match u-probability is used to create the term frequency adjustment. In the exact matching case, an identical u-probability is calculated somewhere else and cancels out.

However, if the match is inexact, I assume that in general the u-probability calculated elsewhere is for the inexact match. TF adjustments should modify u-probabilities based on the exact field value distribution, regardless of whether the match is exact or not. My reasoning is that value multiplicity in the dataset is the relevant factor: if I have 10 "Sam"s in dataset B, I am 10 times as likely to u-match "Sam" in dataset A, but also 10 times as likely to fuzzy u-match "Spam" in dataset A, conditional on having value "Sam" in dataset B. Either way, something needs to be adjusted by the multiplicity of 10. For exact matching, we multiply the multiplicity by 1/N to get T_x, because for exact matching in the absence of duplicates u_x is ~1/N. The problem is estimating u_x, in the absence of duplicate values, for a fuzzy match - i.e. we don't know in general how many records will be hit by a fuzzy match, and we can't estimate this for every x by random sampling (too many samples!). So instead we use an "inflation factor" of u-inexact/u-exact and multiply it by T_x. That's the behaviour here if the u-probabilities don't cancel.
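To pin down the behaviour I mean, here is a small numerical sketch of my understanding (illustrative numbers only, not Splink code):

import math

m_inexact = 0.1    # m-probability for the fuzzy level (assumed)
u_inexact = 0.02   # average u for the fuzzy level (assumed)
u_exact = 0.005    # average u for the exact match level (assumed)
T_x = 0.001        # term frequency of the value x

# Without a TF adjustment the fuzzy level's weight uses the average u:
weight_avg = math.log2(m_inexact / u_inexact)

# With a TF adjustment, T_x is scaled by the "inflation factor" u_inexact/u_exact,
# which is equivalent to adding log2(u_exact / T_x) to the average weight:
u_x = (u_inexact / u_exact) * T_x
weight_tf = math.log2(m_inexact / u_x)
assert math.isclose(weight_tf, weight_avg + math.log2(u_exact / T_x))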

The question, then, is how Splink could get the u-exact probability if, as in my case, there is no exact match level in the model. One option would be to add extra code to sample u-exact probabilities any time there is a TF adjustment (if not already doing so), and store these separately somewhere for retrieval as needed.

Describe the solution you'd like

Rather than demanding a corresponding exact match level in all comparisons that use TF adjustments, can Splink simply pre-compute and store u-exact probabilities separately whenever TF adjustment is required by a model? This may be justified by the argument above - that the u-inexact/u-exact inflation factor is an approximation that allows TF adjustments to be used for inexact match levels, but that there are domains in which an exact match is not meaningfully different from certain kinds of inexact matches.

EDIT: After rethinking, this approach won't work for my application but may have other uses; see response below.

Describe alternatives you've considered

One alternative is to simply define an additional exact match level. The exact match will only happen when my multi-valued field in my master list (dataset B) happens to have a single value. However, this produces 2 match levels in my model, with distinct parameter estimates, when I only want one (i.e., I have a priori knowledge that these two levels do not provide different levels of match evidence). Moreover, the effective TF adjustments for the 2 match levels are different (because they have different inflation factors), behaviour which I don't want.

Additional context

@samkodes samkodes added the enhancement New feature or request label Feb 28, 2024
@samkodes
Contributor Author

samkodes commented Feb 28, 2024

After sleeping on it, I now think the proposed solution won't work for my application, though it does make sense for fuzzy matches like Damerau-Levenshtein.

In this case I really want u_x = T_x (term frequency), without any approximation by the inflation factor u-inexact/u-exact.

To achieve this, a different and simpler solution makes sense: add a custom setting, say "disable_tf_inflation_factor", on each comparison level that uses TF adjustment (settable by dictionary specification to begin with, though it could be added to the comparison libraries later). When that setting is on, Splink should use u-inexact (rather than seeking u-exact) in the TF adjustment weight to cancel out the u-inexact used elsewhere, effectively setting the inflation factor to 1.

An alternative would be to allow manual specification, again through dictionary specification, of a match level to take the role of the "exact match" level, rather than using SQL parsing. This would allow any level to be used as the reference level for the TF inflation factor. I could imagine this being helpful if I were to add a fuzzy array match implementation as another match level to my model; I would then use the exact array intersection match as the reference level. Perhaps a flag called "tf_inflation_reference_level".
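For concreteness, here is roughly how those proposed flags might look in a dictionary-based comparison level. Neither flag exists in Splink today - they are only the proposal above - and the SQL condition is just an illustration for my array case:

proposed_level = {
    "sql_condition": "array_length(list_intersect(phone_l, phone_r)) > 0",  # illustrative only
    "label_for_charts": "Any phone number in common",
    "tf_adjustment_column": "phone",
    "tf_adjustment_weight": 1.0,
    # Proposed: skip the u-inexact/u-exact inflation factor entirely
    "disable_tf_inflation_factor": True,
    # Alternative proposal: name another level to act as the "exact match"
    # reference for the inflation factor (exact value format to be decided)
    # "tf_inflation_reference_level": "Exact array intersection",
}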

@samkodes
Contributor Author

samkodes commented Mar 2, 2024

I've implemented the second and third approaches above, using custom comparison level settings flags, and set up a pull request. #2020

@RobinL
Member

RobinL commented Mar 3, 2024

Thanks for this and the PR! At a high level:

  • Getting tf adjustments to work for array based columns is a much requested feature
  • We have had various thoughts before about how to do it, but have never settled on a good solution

You can see below that I've had a think about this; here are some initial notes. I ran out of time because it gets pretty complicated and was making my brain hurt a bit. I hope to find more time to think it through.

For now, I think I have most of the key concepts in my head, but I would be interested in talking a bit more about exactly how you want this to work so we can think through options for potential solutions. I'd also like to think about whether there are other ways of getting the functionality you want.

Would you be up for a chat? - feel free to reach out at robinlinacre@hotmail.com.


I'll start by writing a few notes on how term frequency adjustments currently work for clarity (and partly to remind myself!).

Starting from the basics, the u probability is a measure of the likelihood of coincidences occurring whereby information matches amongst non-matching records. For instance, an exact match on first name for two different people.

We can measure an average value for the u probability by taking a large number of non-matching pairs, and calculating how often first name matches.
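For illustration, here's a crude way this could be estimated in pure Python (a sketch only, not Splink's actual sampling implementation):

import random

def estimate_avg_u(first_names, n_samples=100_000, seed=0):
    # Sample random pairs of records; random pairs are overwhelmingly
    # non-matches, so the rate at which first name coincides approximates
    # the average u probability (ignoring the small chance of drawing the
    # same record twice).
    rng = random.Random(seed)
    hits = sum(
        rng.choice(first_names) == rng.choice(first_names)
        for _ in range(n_samples)
    )
    return hits / n_samples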

Let's say that the average u value is 1%.

However, this value does not account for skew - for example, it's much more likely to observe a match on 'John' amongst non-matching records than a match on 'Robin'. Let's say Johns make up 10% of the population and Robins make up 0.1% of the population, so the term-specific u values are 0.1 and 0.001 respectively.

So:

  • taking two non-matching records at random, 1% of the time the first name matches by coincidence
  • but if the first name is John on the first record, and we take a random non-matching record, 10% of the time the first name matches, because the second record is also named John *
  • Is that right? I wasn't quite sure how to phrase this as a conditional probability. You can't condition on both sides of the match being John!!

Current Splink implementation

In Splink we currently measure:

  • average u (1% in the above example)
  • term specific u (10% for John, 0.1% for Robin)

In the case of an exact match, we use the term specific u rather than the average u. As an implementation detail, this is calculated as the average u, with an additional adjustment for the delta between the average u and the term specific u. The reason is to make it easier to represent the term frequency adjustment graphically (John can be shown as a downward adjustment of match weight relative to average, Robin can be shown as an upward adjustment relative to average).
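Using the John/Robin numbers above, that delta can be written down directly on the match weight (log2) scale - illustrative Python:

import math

u_avg = 0.01      # average u for an exact first name match
u_john = 0.10     # term-specific u for John
u_robin = 0.001   # term-specific u for Robin

# Expressed as a delta to the match weight (log2 scale):
adj_john = math.log2(u_avg / u_john)    # about -3.32, a downward adjustment
adj_robin = math.log2(u_avg / u_robin)  # about +3.32, an upward adjustment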

Implementation for fuzzy match levels

In the case of a fuzzy match like levenshtein, it's harder to define a term frequency adjustment because there are two terms. If the match is Jhon vs John, at a levenshtein distance of 1, what term should be used to make the adjustment?

The decision made within Splink is to take the more frequent of the two terms, on the basis that one of them is a typo or mistake and therefore is probably uncommon. So in the case of Jhon vs John, we make the term frequency adjustment on the basis of John.

We can now see a second reason why it's useful to apply the term frequency adjustment as a delta between the average u and the term specific u: the levenshtein level has a different average u value than the exact match level, but we can still apply the adjustment to this alternative u value.

However, the strength of the adjustments can be quite high, and it's unclear whether they strictly apply in the case of a fuzzy match. 'John' vs 'Jhon' is probably a typo; Bryan vs Ryan are just two separate names, so it's much less clear how a term frequency adjustment should apply in that case. So we allow a second parameter, tf_adjustment_weight, between 0 and 1, which reduces the size of the adjustment: 1 = full adjustment, 0 = no adjustment, and you can pick any value in between to 'partially' apply the adjustment.
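As the generated SQL later in this thread shows, this is applied as a Bayes factor of (average exact-match u / term frequency) raised to the power of tf_adjustment_weight, so 0.5 halves the adjustment on the log scale. For example (illustrative numbers):

import math

u_avg_exact = 0.0058        # average u for the exact match level (illustrative)
tf_term = 0.10              # tf of the more common of the two terms (illustrative)
tf_adjustment_weight = 0.5

bf_tf_adj = (u_avg_exact / tf_term) ** tf_adjustment_weight
# Equivalent to applying half of the full adjustment on the log2 scale:
assert math.isclose(math.log2(bf_tf_adj), 0.5 * math.log2(u_avg_exact / tf_term))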

Why do we need to find the exact match level to apply tf adjustments to a fuzzy level?

In order to apply tf adjustments to a fuzzy level we need to know the delta between the average u and the term specific u. More specifically we need values for:

  • The term specific u. This is fairly straightforward because it comes directly from the term frequency table. (Albeit with some logic to pick the most common term from Jhon and John in the table)
  • The average u to form the basis of the delta. Since the term frequency adjustment table is based on 'exact match', the average u value should be on the same basis.

How could this be used for an array based column?

It's a bit unclear how to define this when the size of the intersection of the arrays may be 1 or more. Possibly a bit simpler if you somehow just take the least common term in the array (a rough sketch of that idea follows the list below).

  • It's possible to calculate array based term frequencies and register the table manually, see here

  • But it's not obvious how to compute the delta, because we need to:

    • Perform the tf lookup for each item in the intersection of the array
    • Compare to an average value - a bit unclear how to define the average
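As a very rough illustration of the 'least common term' idea (illustrative Python only, not Splink internals - and it leaves open the question of which average u to use for the delta):

def tf_adjustment_basis(tokens_l, tokens_r, tf_table, default_tf=1e-4):
    # Look up a term frequency for each token in the intersection and take the
    # least common (rarest) as the basis for a TF adjustment. tf_table could be
    # a manually registered term frequency table built from exploded arrays.
    intersection = set(tokens_l) & set(tokens_r)
    if not intersection:
        return None
    return min(tf_table.get(tok, default_tf) for tok in intersection)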

@samkodes
Contributor Author

samkodes commented Mar 3, 2024

Hi Robin - thanks for the detailed reply.

First, I should clarify that my thinking on TF is a lot clearer for the link-only setting, rather than the de-duplicate setting. I think this is because in the de-dupe setting there is no information about the duplicate-generating process, which makes inference using TF (whether a-priori or empirical estimates) involve a lot of assumptions. If duplicate records are far rarer than duplicate values, however, probably a similar analysis will apply???

In a link-only setting it seems natural to condition the analysis for any pair on the value in one of the datasets (e.g. the left). This is because the link-only task can be thought of as "for each record in left, evaluate the probability of match with each record in right". With this perspective, TFs should be calculated using the right dataset only, and we assume that there are no duplicates in the right dataset. This implies no assumptions on the data generating process for the left data set, except that each left record is independent of all others (no 1-1 match is enforced, etc; in other words duplicates in the left, but not right, dataset are allowed). Of course left and right could be swapped, but then the probability model and data-generating process assumptions would change. In other words, it seems to me that the appropriate probability model is inevitably tied to the definition of the linking task; hence my confusion regarding the de-duplication task. (Our typical assumption is that the m-probabilities do not depend on the precise value but instead reflect generic error processes that apply uniformly across the dataset, so they don't have to be brought into the analysis at all)

I prefer to think of the TF adjustment inflation for inexact (or fuzzy) matches as a scaling factor applied to TF (this is what's happening before taking logs and separating out an additive weight adjustment). This makes clear that it is simply the ratio u-inexact (or u-fuzzy)/u-exact.

How this applies to array-based matches depends on what the user wants; again, I don't think there is a general solution.

It is perhaps clearer to start by not thinking about TF "inflation" in this case - rather just try to think about u_x directly and what that should be.

For example, in my case I am looking for any array intersection at all. Because I treat the dataset whose arrays may have length greater than 1 as my "master list" (right dataset), and the dataset with singleton arrays as my (possibly duplicated) left dataset, it makes sense to define u_x as the TF of the single element on the left (with TFs calculated from the RIGHT dataset, though). If I had left and right reversed, so that my "master list" had single-valued arrays and the other dataset had multi-valued arrays, I might want to define u_x as the sum of the TFs in the multi-valued array (because a match on any of them would count, so the probability of a coincidence adds up).
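To make those two conventions concrete (illustrative Python only, not a proposed implementation):

# tf_right maps a value to its term frequency computed from the right
# ("master list") dataset, e.g. after exploding its multi-valued fields.

def u_x_left_singleton(left_array, tf_right):
    # Left arrays have length 1 by assumption; u_x is the TF of that element.
    (value,) = left_array
    return tf_right[value]

def u_x_left_multivalued(left_array, tf_right):
    # Roles reversed: the master list has singleton arrays, so a coincidence
    # can occur on any element of the multi-valued array and the TFs add up.
    return sum(tf_right[v] for v in set(left_array))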

If a match level requires, say, a 2-element match, then u_x would have to be defined in terms of the probability of seeing any 2-element subset in the "master list", and implementing this would be challenging unless the other data set only had arrays of length 2 (so we'd know which 2-element subset we were talking about) - otherwise the combinatorial analysis becomes more complicated ...

I will email you to set up a chat and send you a notebook showing how I applied the fork approach to one of the sample data sets; that may make this example clearer.

@samkodes
Contributor Author

samkodes commented Mar 3, 2024

Another example:

Even if an array match level only requires 1 intersection, the analysis of u_x becomes complicated as soon as both arrays can have more than 1 entry.

Above I suggested that when at most 1 array could have more than 1 entry, there were two cases depending on which was the "master list" - either u_x=TF of the singleton (when "master list" has multiple entries) or u_x=sum of TF's in the multi-valued array (when "master list" has singleton entries).

When both arrays can have more than 1 entry, the upper bound of u_x = sum of TF's in the non-master entry, and the lower bound of u_x = maximum TF of the non-master entry. But the sum of TF's figure overcounts when several of its entries are found in a single "master list" entry. A combinatorial analysis could perhaps make this more precise and suggest cheap estimates of this overcount; alternatively a model could be used for the array-generating process (e.g. Poisson for array size, followed by independent selection of entries --- but then results depend on the model assumptions, and array entries are unlikely to be independent! Moreover under an independence model overcounting should be a very minor issue and may be ignored unless arrays are very long or some terms are incredibly frequent .... )
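In code, those bounds are simply (illustrative only):

def u_x_bounds(non_master_array, tf):
    # tf maps each value to its term frequency in the "master list".
    tfs = [tf[v] for v in set(non_master_array)]
    lower = max(tfs)  # at least the chance of coinciding on its most common entry
    upper = sum(tfs)  # overcounts when several entries hit the same master entry
    return lower, upper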

@RobinL
Member

RobinL commented Mar 4, 2024

Still struggling a bit to get my head around this.

Maybe looking at the SQL, or taking specific examples, will help clarify ideas. Would you be able to look at the following example, and use it to construct a new example of the behaviour you want? I don't mean to suggest you should write full SQL - just to give me an idea of the modification you're looking for in the SQL and how it would apply.

Here's a test script:

Script to view sql
import logging

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    exact_match,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000


fuzzy_first_name_with_tf_adj = {
    "output_column_name": "first_name",
    "comparison_levels": [
        {
            "sql_condition": '"first_name_l" IS NULL OR "first_name_r" IS NULL',
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": '"first_name_l" = "first_name_r"',
            "label_for_charts": "Exact match",
            "tf_adjustment_column": "first_name",
            "tf_adjustment_weight": 1.0,
        },
        {
            "sql_condition": 'levenshtein("first_name_l", "first_name_r") <= 2',
            "label_for_charts": "Levenshtein <= 2",
            "tf_adjustment_column": "first_name",
            "tf_adjustment_weight": 0.5,
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
    "comparison_description": "Exact match vs. First_Name within levenshtein threshold 2 vs. anything else",
}

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"]),
        block_on(["surname"]),
    ],
    "comparisons": [
        fuzzy_first_name_with_tf_adj,
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_intermediate_calculation_columns": True,
}


linker = DuckDBLinker(df, settings)
logging.getLogger("splink").setLevel(1)
df_predict = linker.predict()

In that script, we're using levenshtein on first_name, and allowing TF adjustments on both the exact match and the fuzzy match level.

The sql is

full sql
CREATE TABLE __splink__df_predict_e91631574
        AS
        (WITH __splink__df_concat_with_tf as (select * from __splink__df_concat_with_tf_6849394a0), 
__splink__df_blocked as (
            select
            "l"."unique_id" AS "unique_id_l", "r"."unique_id" AS "unique_id_r", "l"."first_name" AS "first_name_l", "r"."first_name" AS "first_name_r", "l"."tf_first_name" AS "tf_first_name_l", "r"."tf_first_name" AS "tf_first_name_r", "l"."surname" AS "surname_l", "r"."surname" AS "surname_r", "l"."dob" AS "dob_l", "r"."dob" AS "dob_r"
            , '0' as match_key
            
            from __splink__df_concat_with_tf as l
            inner join __splink__df_concat_with_tf as r
            on
            (l."first_name" = r."first_name")
            where l."unique_id" < r."unique_id"
            
             UNION ALL 
            select
            "l"."unique_id" AS "unique_id_l", "r"."unique_id" AS "unique_id_r", "l"."first_name" AS "first_name_l", "r"."first_name" AS "first_name_r", "l"."tf_first_name" AS "tf_first_name_l", "r"."tf_first_name" AS "tf_first_name_r", "l"."surname" AS "surname_l", "r"."surname" AS "surname_r", "l"."dob" AS "dob_l", "r"."dob" AS "dob_r"
            , '1' as match_key
            
            from __splink__df_concat_with_tf as l
            inner join __splink__df_concat_with_tf as r
            on
            (l."surname" = r."surname")
            where l."unique_id" < r."unique_id"
            AND NOT (coalesce((l."first_name" = r."first_name"),false))
            ), 
__splink__df_comparison_vectors as (
    select "unique_id_l","unique_id_r","first_name_l","first_name_r",CASE WHEN "first_name_l" IS NULL OR "first_name_r" IS NULL THEN -1 WHEN "first_name_l" = "first_name_r" THEN 2 WHEN levenshtein("first_name_l", "first_name_r") <= 2 THEN 1 ELSE 0 END as gamma_first_name,"tf_first_name_l","tf_first_name_r","surname_l","surname_r",CASE WHEN "surname_l" IS NULL OR "surname_r" IS NULL THEN -1 WHEN "surname_l" = "surname_r" THEN 1 ELSE 0 END as gamma_surname,"dob_l","dob_r",CASE WHEN "dob_l" IS NULL OR "dob_r" IS NULL THEN -1 WHEN "dob_l" = "dob_r" THEN 1 ELSE 0 END as gamma_dob,match_key 
    from __splink__df_blocked
    ), 
__splink__df_match_weight_parts as (
    select "unique_id_l","unique_id_r","first_name_l","first_name_r",gamma_first_name,"tf_first_name_l","tf_first_name_r",CASE 
WHEN
gamma_first_name = -1
THEN cast(1.0 as float8)
 
WHEN
gamma_first_name = 2
THEN cast(163.97484984984985 as float8)
 
WHEN
gamma_first_name = 1
THEN cast(2.4703796561604605 as float8)
 
WHEN
gamma_first_name = 0
THEN cast(0.025404270177413344 as float8)
 END as bf_first_name ,CASE WHEN  gamma_first_name = -1 then cast(1 as float8) WHEN  gamma_first_name = 2 then
    (CASE WHEN coalesce("tf_first_name_l", "tf_first_name_r") is not null
    THEN
    POW(
        cast(0.0057935713975033705 as float8) /
    (CASE
        WHEN coalesce("tf_first_name_l", "tf_first_name_r") >= coalesce("tf_first_name_r", "tf_first_name_l")
        THEN coalesce("tf_first_name_l", "tf_first_name_r")
        ELSE coalesce("tf_first_name_r", "tf_first_name_l")
    END)
    ,
        cast(1.0 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_first_name = 1 then
    (CASE WHEN coalesce("tf_first_name_l", "tf_first_name_r") is not null
    THEN
    POW(
        cast(0.0057935713975033705 as float8) /
    (CASE
        WHEN coalesce("tf_first_name_l", "tf_first_name_r") >= coalesce("tf_first_name_r", "tf_first_name_l")
        THEN coalesce("tf_first_name_l", "tf_first_name_r")
        ELSE coalesce("tf_first_name_r", "tf_first_name_l")
    END)
    ,
        cast(0.5 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_first_name = 0 then cast(1 as float8) END as bf_tf_adj_first_name ,"surname_l","surname_r",gamma_surname,CASE 
WHEN
gamma_surname = -1
THEN cast(1.0 as float8)
 
WHEN
gamma_surname = 1
THEN cast(194.275 as float8)
 
WHEN
gamma_surname = 0
THEN cast(0.05024570024570029 as float8)
 END as bf_surname ,"dob_l","dob_r",gamma_dob,CASE 
WHEN
gamma_dob = -1
THEN cast(1.0 as float8)
 
WHEN
gamma_dob = 1
THEN cast(543.5567010309278 as float8)
 
WHEN
gamma_dob = 0
THEN cast(0.05008754038589973 as float8)
 END as bf_dob ,match_key 
    from __splink__df_comparison_vectors
    ) 
    select
    log2(cast(0.010101010101010102 as float8) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_dob) as match_weight,
    CASE WHEN bf_first_name = cast('infinity' as float8) OR bf_tf_adj_first_name = cast('infinity' as float8) OR bf_surname = cast('infinity' as float8) OR bf_dob = cast('infinity' as float8) THEN 1.0 ELSE (cast(0.010101010101010102 as float8) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_dob)/(1+(cast(0.010101010101010102 as float8) * bf_first_name * bf_tf_adj_first_name * bf_surname * bf_dob)) END as match_probability,
    "unique_id_l","unique_id_r","first_name_l","first_name_r",gamma_first_name,"tf_first_name_l","tf_first_name_r",bf_first_name,bf_tf_adj_first_name,"surname_l","surname_r",gamma_surname,bf_surname,"dob_l","dob_r",gamma_dob,bf_dob,match_key 
    from __splink__df_match_weight_parts
    
    order by 1
    )

Let's take a look at how the term frequency adjustments are applied in the sql.

Bayes factor first name:

CASE
WHEN
gamma_first_name = -1
THEN cast(1.0 as float8)
 
WHEN
gamma_first_name = 2
THEN cast(163.97484984984985 as float8)
 
WHEN
gamma_first_name = 1
THEN cast(2.4703796561604605 as float8)
 
WHEN
gamma_first_name = 0
THEN cast(0.025404270177413344 as float8)
 END as bf_first_name

And then the term frequency bayes factor adjustment is:

CASE WHEN  gamma_first_name = -1 then cast(1 as float8) WHEN  gamma_first_name = 2 then
    (CASE WHEN coalesce("tf_first_name_l", "tf_first_name_r") is not null
    THEN
    POW(
        cast(0.0057935713975033705 as float8) /
    (CASE
        WHEN coalesce("tf_first_name_l", "tf_first_name_r") >= coalesce("tf_first_name_r", "tf_first_name_l")
        THEN coalesce("tf_first_name_l", "tf_first_name_r")
        ELSE coalesce("tf_first_name_r", "tf_first_name_l")
    END)
    ,
        cast(1.0 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_first_name = 1 then
    (CASE WHEN coalesce("tf_first_name_l", "tf_first_name_r") is not null
    THEN
    POW(
        cast(0.0057935713975033705 as float8) /
    (CASE
        WHEN coalesce("tf_first_name_l", "tf_first_name_r") >= coalesce("tf_first_name_r", "tf_first_name_l")
        THEN coalesce("tf_first_name_l", "tf_first_name_r")
        ELSE coalesce("tf_first_name_r", "tf_first_name_l")
    END)
    ,
        cast(0.5 as float8)
    )
    ELSE cast(1 as float8)
    END) WHEN  gamma_first_name = 0 then cast(1 as float8) END 

as bf_tf_adj_first_name

Here:
163.9748 is the Bayes factor corresponding to the average exact match
2.4703 is the Bayes factor corresponding to the average levenshtein <= 2 match

Then, the TF adjustment is defined as:

  • for an exact match, the delta between the average u value (0.00579) for an exact match and the relative term frequency of the name, expressed as a Bayes factor
  • for the levenshtein level, the same delta, but with a further modification to reduce by half (0.5) the impact of the adjustment

We then multiply bf_first_name by bf_tf_adj_first_name. So in the case of the levenshtein level, it combines the average u value for the levenshtein level with half (0.5) of the term frequency adjustment for the exact match.
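Putting illustrative numbers on that, for a levenshtein-level match on a name with an assumed term frequency of 0.02:

bf_first_name = 2.4703796561604605    # average Bayes factor for the levenshtein level (from the SQL)
u_avg_exact = 0.0057935713975033705   # average u for an exact first name match (from the SQL)
tf_term = 0.02                        # tf of the more common of the two terms (assumed)
tf_adjustment_weight = 0.5

bf_tf_adj = (u_avg_exact / tf_term) ** tf_adjustment_weight  # about 0.54
final_bf = bf_first_name * bf_tf_adj                         # about 1.33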

@RobinL
Member

RobinL commented Mar 4, 2024

Separately, I've had a think about how it may be possible to combine term frequencies and array columns using the current Splink codebase - specifically, whether it's possible to simultaneously:

  • account for the size of the intersection
  • account for the relative token frequencies
  • make it work for an arbitrary bag of words column

You can find a readme and some code here:
https://github.com/RobinL/array_tf_ideas

This doesn't get you a highly granular tf adjustment, i.e. an adjustment that's different for every single combination of tokens that intersect, but it does get you quite a long way towards decent predictions.

@samkodes
Contributor Author

samkodes commented Mar 4, 2024

I like the above approach for combining arrays and TFs but feel like it makes different assumptions than when the array is just meant to be a multi-valued set of options.

In effect the above scoring approach assumes (1) independent selection of intersecting terms (conditional on size of intersection) and (2) that a bigger intersection is stronger evidence of a match. Neither is always the case.

(There should also be some sort of adjustment for both array sizes to view this scoring as an explicit data generating process measuring the "rarity" of a pairing. Possibly multiply by combinatorial coefficients (N_1 choose k)*(N_2 choose k) where k is the size of the intersection and N_1, N_2 the sizes of the arrays in the pair - I think this amounts to conditioning on N_1, N_2 and ignoring higher-order corrections required to avoid an intersection of larger size? )

In this particular application, you might also want to simply exclude stopwords and their abbreviations when data cleaning (like "limited" or "ltd"); standardizing abbreviations will lower the score when they are present because of assumption (2).

@samkodes
Contributor Author

samkodes commented Mar 4, 2024

I would like to recant my proposal that u_inexact/u_exact is an appropriate inflation factor; I now believe that it is not, and that more precise calculations are preferable.

The current approach sets match weight to:
log( m_inexact / ((u_inexact/u_exact)*TF)),
ie this sets
u_x = (u_inexact/u_exact)*TF

Unfortunately, I now believe this is wrong, even if we view u_inexact/u_exact as an "inflation factor" accounting for increased rate of coincidences due to fuzzy match. This is because the TF of 1 term, even scaled, cannot be expected to approximate the TF of all fuzzy matches to that term.

For example, suppose we are trying to find a match for "Sam" in a dataset that contains 2 "Sam"s and 8 "Spam"s . Fuzzy match (damerau-levenshtein=1) will hit all 8 Spams (assuming exact matches are not counted here), but TF for Sam remains 2/10. We would need an inflation factor of 4. But a different value in general will need a different inflation factor; for example, if our dataset also had 2 "Peg"s and 18 "Meg"s, when looking for a match for "Peg" we would find all 18 "Megs"; TF for Peg would be 2/30 and we would need an inflation factor of 9.

There is no reason in general for the TF of "Sam" to tell us anything about the TF of "Spam", or the TF of "Peg" to tell us anything about the TF of "Meg". (If TFs were in some sense continuous with respect to the fuzzy-distance metric, that would be another story).

In other words, even if u_inexact/u_exact measured the average "neighborhood size" of the fuzzy match in terms of nearby unique values (or even non-unique values i.e. records), this size is in general unrelated to any particular TF itself. This means that variation in individual TFs will in general add noise to the formula above rather than making it more individualized, which is the whole point of TF adjustments.

The correct value for u_x for "Sam" is the "neighbourhood size" of the fuzzy match operator applied to "Sam" , i.e. the number of records fuzzy-matched to "Sam", divided by the master list's size.

One way to calculate u_x for a fuzzy match level is then to simply do a complete fuzzy match calculation of all values against all other values. We don't have to do this for each record in the master list, just for the distinct values, and then total up by each value's multiplicity. That saves some work, but can still be impractical for high cardinality fields.
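A brute-force sketch of that calculation (illustrative only; this is exactly the part that becomes impractical for high cardinality fields):

from collections import Counter

def fuzzy_u_table(master_values, is_fuzzy_match):
    # For each distinct value x, u_x = (number of master-list records whose
    # value fuzzy-matches x) / (size of the master list). Depending on how the
    # comparison levels are ordered, you may want to exclude exact matches
    # (v == x) from the count.
    counts = Counter(master_values)
    n = sum(counts.values())
    return {
        x: sum(c for v, c in counts.items() if is_fuzzy_match(x, v)) / n
        for x in counts
    }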

If the fuzzy match satisfies the triangle inequality, there should be efficient algorithms for doing this in large datasets. (e.g. using anchor points to partition the dataset into neighbourhoods and rule out comparisons between the local neighbourhoods of distant anchor points). An algorithm and python implementation for doing this should already exist - it is a generic task - count the size of the epsilon-neighbourhood of each point in a generic metric space without calculating a complete distance matrix. Questionable whether this can be implemented easily in SQL though!

An alternative would be to estimate this by random sampling for each value. Since fuzzy matches are expected to have a higher hit-rate than exact matches, sampling may be able to provide reliable estimates with a practical number of samples per value (e.g. 1k-10k?). The fuzzier the match, the more effective a sampling approach would be.

While the disable_tf_inflation logic I've submitted could be used to omit the u_inexact/u_exact factor and use a raw user-supplied TF value, every match level would need its own TF table (better to call it a u_x table), so changes to this part of the architecture would also be required.

For the simple case of array intersections I am considering, none of this matters. But in the general case if my analysis is correct, it means the current approach to inexact-match TF adjustments may be unwise?

@RobinL
Member

RobinL commented Mar 4, 2024

Agreed - the current approach isn't very sound. I think it only really works in the case of small typos, where one side of the match is 'not a word', e.g. John vs Jhon, but not Sam vs Spam. We don't use it very often for this reason, but there are some scenarios in which it might improve accuracy a little despite not being very theoretically grounded, especially in cases with heavy data skew (e.g. city name).

@samkodes
Contributor Author

samkodes commented Mar 5, 2024

A few more thoughts on the suggested approach to combining TFs and arrays above.

Your proposed approach may be thought of as hypothesis testing under a particular null hypothesis model (sketched above), with match levels corresponding to different p-value thresholds. This statistical model is not in general related to the statistical model used for Fellegi-Sunter, but is just one way of generating "scores" representing match strength.

Like any method of measuring match strength, different measures will suit different domains and our understanding of what constitutes a match. One downside to the proposal is that it pays no attention to non-matching terms.

Two alternative approaches that would penalize non-matching terms are inverse-TF-weighted symmetric difference (i.e. (sum of weights of union) - (sum of weights of intersection)) and inverse-TF-weighted Jaccard distance (i.e. inverse-TF-weighted symmetric difference / (sum of weights of union)). These are not probability models themselves; the first can take arbitrary values while the second is in [0,1]. I believe both of these are metrics (they satisfy the triangle inequality), so they could be used with the efficient exact methods of calculating u_x I mentioned above.
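A sketch of both, assuming the inverse-TF weight of a token is 1/tf (other decreasing functions of tf, e.g. -log(tf), would work the same way):

def weighted_symmetric_difference(a, b, tf):
    # Weight each token by the inverse of its term frequency.
    a, b = set(a), set(b)
    union_w = sum(1.0 / tf[t] for t in a | b)
    inter_w = sum(1.0 / tf[t] for t in a & b)
    return union_w - inter_w

def weighted_jaccard_distance(a, b, tf):
    a, b = set(a), set(b)
    union_w = sum(1.0 / tf[t] for t in a | b)
    inter_w = sum(1.0 / tf[t] for t in a & b)
    return (union_w - inter_w) / union_w if union_w else 0.0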

@samkodes
Contributor Author

samkodes commented Mar 5, 2024

Some notes coming out of our conversation today (very fruitful!):

  • Currently, T_x is multiplied by the TF-"inflation factor" (u_inexact/u_exact) to get u_x for inexact matches.
  • In the general case this is inappropriate because inexact matches may have very different multiplicities (TFs) than x (e.g. 2 "Sam" and 10 "Sham") - discussed earlier in thread.
  • In the case where inexact matching is very unlikely to result in anything except typos, the inflation factor is also inappropriate. Here we want to use T_x directly. e.g. "Londno" 1-Levenshtein matching "London" should use u_Londno = T_London , because there are no other cities 1-Levenshtein away from "Londno". Typos matching typos will be relatively rare. In general, the situation when this approach makes sense is when the "master list" consists of sparse elements each further away from each other than the fuzzy match radius, with typos likely to fuzzy match only a single element on the "master list".
  • In the above case (sparse master list), the "inflation factor" u-inexact/u-exact is very hard to interpret anyway because of the conditional structure of the algorithm; u-inexact may not in fact be any higher than u-exact since exact matches are excluded from u-inexact; i.e. it might "deflate".

Conclusion: in general, avoiding this "inflation factor" might be a better idea.

@RobinL
Member

RobinL commented Mar 5, 2024

Thanks for the notes re our chat.

Re the arrays:

  • Yeah - I agree there are cases in which we want to pay attention to non-matching terms, and it would be reasonably straightforward to adjust the calculation to somehow 'punish' them; there are various ways you could do this.
  • I think one of the strengths of using these scores indirectly (as comparison levels), as opposed to using the scores directly (e.g. as a u value to drive match weights), is that they then get parameter estimates just like any other comparison level - i.e. the calculation allows us to bin matches into different groups (comparison levels). So long as those groups correspond pretty well to 'matchiness' (i.e. members of the group are similarly 'matchy'), the approach gives you something useful. Overall I think the exact approach taken is probably data-dependent, but the high-level approach is promising for quite a wide variety of use cases.
