Replies: 5 comments 1 reply
-
It's a good question. We don't have many good solutions for this at the moment, but here are some pointers. You'd create the arrays in preprocessing (i.e. before importing the data into Splink) using whatever algorithm you wanted (n-grams or whatever).
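For example, a preprocessing step might look something like this (just a sketch; the `address_ngrams` column name and `char_ngrams` helper are made up, not part of Splink's API):

```python
# Hypothetical preprocessing step: derive an array of character n-grams
# for each record before the data is loaded into Splink, so blocking
# can later be done on shared array elements rather than exact matches.
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return the distinct character n-grams of a normalised string."""
    s = " ".join(text.lower().split())  # normalise case and whitespace
    return sorted({s[i : i + n] for i in range(len(s) - n + 1)})

records = [
    {"id": 1, "address": "10 Downing Street, London"},
    {"id": 2, "address": "10 downing st london"},
]
for r in records:
    # Made-up column name; any array column works the same way
    r["address_ngrams"] = char_ngrams(r["address"])
```

Two records whose long strings share even a few n-grams will then have overlapping arrays, which is what the blocking can key on.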
-
Hi @RobinL
-
This is quite late to the party, but you may be interested in RapidFuzz's Ratio and WRatio; they're quite clever ways to compare longer strings based on tokens (WRatio specifically if you are processing strings of varying length).
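As a rough stdlib stand-in for what a token-based ratio does (RapidFuzz's own implementations are far faster and more sophisticated; this is only to show the idea):

```python
import difflib

def token_sort_ratio(a: str, b: str) -> float:
    """Simplified token-based similarity: sort the tokens first so word
    order doesn't matter, then compare the joined strings (0..100)."""
    norm = lambda s: " ".join(sorted(s.lower().split()))
    return 100 * difflib.SequenceMatcher(None, norm(a), norm(b)).ratio()

# Reordered words in a long field still score as a perfect match:
token_sort_ratio("large blue widget", "widget blue large")
```

This token-sorting trick is why such measures cope with long, loosely ordered fields (addresses, product descriptions) better than plain Levenshtein distance.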
-
@lamaeldo That seems like a very expensive way to do blocking, no? It is O(n^2) since you have to compare all the pairs with each other.
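To put numbers on that: without blocking, every record is compared with every other, i.e. n(n-1)/2 pairs, which grows quadratically.

```python
def n_pairs(n: int) -> int:
    """Number of record pairs to compare without any blocking."""
    return n * (n - 1) // 2

# A thousand records is still fine; a million records means
# roughly half a trillion pairwise comparisons.
n_pairs(1_000)      # 499,500 pairs
n_pairs(1_000_000)  # ~5e11 pairs
```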
-
Oh absolutely, it would be terribly inefficient, but it still seems fast enough to work on smaller datasets. And I think it could be of most use in comparison levels.
-
Hello,
Thanks for open-sourcing Splink.
What is the optimal blocking rule for longer text fields such as full address or product description fields? These fields may contain a few words or many words. So, a blocking rule simply based on exact matches or Levenshtein distance does not work.
For example, the dedupe library has the following predicates for short strings:
vs. long text:
I don't see, for example, how to use n-grams in Splink. The examples in the docs are all based on short strings. What do you recommend for long strings?
I am using the duckdb backend.
Thanks.