
[FEAT] Allow fuzzy matches on array-valued columns #1994

Open
samkodes opened this issue Feb 23, 2024 · 6 comments · May be fixed by #2195
Labels
enhancement New feature or request

Comments

@samkodes
Contributor

Is your proposal related to a problem?

One typical use of array-valued columns is when a known entity has multiple values for a single attribute, a match on any of which is acceptable. For example, one person may have multiple telephone numbers, an address history, or even multiple names (married/unmarried names, aliases, etc.). While exploding the records outside of Splink and post-processing the matches is one alternative to array-valued columns, this affects match probabilities and may create memory challenges.

Currently the only way to compare array-valued columns is to use the array_intersect functions, which require exact matches among the elements of the array.

However, in many circumstances, these values are prone to error and fuzzy matching may be required to identify matches.

Describe the solution you'd like

I propose a family of matching functions of the form array_min_sim(x, y, simtype) and array_max_sim(x, y, simtype). Here x and y are arrays, and "simtype" is a string identifying one of the existing fuzzy string comparison metrics (alternatively, a separate array function could be implemented for each metric).

The semantics would be to return the maximum or minimum similarity (alternatively, distance) between an element of array x and an element of array y. For example, a minimum-distance variant applied to ['Jon', 'Jonathan'] and ['John', 'Johnny'] with Levenshtein distance would return 1, the distance of the best-matching pair 'Jon'/'John'. These max/min similarities could be thresholded to get comparison levels.

I'm not sure what the best implementation of this would be. DuckDB may allow an implementation using an iterated call to list_reduce. For example, for array_min_sim, the underlying code could be a list_reduce applied to [Infinity, x] (Infinity prepended to x), with reduce function (a, b) -> min(a, list_reduce([Infinity, y], (c, d) -> min(c, dist(b, d)))).
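A minimal, untested sketch of that reduce-based idea in DuckDB SQL might look as follows. Two details differ from the description above: DuckDB's min is an aggregate, so the two-argument least is used instead, and list_reduce works on a homogeneous list, so the distances are computed with list_transform before reducing; list_prepend(1e9, ...) stands in for the Infinity starting value and guards against empty arrays. The column names name_l and name_r follow the convention used elsewhere in this thread.

list_reduce(
    list_prepend(1e9, list_transform(name_l, a ->
        list_reduce(
            list_prepend(1e9, list_transform(name_r, b -> levenshtein(a, b))),
            (acc, d) -> least(acc, d)  -- min distance from a to any element of name_r
        )
    )),
    (acc, d) -> least(acc, d)  -- min over all elements of name_l
)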

Alternatively, the arrays could be exploded locally and a full Cartesian product constructed temporarily (with no other variables, to reduce memory overhead), the string comparison run on the expanded data, and the results then aggregated with min/max; I believe this is similar to what is/was being considered for array-valued blocking keys (#1448) (though I can't say I understand the internals well enough to know whether this would work!). A rough sketch of this explode-and-aggregate idea follows.
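The sketch below (Spark SQL dialect, untested) illustrates the explode-and-aggregate route; candidate_pairs, pair_id and the name_l/name_r columns are hypothetical names, not Splink internals.

-- Explode both arrays, score every element pair, then aggregate back to one row per record pair.
SELECT
    pair_id,
    MIN(LEVENSHTEIN(a, b)) AS min_name_distance
FROM candidate_pairs
LATERAL VIEW EXPLODE(name_l) t1 AS a
LATERAL VIEW EXPLODE(name_r) t2 AS b
GROUP BY pair_id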

Describe alternatives you've considered

Un-nesting the data before running Splink and post-processing the matches is one option, but this affects m-values and creates memory challenges via duplicated records.

Using array intersections is another alternative, but it demands exact matching.

Additional context

@samkodes added the enhancement label Feb 23, 2024
@samkodes
Contributor Author

Adding similar requests #1582 and #1337

@RobinL
Member

RobinL commented Feb 23, 2024

Yeah - so we've been experimenting with these new DuckDB functions too. We plan to include a function in the comparison library / comparison level library eventually, but for the moment you can do it using custom SQL. Here are some pointers if you didn't work it out already:

https://gist.github.com/RobinL/d8a84f7a31fa7cb17dafb05c94518225
https://moj-analytical-services.github.io/splink/topic_guides/comparisons/customising_comparisons.html?h=custom#method-4-providing-the-spec-as-a-dictionary
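For illustration only (this is a sketch, not the contents of the gist): an expression along the following lines, used as the "sql_condition" of a custom comparison level in the dictionary spec, would accept a record pair when the smallest pairwise Levenshtein distance between the two name arrays is at most 2. The column names and the threshold are illustrative.

list_min(
    flatten(
        list_transform(name_l, a ->
            list_transform(name_r, b -> levenshtein(a, b)))  -- distances for every element pair
    )
) <= 2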

@samkodes
Contributor Author

samkodes commented Feb 23, 2024

Thanks, that may be a neater approach. Do you have any idea how to make this work on Spark? I see there are similar array functions, and will experiment with some.

@ADBond
Contributor

ADBond commented Feb 23, 2024

For Spark we define a custom UDF, DualArrayExplode, in the included JAR, which performs the Cartesian product of the arrays. So the sort of SQL you would use would be, for example:

ARRAY_MIN(
    TRANSFORM(DUALARRAYEXPLODE(name_l, name_r), x -> LEVENSHTEIN(x['_1'], x['_2']))
) <= 2

@samkodes
Contributor Author

samkodes commented Feb 23, 2024

Thanks. Following Robin's sample code, I think UDFs can be avoided with an iterated transform to do the Cartesian product (a little easier to understand than my iterated-reduce proposal, though it involves one extra array operation and possibly intermediate storage). The approach can be put together as something like this:

ARRAY_MIN(
    TRANSFORM(
        FLATTEN(TRANSFORM(name_l, a -> TRANSFORM(name_r, b -> [a, b]))),  -- make list of pairs
        p -> LEVENSHTEIN(p[1], p[2])                                      -- calc distance between elements of each pair
    )
)  -- take minimum
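For Spark specifically, the same expression needs a couple of syntax changes (a sketch, untested): pairs are built with ARRAY(a, b) rather than a bracket literal, and Spark arrays are indexed from 0, so the per-pair lookup becomes p[0] and p[1]:

ARRAY_MIN(
    TRANSFORM(
        FLATTEN(TRANSFORM(name_l, a -> TRANSFORM(name_r, b -> ARRAY(a, b)))),  -- list of element pairs
        p -> LEVENSHTEIN(p[0], p[1])                                           -- distance per pair
    )
)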

@zmbc
Contributor

zmbc commented Mar 1, 2024

I'm planning to work on a PR for this!

@JonnyShiUW linked a pull request May 21, 2024 that will close this issue