
[FEAT] Allow fuzzy matches on array-valued columns #1994

Open
samkodes opened this issue Feb 23, 2024 · 6 comments · May be fixed by #2195
Labels
enhancement New feature or request

Comments

@samkodes
Contributor

Is your proposal related to a problem?

One typical use of array-valued columns is when a known entity has multiple values for a single attribute, a match on any of which is acceptable. For example, one person may have multiple telephone numbers, an address history, or even multiple names (married/unmarried names, aliases, etc.). While exploding the records outside of Splink and post-processing the matches is one alternative to array-valued columns, this affects match probabilities and may create memory challenges.

Currently the only way to compare array-valued columns is to use the array_intersect functions, which require exact matches among the elements of the array.

However, in many circumstances, these values are prone to error and fuzzy matching may be required to identify matches.

Describe the solution you'd like

I propose a family of matching functions of the form array_min_sim(x, y, simtype) and array_max_sim(x, y, simtype). Here x and y are arrays, and "simtype" is a string identifying one of the existing fuzzy string comparison metrics (alternatively, a separate array function could be implemented for each metric).

The semantics would be to return the maximum or minimum similarity (alternatively, distance) between an element of array x and an element of array y. For example, a minimum-distance variant applied to ['Jon', 'Jonathan'] and ['John', 'Johnny'] with Levenshtein distance would return 1, the distance of the best-matching pair 'Jon'/'John'. These max/min similarities could be thresholded to get comparison levels.

I'm not sure what the best implementation of this would be. DuckDB may allow an implementation using an iterated call to list_reduce. For example, for array_min_sim, the underlying code could be a list_reduce applied to [Infinity, x] (Infinity prepended to x), with reduce function (a, b) -> min(a, list_reduce([Infinity, y], (c, d) -> min(c, dist(b, d)))).
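A minimal, untested sketch of that reduce-based idea in DuckDB SQL might look as follows. Two details differ from the description above: DuckDB's min is an aggregate, so the two-argument least is used instead, and list_reduce works on a homogeneous list, so the distances are computed with list_transform before reducing; list_prepend(1e9, ...) stands in for the Infinity starting value and guards against empty arrays. The column names name_l and name_r follow the convention used elsewhere in this thread.

list_reduce(
    list_prepend(1e9, list_transform(name_l, a ->
        list_reduce(
            list_prepend(1e9, list_transform(name_r, b -> levenshtein(a, b))),
            (acc, d) -> least(acc, d)  -- min distance from a to any element of name_r
        )
    )),
    (acc, d) -> least(acc, d)  -- min over all elements of name_l
)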

Alternatively, the arrays could be exploded locally and a full Cartesian product constructed temporarily (with no other variables, to reduce memory overhead), the string comparison run on the expanded data, and the results then aggregated with min/max; I believe this is similar to what is/was being considered for array-valued blocking keys (#1448) (though I can't say I understand the internals well enough to know whether this would work!). A rough sketch of this explode-and-aggregate idea follows.
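The sketch below (Spark SQL dialect, untested) illustrates the explode-and-aggregate route; candidate_pairs, pair_id and the name_l/name_r columns are hypothetical names, not Splink internals.

-- Explode both arrays, score every element pair, then aggregate back to one row per record pair.
SELECT
    pair_id,
    MIN(LEVENSHTEIN(a, b)) AS min_name_distance
FROM candidate_pairs
LATERAL VIEW EXPLODE(name_l) t1 AS a
LATERAL VIEW EXPLODE(name_r) t2 AS b
GROUP BY pair_id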

Describe alternatives you've considered

Un-nesting the data before running Splink and post-processing the matches is one option, but this affects m-values and creates memory challenges via duplicated records.

Using array intersections is another alternative, but it demands exact matching.

Additional context

@samkodes added the enhancement label Feb 23, 2024
@samkodes
Contributor Author

Adding similar requests #1582 and #1337

@RobinL
Member

RobinL commented Feb 23, 2024

Yeah - so we've been experimenting with these new DuckDB functions too. We plan to include a function in the comparison library / comparison level library eventually, but for the moment you can do it using custom SQL. Here are some pointers if you didn't work it out already:

https://gist.github.com/RobinL/d8a84f7a31fa7cb17dafb05c94518225
https://moj-analytical-services.github.io/splink/topic_guides/comparisons/customising_comparisons.html?h=custom#method-4-providing-the-spec-as-a-dictionary
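For illustration only (this is a sketch, not the contents of the gist): an expression along the following lines, used as the "sql_condition" of a custom comparison level in the dictionary spec, would accept a record pair when the smallest pairwise Levenshtein distance between the two name arrays is at most 2. The column names and the threshold are illustrative.

list_min(
    flatten(
        list_transform(name_l, a ->
            list_transform(name_r, b -> levenshtein(a, b)))  -- distances for every element pair
    )
) <= 2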

@samkodes
Contributor Author

samkodes commented Feb 23, 2024

Thanks, that may be a neater approach. Do you have any idea how to make this work on Spark? I see there are similar array functions, and will experiment with some.

@ADBond
Contributor

ADBond commented Feb 23, 2024

For Spark we define a custom UDF, DualArrayExplode, in the included JAR, which performs the Cartesian product of the arrays. So the sort of SQL you would use would be, for example:

ARRAY_MIN(
    TRANSFORM(DUALARRAYEXPLODE(name_l, name_r), x -> LEVENSHTEIN(x['_1'], x['_2']))
) <= 2

@samkodes
Contributor Author

samkodes commented Feb 23, 2024

Thanks. Following Robin's sample code, I think UDFs can be avoided with an iterated transform to do the Cartesian product (a little easier to understand than my iterated-reduce proposal, though it involves one extra array operation and possibly intermediate storage). The approach can be put together as something like this:

ARRAY_MIN(
    TRANSFORM(
        FLATTEN(TRANSFORM(name_l, a -> TRANSFORM(name_r, b -> [a, b]))),  -- make list of pairs
        p -> LEVENSHTEIN(p[1], p[2])                                      -- calc distance between elements of each pair
    )
)  -- take minimum
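For Spark specifically, the same expression needs a couple of syntax changes (a sketch, untested): pairs are built with ARRAY(a, b) rather than a bracket literal, and Spark arrays are indexed from 0, so the per-pair lookup becomes p[0] and p[1]:

ARRAY_MIN(
    TRANSFORM(
        FLATTEN(TRANSFORM(name_l, a -> TRANSFORM(name_r, b -> ARRAY(a, b)))),  -- list of element pairs
        p -> LEVENSHTEIN(p[0], p[1])                                           -- distance per pair
    )
)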

@zmbc
Contributor

zmbc commented Mar 1, 2024

I'm planning to work on a PR for this!

@JonnyShiUW linked a pull request May 21, 2024 that will close this issue