Replies: 5 comments 1 reply
-
It's a good question. We don't have many good solutions for this at the moment, but here are some pointers. You'd create the arrays in preprocessing (i.e. before importing the data into Splink) using whatever algorithm you wanted (n-grams or whatever).
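For example, a preprocessing step might look something like this (just a sketch; the `address_ngrams` column name and `char_ngrams` helper are made up, not part of Splink's API):

```python
# Hypothetical preprocessing step: derive an array of character n-grams
# for each record before the data is loaded into Splink, so blocking
# can later be done on shared array elements rather than exact matches.
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return the distinct character n-grams of a normalised string."""
    s = " ".join(text.lower().split())  # normalise case and whitespace
    return sorted({s[i : i + n] for i in range(len(s) - n + 1)})

records = [
    {"id": 1, "address": "10 Downing Street, London"},
    {"id": 2, "address": "10 downing st london"},
]
for r in records:
    # Made-up column name; any array column works the same way
    r["address_ngrams"] = char_ngrams(r["address"])
```

Two records whose long strings share even a few n-grams will then have overlapping arrays, which is what the blocking can key on.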
-
Hi @RobinL
-
This is quite late to the party, but you may be interested in RapidFuzz's Ratio and WRatio; they're quite clever ways to compare longer strings based on tokens (WRatio specifically if you are processing strings of varying length).
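As a rough stdlib stand-in for what a token-based ratio does (RapidFuzz's own implementations are far faster and more sophisticated; this is only to show the idea):

```python
import difflib

def token_sort_ratio(a: str, b: str) -> float:
    """Simplified token-based similarity: sort the tokens first so word
    order doesn't matter, then compare the joined strings (0..100)."""
    norm = lambda s: " ".join(sorted(s.lower().split()))
    return 100 * difflib.SequenceMatcher(None, norm(a), norm(b)).ratio()

# Reordered words in a long field still score as a perfect match:
token_sort_ratio("large blue widget", "widget blue large")
```

This token-sorting trick is why such measures cope with long, loosely ordered fields (addresses, product descriptions) better than plain Levenshtein distance.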
-
@lamaeldo That seems like a very expensive way to do blocking, no? It is O(n^2) since you have to compare all the pairs with each other.
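To put numbers on that: without blocking, every record is compared with every other, i.e. n(n-1)/2 pairs, which grows quadratically.

```python
def n_pairs(n: int) -> int:
    """Number of record pairs to compare without any blocking."""
    return n * (n - 1) // 2

# A thousand records is still fine; a million records means
# roughly half a trillion pairwise comparisons.
n_pairs(1_000)      # 499,500 pairs
n_pairs(1_000_000)  # ~5e11 pairs
```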
-
Oh absolutely, it would be terribly inefficient, but it still seems fast enough to work on smaller datasets. And I think it could be of most use in comparison levels.
-
Hello,
Thanks for open-sourcing Splink.
What is the optimal blocking rule for longer text fields such as full address or product description fields? These fields may contain a few words or many words. So, a blocking rule simply based on exact matches or Levenshtein distance does not work.
For example, the dedupe library has the following predicates for short strings:
vs. long text:
I don't see, for example, how to use n-grams in Splink. The examples in the docs are all based on short strings. What do you recommend for long strings?
I am using the duckdb backend.
Thanks.