Entity resolution of PERSON entities with multiple addresses #2099
Hello, I am exploring using Splink for the first time for deduplicating PERSON entities in a data source. Thanks so much for the effort and thought that has gone into Splink. It is a great product and very easy to use!

For my data it is common for a PERSON to have multiple associated addresses (and phone numbers, e-mail addresses). Is there a preferred method for making use of the information in multiple addresses? If you had a maximum of 2 addresses you could generate 4 comparisons like:

I also thought about using a graph approach. You could create a graph using PERSON and ADDRESS entities as nodes, with edges between nodes when a PERSON is associated with an ADDRESS. You could then calculate the shortest path between PERSON entities to determine if they share an ADDRESS. However, calculation of shortest paths seems like an expensive operation to run for every comparison in my blocks. Another downside is that any de-duplications would change the graph so that you would require an iterative approach.

Thank you so much for reading my question!
Replies: 1 comment 3 replies
If you store the addresses in an array column (i.e. a single column, with each row containing a list of addresses), then you can use an array intersection comparison.
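To make the idea concrete, here is a minimal plain-Python sketch of what an array intersection comparison measures. It is not Splink's API: the function name, the level labels, and the whitespace/case normalisation are all assumptions made for illustration.

```python
def address_overlap_level(addresses_l, addresses_r):
    """Map two lists of address strings to a coarse comparison level.

    Hypothetical helper for illustration only; not part of Splink.
    """
    # Treat a missing or empty list as "nothing to compare".
    if not addresses_l or not addresses_r:
        return "null"

    # Assumed normalisation: trim whitespace and ignore case.
    left = {a.strip().lower() for a in addresses_l}
    right = {a.strip().lower() for a in addresses_r}

    shared = left & right
    if len(shared) >= 2:
        return "two or more shared addresses"
    if len(shared) == 1:
        return "one shared address"
    return "no shared address"


person_a = ["12 High Street, Leeds", "4 Mill Lane, York"]
person_b = ["4 mill lane, york", "9 Park Road, Hull"]
print(address_overlap_level(person_a, person_b))
```

Each record pair gets one comparison on the whole array, rather than one comparison per pair of address columns, so the number of addresses per PERSON does not need to be fixed in advance.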
If you need the comparisons to also allow for fuzzy matches, it's possible to do that using the Spark linker at the moment, see this comment:
#1994 (comment)
More broadly, this issue contains some ideas, and also some sample code for how a fuzzy array comparison could be implemented in DuckDB:
#1994
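For readers who want a feel for what a fuzzy array comparison computes, below is a rough plain-Python sketch. It is not the code from #1994 and not Splink's API: the function name is hypothetical, and `difflib.SequenceMatcher` is only a standard-library stand-in for whatever string similarity measure you would actually use.

```python
from difflib import SequenceMatcher
from itertools import product


def best_pairwise_similarity(addresses_l, addresses_r):
    """Return the highest similarity over all cross-pairs of addresses.

    Hypothetical helper for illustration only. Returns None when either
    side has no addresses, so the caller can treat it as a null level.
    """
    if not addresses_l or not addresses_r:
        return None

    # Compare every address on the left with every address on the right
    # and keep the best score (a value between 0.0 and 1.0).
    return max(
        SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
        for a, b in product(addresses_l, addresses_r)
    )


score = best_pairwise_similarity(
    ["12 High Street", "4 Mill Lane"],
    ["12 High St", "9 Park Road"],
)
print(round(score, 3))
```

The score could then be bucketed into comparison levels with thresholds of your choosing. Note that the work grows with the product of the two array lengths, which is one reason to keep the address arrays short.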