Hiya. Sorry for the delay in responding, I have been on leave. Overall, I think the approach you're taking seems sensible, though there may be some room for improvement.

Mechanics of nulls and match weight estimation

The way you're approaching the NULL issue looks sensible to me. I thought it was worth being totally clear on the mechanics here, in case it helps you identify a better way to deal with nulls. By default, Splink treats nulls as a special case: a comparison level marked with 'is_null_level': True is excluded from match weight estimation and contributes a match weight of zero to predictions.
This is just default behaviour: it's entirely possible/allowable to estimate match weights for one or more null levels (or to omit any null levels). This is what you've done for level (1.) in your list. You could also get Splink to estimate match weights for level (0.) in your list by providing:

{'sql_condition': '"Middle_Name_l" IS NULL OR "Middle_Name_r" IS NULL',
 'label_for_charts': 'Null'}
# Note: there's no 'is_null_level': True here

(I'm not saying this is a good idea, I'm just mentioning it so you're aware of the mechanics.)

Possible improvements/things you could try

I've put some ideas below, but they're only rough ideas; I'm not certain they'd be better than what you're already doing. Just in the spirit of brainstorming.

First, you're probably doing this already, but it feels like turning term frequency adjustments on is particularly important here: the name 'John' should be common if it's a 'special' name, i.e. it's really the second name that's distinguishing the individual. However, this feels like a case where correlation between first name and second name might bite you. You mentioned it's only certain communities who utilize middle names to represent the individual. It seems likely that the distribution of second names will not be independent of the first name, and you'll significantly overweight a match on first name and second name where it exists (so, for example, the match weight on…).

Perhaps there's scope for modelling the 'full set of first names' here, possibly with a flag for when the second name is null, e.g. comparison levels of: …
Some other thoughts:

You could then design comparison levels that explicitly account for that flag: perhaps a strong negative match weight for the case of a special first name and a non-matching second name. Or you could even consider nulling out the first name where…
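One hedged sketch of what such flag-aware levels might look like, again in the dict-settings style used above. The special_name_flag column and the fixed m/u values are assumptions for illustration only (the flag would need to be derived upstream, e.g. from a list of names used conventionally within these communities):

```python
# Illustrative only: 'special_name_flag' is a hypothetical derived
# boolean column, and the fixed m/u values are made up to show how a
# strong negative match weight could be forced for the case of a
# special first name with a non-matching second name.
name_comparison = {
    "output_column_name": "Name",
    "comparison_levels": [
        {
            "sql_condition": (
                '"special_name_flag_l" AND "special_name_flag_r" '
                'AND "Middle_Name_l" <> "Middle_Name_r"'
            ),
            "label_for_charts": "Special first name, middle name mismatch",
            # m << u  =>  Bayes factor m/u << 1  =>  strong negative
            # match weight, since the weight is log2(m/u).
            "m_probability": 0.001,
            "u_probability": 0.5,
        },
        {
            "sql_condition": '"First_Name_l" = "First_Name_r"',
            "label_for_charts": "Exact match on first name",
        },
        {
            "sql_condition": "ELSE",
            "label_for_charts": "All other comparisons",
        },
    ],
}
```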
In a related consideration to the twins problem mentioned in #2023, I have been encountering an issue related to twins (or triplets and quadruplets in my case) in communities within a population who utilize middle names to represent the individual.
This has been causing me some issues for two reasons:
This has resulted in difficulty separating siblings who share DoBs in these situations, resulting in a bias towards false positives for these communities.
The problem
Consider the following link-only problem, assuming that the middle name is used to differentiate the twins:

Dataset A:
Dataset B:
In this case the null level comparison
"Middle_Name_l" IS NULL OR "Middle_Name_r" IS NULL
will result in a prediction producing the same final match weight for linking "John Alan Doe"<->"John Doe" and "John Bob Doe"<->"John Doe", as it skips comparing the middle name field. This results in a prediction table where we have a correct strong link for A[0]<->B[0], weaker equal-weight links for A[0]<->B[1] and A[1]<->B[1], and a lower link for A[1]<->B[0].
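To make the mechanics concrete, here is a tiny self-contained sketch of why a skipped null level produces identical totals for the two pairs. The Bayes factors are made-up illustrative numbers, not values from any real model:

```python
import math

def match_weight(bayes_factors):
    # The final match weight is the sum of log2 Bayes factors
    # across the comparison fields.
    return sum(math.log2(bf) for bf in bayes_factors)

# Made-up Bayes factors: first name match = 20, surname match = 30.
# The middle name comparison hits the null level, which contributes a
# Bayes factor of 1 (match weight 0) because it is skipped.
w_alan_vs_john = match_weight([20, 1, 30])  # "John Alan Doe" <-> "John Doe"
w_bob_vs_john = match_weight([20, 1, 30])   # "John Bob Doe"  <-> "John Doe"

assert w_alan_vs_john == w_bob_vs_john  # middle name carried no signal
```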
Depending on the threshold selected, clustering either returns:
or, with a higher threshold:
The first erroneously brings the A[1] record into the cluster, while the higher threshold over-cautiously creates 3 clusters. The high threshold is likely to impact non-twins/triplets etc. in the wider dataset, and so tuning (as ever) is tricky.
How do I deal with this at the moment?
I currently train 3 models; the first two make use of the dataset-specific Person_IDs to get really good deduplication links. The third model then makes the above link-only links. This third model has a custom comparison which changes the null level's logical OR to an AND, and also introduces a level that penalizes the case where only one of the records contains a NULL:
"Middle_Name_l" IS NULL AND "Middle_Name_r" IS NULL
("Middle_Name_l" IS NULL AND "Middle_Name_r" IS NOT NULL) OR ("Middle_Name_l" IS NOT NULL AND "Middle_Name_r" IS NULL)
"Middle_Name_l" = "Middle_Name_r"
ELSE
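Assembled into Splink 3's dict-settings format, the four levels above might look roughly like this (the labels and the placement of 'is_null_level': True on the both-null level are my assumptions about the setup described, not copied from the actual model):

```python
# Sketch of the custom Middle_Name comparison: both-null acts as the
# null level, while one-side-null is estimated as its own level and
# in practice learns a penalty. Labels are assumed, not the original.
custom_middle_name_comparison = {
    "output_column_name": "Middle_Name",
    "comparison_levels": [
        {
            # Level 1: both sides null -> treated as the null level.
            "sql_condition": '"Middle_Name_l" IS NULL AND "Middle_Name_r" IS NULL',
            "label_for_charts": "Both null",
            "is_null_level": True,
        },
        {
            # Level 2: exactly one side null -> scored like any other
            # level, allowing a penalty to be learned.
            "sql_condition": (
                '("Middle_Name_l" IS NULL AND "Middle_Name_r" IS NOT NULL) '
                'OR ("Middle_Name_l" IS NOT NULL AND "Middle_Name_r" IS NULL)'
            ),
            "label_for_charts": "One side null",
        },
        {
            "sql_condition": '"Middle_Name_l" = "Middle_Name_r"',
            "label_for_charts": "Exact match",
        },
        {
            "sql_condition": "ELSE",
            "label_for_charts": "All other comparisons",
        },
    ],
}
```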
Comparison level 2 applies a penalty in the situation where one record has a middle name but the other does not. I can then use this model to obtain my link-only predictions, and then, combined with my two dedupe-only models, I can get decent clustering of these situations.

Thoughts?
I understand that the NULL problem is an inherently tricky one. I'm wondering if there are other ways to approach this. At the moment I'm fortunate that each respective dataset has decent internal personal identifiers so the 3-model approach helps, but this does seem like an issue that would overly impact certain communities even more than the usual twin problem.
I'm aware of clustering methods in other fields that may go beyond simple connected components, but I'm not sure how they would be applied here.
I'm also not too sure how the edits to the null level logic may affect the underlying behavior of the model itself, especially in edge cases I may not have thought about.