Hiya. Sorry for the delay in responding, I have been on leave. Overall, I think the approach you're taking seems sensible, though there may be some room for improvement.

Mechanics of nulls and match weight estimation

The way you're approaching the NULL issue looks sensible to me. I thought it was worth being totally clear on the mechanics here, in case it helps you identify a better way to deal with nulls. By default, Splink treats nulls as a special case: a comparison level marked with 'is_null_level': True is excluded from match weight estimation and contributes a match weight of zero to predictions.
This is just default behaviour: it's entirely possible/allowable to estimate match weights for one or more null levels (or to omit any null levels). This is what you've done for level (1.) in your list. You could also get Splink to estimate match weights for level (0.) in your list by providing:

{'sql_condition': '"Middle_Name_l" IS NULL OR "Middle_Name_r" IS NULL',
 'label_for_charts': 'Null'}
# Note: there's no 'is_null_level': True here

(I'm not saying this is a good idea, I'm just mentioning it so you're aware of the mechanics.)

Possible improvements/things you could try

I've put some ideas below, but they're only rough ideas; I'm not certain they'd be better than what you're already doing. Just in the spirit of brainstorming.

First, you're probably doing this already, but it feels like turning term frequency adjustments on is particularly important here: the name 'John' should be common if it's a 'special' name, i.e. it's really the second name that's distinguishing the individual. However, this feels like a case where correlation between first name and second name might bite you. You mentioned it's only certain communities who utilize middle names to represent the individual. It seems likely that the distribution of second names will not be independent of the first name, and you'll significantly overweight a match on first name and second name where it exists (so, for example, the match weight on…).

Perhaps there's scope for modelling the 'full set of first names' here, possibly with a flag for when the second name is null, e.g. comparison levels of: …
Some other thoughts:

You could then design comparison levels that explicitly account for that flag: perhaps a strong negative match weight for the case of a special first name and a non-matching second name. Or you could even consider nulling out the first name where…
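One hedged sketch of what such flag-aware levels might look like, again in the dict-settings style used above. The special_name_flag column and the fixed m/u values are assumptions for illustration only (the flag would need to be derived upstream, e.g. from a list of names used conventionally within these communities):

```python
# Illustrative only: 'special_name_flag' is a hypothetical derived
# boolean column, and the fixed m/u values are made up to show how a
# strong negative match weight could be forced for the case of a
# special first name with a non-matching second name.
name_comparison = {
    "output_column_name": "Name",
    "comparison_levels": [
        {
            "sql_condition": (
                '"special_name_flag_l" AND "special_name_flag_r" '
                'AND "Middle_Name_l" <> "Middle_Name_r"'
            ),
            "label_for_charts": "Special first name, middle name mismatch",
            # m << u  =>  Bayes factor m/u << 1  =>  strong negative
            # match weight, since the weight is log2(m/u).
            "m_probability": 0.001,
            "u_probability": 0.5,
        },
        {
            "sql_condition": '"First_Name_l" = "First_Name_r"',
            "label_for_charts": "Exact match on first name",
        },
        {
            "sql_condition": "ELSE",
            "label_for_charts": "All other comparisons",
        },
    ],
}
```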
In a related consideration to the twins problem mentioned in #2023, I have been encountering an issue related to twins (or triplets and quadruplets in my case) in communities within a population who utilize middle names to represent the individual.
This has been causing me some issues for two reasons:
This has resulted in difficulty separating siblings who share DoBs in these situations, resulting in a bias towards false positives for these communities.
The problem
Consider the following link-only problem, assuming that the middle name is used to differentiate the twins:

Dataset A:
Dataset B:
In this case the null level comparison
"Middle_Name_l" IS NULL OR "Middle_Name_r" IS NULL
will result in a prediction producing the same final match weight for linking "John Alan Doe"<->"John Doe" and "John Bob Doe"<->"John Doe", as it skips comparing the middle name field. This results in a prediction table where we have a correct strong link for A[0]<->B[0], weaker equal-weight links for A[0]<->B[1] and A[1]<->B[1], and a lower link for A[1]<->B[0].
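To make the mechanics concrete, here is a tiny self-contained sketch of why a skipped null level produces identical totals for the two pairs. The Bayes factors are made-up illustrative numbers, not values from any real model:

```python
import math

def match_weight(bayes_factors):
    # The final match weight is the sum of log2 Bayes factors
    # across the comparison fields.
    return sum(math.log2(bf) for bf in bayes_factors)

# Made-up Bayes factors: first name match = 20, surname match = 30.
# The middle name comparison hits the null level, which contributes a
# Bayes factor of 1 (match weight 0) because it is skipped.
w_alan_vs_john = match_weight([20, 1, 30])  # "John Alan Doe" <-> "John Doe"
w_bob_vs_john = match_weight([20, 1, 30])   # "John Bob Doe"  <-> "John Doe"

assert w_alan_vs_john == w_bob_vs_john  # middle name carried no signal
```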
Depending on the threshold selected, clustering either returns:
or, with a higher threshold:
The first erroneously brings the A[1] record into the cluster, while the higher threshold over-cautiously creates 3 clusters. The high threshold is likely to impact non-twins/triplets etc. in the wider dataset, and so tuning (as ever) is tricky.
How do I deal with this at the moment?
I currently train 3 models; the first two make use of the dataset-specific Person_IDs to get really good deduplication links. The third model then makes the above link-only links. This third model has a custom comparison which changes the null level's logical OR to an AND, and also introduces a level that penalizes the case where only one of the records contains a NULL:
"Middle_Name_l" IS NULL AND "Middle_Name_r" IS NULL
("Middle_Name_l" IS NULL AND "Middle_Name_r" IS NOT NULL) OR ("Middle_Name_l" IS NOT NULL AND "Middle_Name_r" IS NULL)
"Middle_Name_l" = "Middle_Name_r"
ELSE
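Assembled into Splink 3's dict-settings format, the four levels above might look roughly like this (the labels and the placement of 'is_null_level': True on the both-null level are my assumptions about the setup described, not copied from the actual model):

```python
# Sketch of the custom Middle_Name comparison: both-null acts as the
# null level, while one-side-null is estimated as its own level and
# in practice learns a penalty. Labels are assumed, not the original.
custom_middle_name_comparison = {
    "output_column_name": "Middle_Name",
    "comparison_levels": [
        {
            # Level 1: both sides null -> treated as the null level.
            "sql_condition": '"Middle_Name_l" IS NULL AND "Middle_Name_r" IS NULL',
            "label_for_charts": "Both null",
            "is_null_level": True,
        },
        {
            # Level 2: exactly one side null -> scored like any other
            # level, allowing a penalty to be learned.
            "sql_condition": (
                '("Middle_Name_l" IS NULL AND "Middle_Name_r" IS NOT NULL) '
                'OR ("Middle_Name_l" IS NOT NULL AND "Middle_Name_r" IS NULL)'
            ),
            "label_for_charts": "One side null",
        },
        {
            "sql_condition": '"Middle_Name_l" = "Middle_Name_r"',
            "label_for_charts": "Exact match",
        },
        {
            "sql_condition": "ELSE",
            "label_for_charts": "All other comparisons",
        },
    ],
}
```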
Comparison level 2 applies a penalty in the situation where one record has a middle name but the other does not. I can then use this model to obtain my link-only predictions, and then, combined with my two dedupe-only models, I can get decent clustering of these situations.

Thoughts?
I understand that the NULL problem is an inherently tricky one. I'm wondering if there are other ways to approach this. At the moment I'm fortunate that each respective dataset has decent internal personal identifiers so the 3-model approach helps, but this does seem like an issue that would overly impact certain communities even more than the usual twin problem.
I'm aware of clustering methods in other fields that may go beyond simple connected components, but I'm not sure how they would be applied here.
I'm also not too sure how the edits to the null level logic may affect the underlying behavior of the model itself, especially in edge cases I may not have thought about.