
[FEAT] (Experimental) Add EM-based term-specific parameter estimation for frequency adjustment when TF tables are not available #2035

Open
samkodes opened this issue Mar 6, 2024 · 0 comments
Labels
enhancement New feature or request


These are notes for a blue sky idea that may work somewhere down the line. Low priority!

Is your proposal related to a problem?

Term-frequency adjustment is a way to express the intuition that a match on a rare term provides more evidence than a match on a common term. Splink implements TF adjustment by setting the u-probability for an exact match with a particular value x equal to the term frequency of x. Since u-probabilities are not updated in Splink's version of the EM algorithm, these term-specific u-probabilities can be estimated before EM and used throughout the EM run. The default behaviour is to estimate term frequencies for all values by using empirical frequencies observed in the single data set in a dedupe setting, or in the data set resulting from stacking both data sets if there are two in a link or link-and-dedupe setting. Splink also allows a user to specify a pre-calculated TF table for each column.
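To fix ideas, here is a minimal sketch of the mechanism described above (this is not Splink's actual API; the function name is hypothetical). Setting the u-probability for an exact match on value x to x's term frequency means the Bayes factor m/u, and hence the match weight, is larger for rarer values:

```python
import math

def tf_adjusted_match_weight(m_prob: float, term_freq: float) -> float:
    """Log2 Bayes factor for an exact match on a value whose
    u-probability is taken to be its term frequency."""
    return math.log2(m_prob / term_freq)

# A match on a rare first name (frequency 0.001) carries more evidence
# than a match on a common one (frequency 0.05), for the same m-probability.
rare_weight = tf_adjusted_match_weight(0.95, 0.001)
common_weight = tf_adjusted_match_weight(0.95, 0.05)
```

This is exactly why a duplicated individual with a rare name is a problem: the duplicates inflate the empirical term frequency, shrinking the match weight for the value that should carry the most evidence.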

The empirical approach to calculating TFs may be inaccurate when any of the data sets involved contains duplicated individuals; this is one of the most common scenarios in which Splink is used. The failure arises because there is no a priori way to distinguish between terms that are repeated because individuals are duplicated and terms that are repeated because, e.g., many individuals share the same first name. If an individual with a rare first name is duplicated many times in either data set, that name will appear to be very common.

On the other hand, supplying pre-calculated TF tables works mainly when one of the data sets is a single "master list" known to contain no duplicated individuals. If that is not the case, external sources of TF distributions may be used, but these are not always available, and may inappropriately assume that the individuals in the data sets are sampled uniformly from a larger population.

Describe the solution you'd like

Theoretically, it should be possible to estimate parameters that play the role of TFs as part of the EM algorithm. This idea is inspired by Xu H, Li X, Grannis S. A simple two-step procedure using the Fellegi-Sunter model for frequency-based record linkage. J Appl Stat. 2021 May 4;49(11):2789-2804. doi: 10.1080/02664763.2021.1922615. PMID: 35909667; PMCID: PMC9336505. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9336505/

Xu et al. describe a post-fitting match weight adjustment whose key component, for an exact field match on value x, is the ratio p(x | exact field match, truly matching pair) / p(x | exact field match, truly non-matching pair). This ratio expresses the relative frequency of the value among exactly-matched truly matching pairs and truly non-matching pairs. Xu et al. estimate these probabilities within blocks, weighting by the final match probabilities. However, the derivation shows that this estimation could instead be done within the EM algorithm, using the match probabilities from the previous iteration. This would entail estimating two extra parameters for each exactly-matched field value within each EM iteration.

The authors also suggest reducing the range of the resulting match weight adjustments by pooling adjustments for values with similar estimated frequencies, e.g. by quantile into 5-10 bins. A similar pooling approach could be implemented inside the EM and may help convergence: each parameter would still have to be assigned to a quantile by estimating its frequency, but it would contribute only its quantile's pooled estimate to the next round of the EM.
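A rough sketch of what one such per-iteration update might look like, purely to fix ideas (all names are hypothetical and this is not proposed implementation code). Each exactly-matched shared value accumulates weight from the previous iteration's match probabilities, the per-value ratios are formed, and then ratios are pooled by frequency quantile:

```python
from collections import defaultdict

import numpy as np

def term_ratio_update(shared_values, match_probs, n_bins=5):
    """One EM-style update of term-specific adjustment ratios.

    shared_values: the shared value x for each exactly-matched pair.
    match_probs: match probability of each pair from the previous iteration.
    Returns {x: pooled estimate of
             p(x | exact match, match) / p(x | exact match, non-match)}.
    """
    w_match = defaultdict(float)
    w_non = defaultdict(float)
    for x, p in zip(shared_values, match_probs):
        w_match[x] += p          # weighted count among "matching" pairs
        w_non[x] += 1.0 - p      # weighted count among "non-matching" pairs
    total_m = sum(w_match.values())
    total_u = sum(w_non.values())
    raw = {x: (w_match[x] / total_m) / max(w_non[x] / total_u, 1e-12)
           for x in w_match}
    # Pool ratios by quantile of estimated value frequency, as Xu et al.
    # suggest, to reduce the range of adjustments and stabilise the EM.
    freqs = {x: w_match[x] + w_non[x] for x in raw}
    ordered = sorted(raw, key=freqs.get)
    pooled = {}
    for bin_values in np.array_split(ordered, n_bins):
        if len(bin_values) == 0:
            continue
        bin_mean = float(np.mean([raw[x] for x in bin_values]))
        for x in bin_values:
            pooled[x] = bin_mean
    return pooled
```

In a full implementation these ratios would feed back into the pairwise match probabilities for the next iteration, alongside the usual m-probability updates.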

Note that, like the current approach to TF adjustments, the adjustment parameters estimated could be presented as match weight adjustments at the end of the EM process in a similar waterfall-style chart.

In addition to the large number of parameters that need to be estimated, the main caveat is that the resulting parameter estimates are not TF estimates, but rather estimates of relative frequency of terms among truly matching and truly non-matching pairs. In other words, this may be expected to increase the accuracy of models mostly by increasing the number of parameters in the linkage model in a term-specific way, but the interpretation of these parameters is not as intuitive.

Another caveat is that estimates will only be available for terms encountered in one of the EM training blocks (and will have to be combined across blocks, perhaps similarly to Splink's treatment of m-probabilities). Missing estimates would need to be handled somehow (maybe just by applying no adjustment?).

Describe alternatives you've considered

Xu et al.'s approach could be reimplemented directly as a post-fitting adjustment (i.e. applied after the EM algorithm). This would be a simpler approach that may still produce useful results. However, one strength of Splink's current approach is that the TF adjustment can inform the EM process itself, and it would be worth seeing whether the Xu approach performs comparably on a level playing field.
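The post-fitting variant would amount to something like the following (hypothetical names, a sketch only): shift each pair's final match weight by log2 of its term-specific ratio. Note that a ratio of 1.0, e.g. for a term never seen in training, leaves the weight unchanged, which also suggests one answer to the missing-estimate question above:

```python
import math

def adjusted_match_weight(base_weight: float, term_ratio: float) -> float:
    """Post-fitting adjustment in the spirit of Xu et al.: add log2 of
    p(x | exact match, match) / p(x | exact match, non-match) to the
    match weight produced by the already-fitted model."""
    return base_weight + math.log2(term_ratio)
```

As with the current TF adjustments, these deltas could be displayed as a separate bar in a waterfall-style chart.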

Additional context
