
[FEAT] Semi-deterministic or semi-supervised matching by incorporating deterministic match rules or labelling into EM (help to guide EM) #2030

samkodes opened this issue Mar 5, 2024 · 4 comments

samkodes commented Mar 5, 2024

Is your proposal related to a problem?

There are many circumstances in which we have some prior knowledge about matches in or between datasets. It is difficult to express this knowledge in the current formalism and to have that knowledge used by the EM algorithm to estimate appropriate m-probabilities and overall match probabilities. In general this knowledge will apply to only a subset of record pairs, and we may need to rely on EM to estimate match probabilities for all other record pairs. When this knowledge is described in simple deterministic match rules, we can think of this as "semi-deterministic" matching. We may also obtain information after a clerical review that we wish to use in a systematic way to guide revisions to the model; this is the more traditional "semi-supervised" approach. For this proposal "semi-deterministic" and "semi-supervised" differ only in terms of how this extra knowledge is specified, but in both cases we can incorporate the extra knowledge into the EM estimation in the same way.

For background on a semi-supervised approach focused on "active learning" informing this proposal, see Enamorado, Ted, Active Learning for Probabilistic Record Linkage (September 20, 2018). Available at SSRN: https://ssrn.com/abstract=3257638 or http://dx.doi.org/10.2139/ssrn.3257638. Henceforth referred to as Enamorado (2018).

e.g. 1: Some subset of records has a non-null unique ID number (e.g. health card, driver's license, corrections ID) that can be used to unambiguously rule matches in and/or out. EM will rarely achieve sufficiently strong positive and negative weights on this kind of field to overwhelm all other fields. The known matches may also helpfully inform m-probability estimation. We may want to force these pair probabilities to 1 (rule in on ID match) or 0 (rule out on ID mismatch).

e.g. 2: Some records were generated by exploding array-valued fields, so are known to be related to the same person. This might include name aliases (e.g. Bob, Rob, Bobby) or other systematic inexact matches (e.g. changing addresses within a city) which should inform fuzzy-match m-probabilities. We may wish to force these pair probabilities to 1.

e.g. 3: Some match patterns are known to be highly implausible for matches (e.g. entirely different first and last names). However, other fields will sometimes dominate and models will produce high match probabilities for these patterns, which can feed back into misleading m-probability estimates as EM proceeds. We may wish to force these pair probabilities to some low non-zero value (or 0). (Alternatively, we may have high confidence that some match patterns are extremely likely matches, and may want to ensure they are treated as such.)

e.g. 4: (The situation considered by Enamorado (2018)). We have manually labelled a small sample as matches or non-matches, perhaps as part of a clerical review process. We wish to rerun our model and use this small sample to inform our model fitting.

Describe the solution you'd like

One general approach, suited to the "semi-deterministic" case and based on a user-defined query, is as follows.

(Note that a different design might be preferred for the semi-supervised case involving clerical review, for example if active-learning-style sample selection or a GUI is implemented down the road. In that case a user-defined query would be clumsy, and some other way of specifying a labelled subset of data would have to be implemented, but it could work alongside this approach using the same modifications to the EM code. Conceptually, a labelled subset of data could consist of a user-supplied set of record id pairs, each with a label status and an omega weight (see below), which is used to label part of the post-blocking pair table, or possibly to add records to each block if needed, before EM is run.)

After blocking, but before EM, run a user-specified query that generates two columns in the pairs table. First, a "supervised_match_probability" column represents a known probability of match for a pair. This column can take any value between 0 and 1 inclusive, or can be left null (to indicate no label). Most often it will hold the values 0, 1, or null. Second, a "supervised_omega" column represents the weight given to a labelled pair relative to the unlabelled pairs. This is necessary to increase the impact that the small number of labelled pairs has on the overall model-fitting process. The query defining these two columns should be able to use any fields available at this point in the pipeline, including field values and gamma values for different comparisons.
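To make this concrete, here is a sketch of such a labelling query for e.g. 1 above (everything here is illustrative - the table name blocked_pairs, the id field, and the weight of 100 are hypothetical, not an existing Splink API):

```python
# Sketch only: a hypothetical labelling query implementing e.g. 1
# (shared unique ID). "blocked_pairs" and the column names are
# assumptions for illustration, not part of Splink.
labelling_sql = """
SELECT
    *,
    CASE
        WHEN id_l IS NOT NULL AND id_l = id_r THEN 1.0   -- rule in on ID match
        WHEN id_l IS NOT NULL AND id_r IS NOT NULL
             AND id_l <> id_r THEN 0.0                   -- rule out on ID mismatch
        ELSE NULL                                        -- no label; left to EM
    END AS supervised_match_probability,
    CASE
        WHEN id_l IS NOT NULL AND id_r IS NOT NULL THEN 100.0  -- up-weight labelled pairs
        ELSE NULL
    END AS supervised_omega
FROM blocked_pairs
"""
```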

During EM iterations, for any pair with non-null "supervised_match_probability", the algorithm always uses this probability instead of the calculated (iterative) match probability. Specifically, the "supervised_match_probability" is used each iteration when estimating m-probabilities. "supervised_match_probability" values do not change during EM. The "supervised_omega" value is used to weight terms in the M-step according to Appendix A.1.2 formula (15) in Enamorado (2018) (though their convention is to weight the non-labelled pairs, I think it is better to weight the labelled pairs since we may be incorporating several different groups of labelled pairs for which we may want different weights).
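To sketch how this would enter the m-probability updates (my notation, in the spirit of Enamorado's formula (15) but with the weights moved onto the labelled pairs as described above): let $U$ be the unlabelled pairs with current E-step match probabilities $p_i$, and $L$ the labelled pairs with fixed $\tilde{p}_j$ (supervised_match_probability) and weights $\omega_j$ (supervised_omega). Then for comparison $c$ and agreement level $k$:

$$
\hat{m}_{c,k} = \frac{\sum_{i \in U} p_i\, \mathbf{1}[\gamma_{c,i}=k] \; + \; \sum_{j \in L} \omega_j\, \tilde{p}_j\, \mathbf{1}[\gamma_{c,j}=k]}{\sum_{i \in U} p_i \; + \; \sum_{j \in L} \omega_j\, \tilde{p}_j}
$$

Setting every $\omega_j = 1$ recovers the standard unweighted M-step.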

When making predictions, the same query should be applied to generate a "supervised_match_probability" output column; it is TBD whether this should overwrite the model prediction column (it would probably be nice to have both). There is no need to include a "supervised_omega" column in the prediction output, since it is not needed at prediction time.

Describe alternatives you've considered

Another way to guide EM is pre-specifying certain m-probabilities and keeping them fixed. This is more demanding for the analyst, particularly when there are many levels, and may require more specific knowledge of the data-generating process. It also does not directly control the match probability for any pair, which is arguably of more interest.

EM can also sometimes be guided by careful choice of the blocking rules used for training, but the results can be very hard to predict.

Known matches or non-matches can be ruled in or out after modelling and predicting, but this means that this information can't be used to inform the EM process and refine the model.

Additional context

The Enamorado (2018) approach builds on prior approaches to semi-supervised Fellegi-Sunter estimation, so it has some precedent on its side.

One might speculate that doing this could harm EM convergence, though Enamorado (2018) suggests otherwise (note that Splink does not run full EM, unlike the Enamorado paper, which also updates the prior and the u-probabilities).

samkodes added the enhancement label on Mar 5, 2024

RobinL commented Mar 6, 2024

Thanks - this all makes sense

From a practical/implementation standpoint it occurred to me that it might be reasonably straightforward to implement this by tweaking existing routines.

I think you might get the right behaviour if we allow individual m and u values to be fixed during EM training. (At the moment it's possible to fix all u values across all comparisons and comparison levels, but not individual u values.)

Specifically, if it were possible to fix very strong (effectively infinite and negative-infinite) match weights on the ID column, then those weights would overwhelm all other match weights, and the pairs would end up with probabilities of 1.0 or 0.0 for any pairwise comparison where the ID column was present.

Any comparison that lacked the ID column (null on at least one side) would get a 0 match weight (no effect), and so its probability would be based on all the other columns.

We would then need additional code to allow the 1.0 and 0.0 pairs to be weighted more highly - but that could probably be achieved by adding an omega multiplier to the sum(match_probability) terms in the maximisation part of EM.
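As a quick numeric illustration of why a fixed, effectively-infinite match weight forces the probability (the numbers are made up; this is not Splink code):

```python
# Illustration: a huge fixed Bayes factor on an ID column overwhelms
# strong contrary evidence from every other comparison.
prior_odds = 1 / 1000        # assumed prior odds of a match
other_bayes_factors = 0.01   # other columns jointly say "non-match"
id_bayes_factor = 2 ** 40    # fixed match weight of 40 on the ID column

odds = prior_odds * other_bayes_factors * id_bayes_factor
probability = odds / (1 + odds)
print(probability)  # ~0.9999999 -> effectively forced to 1.0
```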

As we talked about, this probably isn't going to be something we'll have time to look at any time soon. If at any point it's something you urgently need, let me know. For any development work that's more than a minor patch, I'd want to put it on the Splink 4 branch, which is where we're trying to direct all significant new work (prereleases are starting to become available, but docs are currently lacking).

samkodes commented Mar 16, 2024

FYI, I'm starting to test an implementation of the full semi-supervised approach that we think will help our work. Because we're not ready to move to the dev version of Splink 4, I'm working on a Splink 3 fork that I'll share for unofficial comment / interest after some more testing, probably some time next week. I'll be happy to work on an official Splink 4 PR for it if the Splink 3 version proves stable and useful.

RobinL commented Mar 19, 2024

Thanks @samkodes - sounds promising!

samkodes commented

I have an initial version working in this branch: https://github.com/samkodes/splink_tf_inexact/tree/semi_supervised_training

I decided that it made the most sense to tie the semi-supervised specification to the individual EM training block, since the impact of labelled cases (and hence the weight required) really depends on the block size and content. Therefore I didn't change anything in the settings object or overall linker configuration. This also means that prediction for output right now is agnostic about these features (though prediction for training needed some small tweaks), so it is up to the user to decide how they want to incorporate labels into their final output and to do so manually. All this version does is allow labelled data to influence the training process.

From a user-interface perspective, the main changes are in linker.estimate_parameters_using_expectation_maximisation, which accepts three new parameters:

    semi_supervised_rules (list, optional): If set, splink will run in
        semi-supervised mode, applying the rules to the blocked pairs to label
        and weight these labelled data when running EM. Each element of the
        list should be a dictionary containing three keys:
            'sql' - SQL condition
            'match_probability' - number between 0 and 1 inclusive
            'omega' - weight
        The rules will be tested and applied to the blocked pairs, in the
        sequence provided, via a case-when statement.
        Rules may refer to gamma values or field names (in name_l/name_r format).
        May be used in tandem with semi_supervised_table.
    semi_supervised_table (optional): If set, splink will run in
        semi-supervised mode, using the provided table as labelled cases.
        The table should be structured like the table provided to
        estimate_m_from_pairwise_labels, with the addition of a column called
        'match_probability' indicating match probability and a column 'omega'
        indicating case weight, i.e.
        source_dataset_l, unique_id_l, source_dataset_r, unique_id_r, match_probability, omega
        By default, this table is used to label cases generated by the
        blocking rule. If you want this table to be used to extend the pairs
        generated by the blocking rule, set extend_block_by_ss_table=True.
        May be used in tandem with semi_supervised_rules.
    extend_block_by_ss_table (bool, optional): If set, uses semi_supervised_table
        to extend the pairs generated by the blocking rule. Use with caution,
        as extending a block may bias the estimated parameters.
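A usage sketch based on the docstring above (the blocking rule and field names are illustrative; this targets the linked branch, not released Splink):

```python
# Hypothetical example: label pairs via deterministic ID rules while
# training m-probabilities with EM on a surname block.
rules = [
    {   # rule in on ID match, weighted 100x relative to unlabelled pairs
        "sql": "id_l IS NOT NULL AND id_l = id_r",
        "match_probability": 1.0,
        "omega": 100,
    },
    {   # rule out on clear ID mismatch
        "sql": "id_l IS NOT NULL AND id_r IS NOT NULL AND id_l <> id_r",
        "match_probability": 0.0,
        "omega": 100,
    },
]

linker.estimate_parameters_using_expectation_maximisation(
    "l.surname = r.surname",   # blocking rule for this training session
    semi_supervised_rules=rules,
)
```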

These parameters are passed to the EMTrainingSession.__init__ method; this class and expectation_maximisation.py do most of the work, though there were some small parameter additions to helper functions in predict.py and a few other files to handle the semi-supervised case.

I also changed the behaviour of the comparisons_to_deactivate argument to linker.estimate_parameters_using_expectation_maximisation and EMTrainingSession.__init__ to allow the local estimation of parameters involved in blocking as described in issue #2067. Doing so actually helped with convergence issues in my test data, but was also important to allow semi-supervised rules to have access to all needed variables.

Some of the SQL and pipelining may be unnecessarily clumsy; I tried to reuse existing functions where they existed even if they may not have been optimal for this application. In particular, generating comparison vectors when extending the block by a provided semi-supervised table and applying the provided rules or table to the comparison vectors before EM could probably be done more elegantly. I'd be happy to receive suggestions about better ways to implement things.

I've also included an example synthetic dataset for my challenge application - many doctors working at multiple offices (many-to-many office-doctor relationship) - and an example notebook showing how the semi-supervised features improve the fit (precision-recall curves) dramatically. Those are in the new directory semi_supervised_examples.

When you have a chance, let me know if this looks appealing enough for inclusion in version 4 (after suitable adjustments), and if so I can work on a similar patch for that codebase.
