
Probability calibration #181

Open
nbgl opened this issue Jan 21, 2019 · 1 comment

nbgl commented Jan 21, 2019

Some people say it is important for anonlink to output similarity scores as part of a linkage solution. This makes sense since we are not equally certain about all matches. Outputting similarity scores lets the user decide which matches to keep.

However, I fear that they will be of limited utility. The problem with similarity scores is that they are just numbers with no inherent meaning. In particular, one cannot tell the probability of a pair being a match from the similarity score alone. All we know is that the higher the similarity score, the higher the probability of a match; we don’t know what the relationship actually is.

So let’s find a way to get the probability from a similarity score. This problem is called probability calibration. The usual characterisation is that you have a binary classifier that outputs a score. The higher this score, the higher your probability of being in class 1 (as opposed to class 0). You want to find the probability of being in class 1 based on this score.
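To make this concrete, here is a minimal sketch of score-to-probability calibration using isotonic regression (one standard method; the `scores`/`labels` arrays are toy data generated purely for illustration):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy data standing in for real linkage output: similarity scores for
# candidate pairs and ground-truth 0/1 match labels.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=1000)
labels = (rng.uniform(size=1000) < scores).astype(int)

# Isotonic regression learns a monotone map from score to P(match).
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(scores, labels)

print(calibrator.predict([0.2, 0.5, 0.9]))  # calibrated probabilities
```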

Here is a writeup on probability calibration: https://jmetzen.github.io/2015-04-14/calibration.html (thanks @chengsoonong).

It makes sense to apply standard probability calibration to privacy-preserving record linkage (PPRL). Big caveat: performing this calibration requires a labelled training set, which PPRL lacks by the very definition of the problem. So we can apply probability calibration to toy datasets where we know the labels, but we cannot apply the standard calibration methods in production.

Before I touched any data, I wrote this in a private Slack chat:

I hypothesise that the mapping from the Jaccard index to the probability of a match will have the same functional form for most datasets. I further hypothesise that once we know this functional form the mapping will be easy to approximate from the threshold.

If the functional form is indeed (approximately) the same for most datasets, and if the parameters of this function can be approximated from the distribution of the similarity scores (in the same way that the threshold can be found without knowing labels, see #142), then we can find the mapping from the similarity score to probability without knowing labels.

The first assumption (same functional form) seems to hold. I ran isotonic regression-based probability calibration on three datasets (NCVR and two versions of my own dataset generated with GeCo, clean and noisy). The result of the isotonic regression looked much like a sigmoid, so I fitted a sigmoid to it.
[Plots: fitted calibration curves for ncvr, geco-clean, and geco-noisy]

(NB: Platt scaling is another probability calibration technique, which fits a sigmoid directly to the data. It does not work here because of the massive class imbalance. Fitting a sigmoid to the isotonic regression works a lot better.)
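For contrast, here is a minimal sketch of Platt scaling under the same toy setup as above: fit a logistic (sigmoid) model directly on the (score, label) pairs. With a massive class imbalance the maximum-likelihood fit is dominated by the non-matches, which is why it does poorly here:

```python
from sklearn.linear_model import LogisticRegression

# Platt scaling: logistic regression on the raw scores.
platt = LogisticRegression()
platt.fit(scores.reshape(-1, 1), labels)
print(platt.predict_proba([[0.2], [0.5], [0.9]])[:, 1])
```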

The above plots show a fitted function f(s), where s is the similarity. We have f(s) = h(a(s − b)), where h is the logistic sigmoid function and a and b are parameters. b corresponds to the linkage threshold: it is the similarity at which a pair is equally likely to be a match as a non-match. a varies with the separation of the matches and non-matches: if they are well separated by similarity, then a will be big and f will look more like a step function. Both these parameters should be possible to approximate from the distribution of similarities.
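As a sketch of the fitting step (curve_fit and the starting values are my choices, not necessarily what was used for the plots; `calibrator` is the isotonic regression from the earlier sketch):

```python
import numpy as np
from scipy.optimize import curve_fit

def f(s, a, b):
    # h is the logistic sigmoid; b is the similarity at which f(b) = 0.5.
    return 1.0 / (1.0 + np.exp(-a * (s - b)))

# Sample the isotonic fit on a grid and fit the sigmoid to it.
grid = np.linspace(0.0, 1.0, 200)
iso_probs = calibrator.predict(grid)
(a, b), _ = curve_fit(f, grid, iso_probs, p0=[10.0, 0.5])
print(f"a = {a:.1f} (steepness), b = {b:.2f} (threshold similarity)")
```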

Future work:

  1. Try this with more datasets to see if the functional form is the same.
  2. Work out a way to estimate a and b without labels (one possible direction is sketched below).
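
Purely as a sketch of one possible direction for (2), and not necessarily the approach taken in #142: model the unlabelled score distribution as a two-component mixture and take the point where the components are equally responsible as a threshold-like estimate. The mixture model and the toy data below are assumptions for illustration only:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical unlabelled scores: many non-matches, few matches.
rng = np.random.default_rng(0)
sims = np.concatenate([rng.normal(0.3, 0.05, 9500),
                       rng.normal(0.85, 0.05, 500)])

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(sims.reshape(-1, 1))

# Find the score at which the two components are equally responsible.
grid = np.linspace(0.0, 1.0, 1000).reshape(-1, 1)
resp = gmm.predict_proba(grid)
threshold = grid[np.argmin(np.abs(resp[:, 0] - resp[:, 1]))][0]
print(f"estimated threshold ~ {threshold:.2f}")
```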

@hardbyte @unzvfu @wilko77 Thoughts?


nbgl commented Jan 22, 2019

I’ve tried it with FEBRL4 as well.

[Plots: fitted calibration curves for febrl-good and febrl-bad]

Are there any other interesting datasets?
