
Probability calibration #181

Open
nbgl opened this issue Jan 21, 2019 · 1 comment

nbgl commented Jan 21, 2019

Some people say it is important for anonlink to output similarity scores as part of a linkage solution. This makes sense since we are not equally certain about all matches. Outputting similarity scores lets the user decide which matches to keep.

However, I fear that they will be of limited utility. The problem with similarity scores is that they are just numbers with no inherent meaning. In particular, one cannot tell the probability of a pair being a match from the similarity score alone. All we know is that the higher the similarity score, the higher the probability of a match; we don’t know what the relationship actually is.

So let’s find a way to get the probability from a similarity score. This problem is called probability calibration. The usual characterisation is that you have a binary classifier that outputs a score. The higher this score, the higher your probability of being in class 1 (as opposed to class 0). You want to find the probability of being in class 1 based on this score.
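To make this concrete, here is a minimal sketch of score-to-probability calibration using isotonic regression (one standard method; the `scores`/`labels` arrays are toy data generated purely for illustration):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy data standing in for real linkage output: similarity scores for
# candidate pairs and ground-truth 0/1 match labels.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=1000)
labels = (rng.uniform(size=1000) < scores).astype(int)

# Isotonic regression learns a monotone map from score to P(match).
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(scores, labels)

print(calibrator.predict([0.2, 0.5, 0.9]))  # calibrated probabilities
```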

Here is a writeup on probability calibration: https://jmetzen.github.io/2015-04-14/calibration.html (thanks @chengsoonong).

It makes sense to apply standard probability calibration to privacy-preserving record linkage (PPRL). Big caveat: performing this calibration requires a labelled training set, which PPRL lacks by the very definition of the problem. So we can apply probability calibration to toy datasets where we know the labels, but we cannot apply the standard calibration methods in production.

Before I touched any data, I wrote this in a private Slack chat:

I hypothesise that the mapping from the Jaccard index to the probability of a match will have the same functional form for most datasets. I further hypothesise that once we know this functional form the mapping will be easy to approximate from the threshold.

If the functional form is indeed (approximately) the same for most datasets, and if the parameters of this function can be approximated from the distribution of the similarity scores (in the same way that the threshold can be found without knowing labels, see #142), then we can find the mapping from the similarity score to probability without knowing labels.

The first assumption (same functional form) seems to hold. I ran isotonic regression-based probability calibration on three datasets (NCVR and two versions of my own dataset generated with GeCo, clean and noisy). The result of the isotonic regression looked much like a sigmoid, so I fitted a sigmoid to it.
[Plots: fitted calibration curves for ncvr, geco-clean, and geco-noisy]

(NB: Platt scaling is another probability calibration technique, which fits a sigmoid directly to the data. It does not work here because of the massive class imbalance. Fitting a sigmoid to the isotonic regression works a lot better.)
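For contrast, here is a minimal sketch of Platt scaling under the same toy setup as above: fit a logistic (sigmoid) model directly on the (score, label) pairs. With a massive class imbalance the maximum-likelihood fit is dominated by the non-matches, which is why it does poorly here:

```python
from sklearn.linear_model import LogisticRegression

# Platt scaling: logistic regression on the raw scores.
platt = LogisticRegression()
platt.fit(scores.reshape(-1, 1), labels)
print(platt.predict_proba([[0.2], [0.5], [0.9]])[:, 1])
```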

The above plots show a fitted function f(s), where s is the similarity. We have f(s) = h(a(s − b)), where h is the logistic sigmoid function and a and b are parameters. b corresponds to the linkage threshold: it is the similarity at which a pair is equally likely to be a match as a non-match. a varies with the separation of the matches and non-matches: if they are well separated by similarity, then a will be big and f will look more like a step function. Both these parameters should be possible to approximate from the distribution of similarities.
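As a sketch of the fitting step (curve_fit and the starting values are my choices, not necessarily what was used for the plots; `calibrator` is the isotonic regression from the earlier sketch):

```python
import numpy as np
from scipy.optimize import curve_fit

def f(s, a, b):
    # h is the logistic sigmoid; b is the similarity at which f(b) = 0.5.
    return 1.0 / (1.0 + np.exp(-a * (s - b)))

# Sample the isotonic fit on a grid and fit the sigmoid to it.
grid = np.linspace(0.0, 1.0, 200)
iso_probs = calibrator.predict(grid)
(a, b), _ = curve_fit(f, grid, iso_probs, p0=[10.0, 0.5])
print(f"a = {a:.1f} (steepness), b = {b:.2f} (threshold similarity)")
```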

Future work:

  1. Try this with more datasets to see if the functional form is the same.
  2. Work out a way to estimate a and b without labels (one possible direction is sketched below).
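
Purely as a sketch of one possible direction for (2), and not necessarily the approach taken in #142: model the unlabelled score distribution as a two-component mixture and take the point where the components are equally responsible as a threshold-like estimate. The mixture model and the toy data below are assumptions for illustration only:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical unlabelled scores: many non-matches, few matches.
rng = np.random.default_rng(0)
sims = np.concatenate([rng.normal(0.3, 0.05, 9500),
                       rng.normal(0.85, 0.05, 500)])

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(sims.reshape(-1, 1))

# Find the score at which the two components are equally responsible.
grid = np.linspace(0.0, 1.0, 1000).reshape(-1, 1)
resp = gmm.predict_proba(grid)
threshold = grid[np.argmin(np.abs(resp[:, 0] - resp[:, 1]))][0]
print(f"estimated threshold ~ {threshold:.2f}")
```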

@hardbyte @unzvfu @wilko77 Thoughts?


nbgl commented Jan 22, 2019

I’ve tried it with FEBRL4 as well.

[Plots: fitted calibration curves for febrl-good and febrl-bad]

Are there any other interesting datasets?
