Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests that currently trick greedy algorithm #10

Open
hardbyte opened this issue May 30, 2017 · 3 comments
Open

Tests that currently trick greedy algorithm #10

hardbyte opened this issue May 30, 2017 · 3 comments
Assignees

Comments

@hardbyte
Copy link
Collaborator

hardbyte commented May 30, 2017

Our greedy algorithm currently fails matching the following graph, where the connection between a and 1 looks likely, but ultimately shouldn't be chosen.

4048684e-3882-11e6-9a81-105da6c927bd

The network methods should succeed, and for now the greedy algorithm should fail. Correct mapping is:
{a: 2, b: 3, c: 1}

A similar variant where the result will probably be different between the methods (same desired output):

17dad68e-3883-11e6-888d-647c02d5093d

This issue tracks these issues - they exist as tests marked as expectedFailure.

Aha! Link: https://csiro.aha.io/features/ANONLINK-81

@nbgl
Copy link
Contributor

nbgl commented Nov 7, 2018

This kind of situation would be very rare in practice. You would need records A, B, C such that dice(A, B) = dice(A, C), which is unlikely. And if this is happening, then your separation between matches and non-matches should be improved.

Hungarian algorithm-based solvers do perform marginally better under some conditions. However, they have complexity O(V²E), where V is the total number of records and E is the number of candidate pairs (the 2-party greedy solver is O(E)). They are also difficult to apply as they don’t operate on raw similarity scores, but on the probabilities of a match (the threshold represents a probability of ½). You can feed them the raw similarity scores, but this doesn’t have a good theoretical justification.

I’m not sure there is much point pursuing this. @hardbyte are you happy to close the issue?

@hardbyte
Copy link
Collaborator Author

hardbyte commented Nov 7, 2018

You would need records A, B, C such that dice(A, B) = dice(A, C), which is unlikely.

Are you sure about that? I think in my first example dice(C, 1) just needs to be between 0.8 and 0.9 for our greedy solver to fail. The second example is the tie breaker condition which I agree is very unlikely.

Happy if you want to remove the expectedFailure tests and close.

@nbgl
Copy link
Contributor

nbgl commented Nov 8, 2018

If we let, dice(C, 1) = .85 in the first example, then greedy produces the mapping {(A, 1), (B, 3)}. You claim (please correct me if I’m wrong) that the mapping {(A, 2), (B, 3), (C, 1)} is superior. This is because two links with similarities .8 and .85 should take precedence over one link with similarity .9.

But why is that? Should we be optimising for the highest sum of similarities? This doesn’t quite make sense because we would prefer similarities .8 + .85 over a pair with similarity 1, when a pair with similarity 1 should clearly always be accepted. Should we be going for the highest number of accepted pairs above the threshold? This fails for similar reasons.

So in the case that we let dice(C, 1) = .85, it is not clear that the matching you propose is actually superior. It may well be, but this needs better justification.

To really make this call, we’d have to go further than similarity scores, and actually derive the probability p(ismatch(X, Y)). We probably want dice(X, Y) = 0 ⟹ p(ismatch(X, Y)) = 0, dice(X, Y) = threshold ⟹ p(ismatch(X, Y)) = ½, and dice(X, Y) = 1 ⟹ p(ismatch(X, Y)) = 1. These probabilities are nontrivial to compute from raw similarity scores, although we could approximate.

I have a solver that will take these probabilities into account to come up with the theoretically most likely matching. However, this solver performs only marginally better than greedy in practice. (And that’s if you know the exact probabilities!)

@hardbyte hardbyte assigned nbgl and unassigned unzvfu Feb 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants