Expose similarity score in groups result #180

hardbyte · 2018-12-04T04:01:59Z

~~This feature is to output the solved mapping while also exposing their similarity scores.~~

This feature is to modify anonlink to support anonlink-entity-service (and library users) in calculating similarity scores between group members.

It may make sense to provide a high-level API that takes the group output type and recomputes the similarity scores between the members of each group.

This is also a good chance to refactor the high-level solver API.

hardbyte · 2019-01-08T23:11:27Z

So the greedy_solve function (and friends) don't return the similarity information.

From the docstrings:

    :return: An sequence of groups. Each group is an sequence of
        records. Two records are in the same group iff they represent
        the same entity. Here, a record is a two-tuple of dataset index
        and record index.

If we were only dealing with two datasets, we could simply include the similarity score between the records. However, in the multiparty case with merge_threshold etc., it isn't as simple.

Enter @nbgl who I think has thought about this already...?

hardbyte · 2019-09-24T00:52:58Z

We could output a list of scores for each group after the solver step.

wilko77 · 2019-09-24T05:09:09Z

One curve-ball with the multiparty greedy solver is that if the merge_threshold is smaller than 1, it might put two entities into the same group even though their pairwise similarity is under the threshold.
Thus, we would not necessarily have a complete list of similarities for each group.
Depending on the use-case this might be fine. (We could add some auxiliary string like "below_threshold", or re-compute the missing similarities.)
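A rough sketch of the "mark what is missing" option, purely for illustration. The function name and the shape of `candidate_scores` (a dict keyed by a frozenset of two records) are assumptions made for this example and are not part of anonlink's API. Pairs that never appeared as above-threshold candidates get `None`, which plays the role of the "below_threshold" marker.

```python
from itertools import combinations


def scores_from_candidates(groups, candidate_scores):
    """Attach the already-known candidate scores to each group.

    groups: sequence of groups; each group is a sequence of
        (dataset index, record index) tuples, as in the solver output.
    candidate_scores: hypothetical dict mapping frozenset({rec_a, rec_b})
        to the similarity found above the threshold.
    Returns one dict per group mapping (rec_a, rec_b) to a score or None.
    """
    return [
        {
            (a, b): candidate_scores.get(frozenset((a, b)))
            for a, b in combinations(group, 2)
        }
        for group in groups
    ]


# Example: three records grouped together, but only two pairs were candidates.
groups = [[(0, 0), (1, 0), (2, 0)]]
candidate_scores = {
    frozenset({(0, 0), (1, 0)}): 0.9,
    frozenset({(1, 0), (2, 0)}): 0.85,
}
partial = scores_from_candidates(groups, candidate_scores)
# partial[0][((0, 0), (2, 0))] is None: that pair was never above threshold.
```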

Do we also need to output which pair each of these similarity scores belongs to?

Just as an idea: since we can compute similarities quite cheaply, we could, instead of modifying the current solver, introduce a new step after solving that recomputes all the required similarities (with the threshold set to 0). This shouldn't take long, as the mappings are a small subset of the whole candidate space. It is also cleaner, and it doesn't introduce overhead into the solver when the scores aren't needed.
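A minimal sketch of such a post-solve step, under stated assumptions: records are bit vectors stored as plain Python ints, similarity is the Sørensen-Dice coefficient, and `group_similarities` is a hypothetical helper name rather than an existing anonlink function. Keying each score by its pair of records would also answer the question of which pair a score belongs to.

```python
from itertools import combinations


def dice(a: int, b: int) -> float:
    """Sørensen-Dice coefficient of two bit vectors stored as Python ints."""
    total = bin(a).count("1") + bin(b).count("1")
    if total == 0:
        return 0.0
    return 2 * bin(a & b).count("1") / total


def group_similarities(groups, datasets):
    """Recompute every pairwise similarity inside each group (no threshold).

    groups: sequence of groups; each group is a sequence of
        (dataset index, record index) tuples, as in the solver output.
    datasets: datasets[i][j] is the bit vector of record j in dataset i
        (an assumed representation for this sketch).
    Returns one dict per group mapping (rec_a, rec_b) to their similarity.
    """
    result = []
    for group in groups:
        scores = {}
        for (d0, r0), (d1, r1) in combinations(group, 2):
            scores[(d0, r0), (d1, r1)] = dice(datasets[d0][r0], datasets[d1][r1])
        result.append(scores)
    return result


# Example with three tiny datasets and one group spanning all of them.
datasets = [[0b1111, 0b0001], [0b1111], [0b0011]]
groups = [[(0, 0), (1, 0), (2, 0)]]
all_scores = group_similarities(groups, datasets)
```

The work here is quadratic in the group size only, so it stays small compared with the candidate generation that precedes the solver.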

hardbyte · 2019-10-07T23:40:17Z

TODO: make this about groups rather than mapping. Plan is to deprecate mapping output type.

hardbyte assigned nbgl Dec 4, 2018

hardbyte added the feature request label Jan 11, 2019

hardbyte added this to the Anonlink 0.12 release milestone Apr 18, 2019

hardbyte modified the milestones: Anonlink 0.12 release, Anonlink 0.13 Release Apr 29, 2019

hardbyte unassigned nbgl Aug 17, 2019

hardbyte changed the title ~~Expose score in mapping~~ Expose similarity score in groups result Oct 9, 2019
