Expose similarity score in groups result #180

Open
hardbyte opened this issue Dec 4, 2018 · 4 comments

@hardbyte
Collaborator

hardbyte commented Dec 4, 2018

This feature is to output the solved mapping while also exposing the similarity scores of the matched records.

This feature is to modify anonlink to support anonlink-entity-service (and library users) in calculating similarity scores between group members.

It may make sense to provide a high-level API that solves using the group output type and then recomputes the similarity scores between the members of each group.

This is a good chance to refactor the high-level solver API.

@hardbyte
Collaborator Author

hardbyte commented Jan 8, 2019

So the greedy_solve function (and friends) don't return the similarity information.

From the docstrings:

    :return: An sequence of groups. Each group is an sequence of
        records. Two records are in the same group iff they represent
        the same entity. Here, a record is a two-tuple of dataset index
        and record index.

If we were only dealing with two datasets, we could simply include the similarity score between the records. However, in the multiparty case with merge_threshold etc., it isn't as simple.
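To illustrate the two-dataset case, here is a minimal sketch of a greedy matcher that keeps the score of each accepted pair. It is not anonlink's greedy_solve: the function name `greedy_pairs_with_scores` and the `(similarity, record_a, record_b)` candidate layout are assumptions made for this example. Only the record shape, a two-tuple of dataset index and record index, comes from the docstring above.

```python
from typing import Iterable, List, Tuple

# A record is (dataset index, record index), as in the docstring above.
Record = Tuple[int, int]
# Hypothetical candidate layout: (similarity, record_a, record_b).
Candidate = Tuple[float, Record, Record]


def greedy_pairs_with_scores(
    candidates: Iterable[Candidate],
) -> List[Tuple[Record, Record, float]]:
    """Greedily accept the highest-scoring pairs and keep each score."""
    matched = set()
    result = []
    # Visit candidates from most to least similar.
    for score, rec_a, rec_b in sorted(candidates, key=lambda c: c[0], reverse=True):
        # Skip the pair if either record is already matched.
        if rec_a in matched or rec_b in matched:
            continue
        matched.update((rec_a, rec_b))
        result.append((rec_a, rec_b, score))
    return result


candidates = [
    (0.95, (0, 0), (1, 2)),
    (0.90, (0, 0), (1, 1)),
    (0.85, (0, 1), (1, 1)),
]
pairs = greedy_pairs_with_scores(candidates)
```

With two datasets every group has exactly one pair, so one score per group is enough. With three or more datasets a group contains several pairs, which is where the question of what to report comes from.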

Enter @nbgl who I think has thought about this already...?

@hardbyte
Collaborator Author

We could output a list of scores for each group after the solver step.

@wilko77
Collaborator

wilko77 commented Sep 24, 2019

One curveball with the multiparty greedy solver is that if the merge_threshold is smaller than 1, it might put two entities into the same group even though their pairwise similarity is under the threshold.
Thus, we would not necessarily have a complete list of similarities for each group.
Depending on the use case this might be fine. (We could add some auxiliary string like "below_threshold", or recompute the missing similarities.)

Do we also need to output which pair each of these similarity scores belongs to?

Just as an idea: since we can compute similarities quite cheaply, we could, instead of modifying the current solver, introduce a new step after solving that computes all the required similarities again (with the threshold set to 0). This shouldn't take long, as the mappings are a small subset of the whole candidate space. It is also cleaner, and it doesn't introduce overhead into the solver when the scores aren't needed.
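A rough sketch of what such a post-solve step could look like follows. It assumes the Sørensen-Dice coefficient as the similarity measure and stores encodings as plain Python ints used as bit vectors. The names `dice`, `score_groups`, and the `encodings[dataset][record]` layout are hypothetical and not part of anonlink's API. Each score is keyed by its pair of records and carries a flag saying whether it reaches the threshold, which also covers the question of which pair a score belongs to.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

# A record is (dataset index, record index), as in the solver docstring.
Record = Tuple[int, int]
# For each pair: (similarity, whether it reaches the threshold).
PairScores = Dict[Tuple[Record, Record], Tuple[float, bool]]


def dice(a: int, b: int) -> float:
    """Sørensen-Dice coefficient of two bit vectors stored as Python ints."""
    total = bin(a).count("1") + bin(b).count("1")
    if total == 0:
        return 0.0
    return 2 * bin(a & b).count("1") / total


def score_groups(
    groups: Sequence[Sequence[Record]],
    encodings: Sequence[Sequence[int]],  # hypothetical: encodings[dataset][record]
    threshold: float,
) -> List[PairScores]:
    """Recompute every pairwise similarity inside each solved group."""
    scored = []
    for group in groups:
        pair_scores: PairScores = {}
        # No threshold is applied here: every pair in the group gets a score.
        for rec_a, rec_b in combinations(group, 2):
            sim = dice(encodings[rec_a[0]][rec_a[1]], encodings[rec_b[0]][rec_b[1]])
            pair_scores[(rec_a, rec_b)] = (sim, sim >= threshold)
        scored.append(pair_scores)
    return scored


# Three datasets with one record each, solved into a single group.
encodings = [[0b1111], [0b1111], [0b0011]]
groups = [[(0, 0), (1, 0), (2, 0)]]
scored = score_groups(groups, encodings, threshold=0.8)
```

In this toy input the third record sits in the group although both of its pairwise similarities are below 0.8, so those two pairs come back flagged as below the threshold instead of being missing from the output.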

@hardbyte
Collaborator Author

hardbyte commented Oct 7, 2019

TODO: make this about groups rather than mapping. Plan is to deprecate mapping output type.

@hardbyte hardbyte changed the title Expose score in mapping Expose similarity score in groups result Oct 9, 2019