BUG: Inconsistent xref scoring methods #3333

tillprochaska · 2023-09-26T09:59:58Z

Describe the bug
The "Similar" tab for individual entities and collection-wide xref use different scoring methods.

To Reproduce
Steps to reproduce the behavior:

1. Xref a collection and pick any pair of matching entities from the xref results.
2. Click on the first entity of that pair to view the entity details.
3. Open the "Similar" tab.
4. Try to find the other entity.
5. Most likely, the score in the "Similar" tab will be different from the score in the collection-wide xref results.

Expected behavior
Both the "Similar" tab and collection-wide xref should use the same scoring method.

Aleph version
Latest

Additional context

This is probably the case for historical reasons. Aleph used to use the scoring methods from the followthemoney.compare module, which computes similarity scores by property type and then reduces these scores to a single score by weighting property types (e.g. identifier properties might carry more weight because they are more specific than other property types).
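To make the "score per property type, then reduce by weight" idea concrete, here is a minimal sketch. It is not the followthemoney.compare implementation: the function name `weighted_score`, the `WEIGHTS` table, and the weight values are all hypothetical and only illustrate the general shape of a weighted reduction.

```python
from typing import Dict

# Hypothetical weights per property type. These are NOT the values used by
# followthemoney.compare; they only illustrate that identifiers could count more.
WEIGHTS: Dict[str, float] = {"name": 0.6, "identifier": 0.9, "country": 0.1}


def weighted_score(
    type_scores: Dict[str, float],
    weights: Dict[str, float] = WEIGHTS,
) -> float:
    """Reduce per-type similarity scores (each in 0..1) to a single weighted mean."""
    total = 0.0
    weight_sum = 0.0
    for prop_type, score in type_scores.items():
        # Property types without a configured weight are ignored in this sketch.
        weight = weights.get(prop_type, 0.0)
        total += weight * score
        weight_sum += weight
    # Avoid dividing by zero when no weighted property types were compared.
    return total / weight_sum if weight_sum else 0.0
```

With these assumed weights, a pair that matches only on name and a pair that matches only on identifier get different scores, which is the behavior the weighting is meant to produce.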
Nowadays, Aleph still uses this method to compute the scores in the "Similar" tab. However, for collection-wide xref, Aleph now uses a machine-learning model to infer a similarity score.
I’m not aware of anything blocking us from using the "new" model to compute the similarity scores in the "Similar" tab as well. The main difference from collection-wide xref is that the scores are computed on demand as part of the request-response cycle, so we’d need to check whether inferring the score is fast enough to not significantly increase response time.
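One way to check the latency concern would be a small timing harness like the sketch below. It assumes a hypothetical callable `score_fn(left, right)` that wraps whichever scoring method is being evaluated; `measure_latency` and its parameters are made up for illustration and are not part of Aleph.

```python
import statistics
import time
from typing import Any, Callable, Dict, Sequence, Tuple


def measure_latency(
    score_fn: Callable[[Any, Any], float],
    pairs: Sequence[Tuple[Any, Any]],
    repeats: int = 5,
) -> Dict[str, float]:
    """Time score_fn on each entity pair and summarise per-call latency in ms."""
    samples = []
    for _ in range(repeats):
        for left, right in pairs:
            start = time.perf_counter()
            score_fn(left, right)  # hypothetical wrapper around the scoring method
            samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "count": len(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }
```

Running this once with the existing method and once with the model-based method on the same pairs would give a rough comparison of the per-pair overhead. The number of candidates scored per request matters as well, so the result should be multiplied accordingly.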
Categorized as a moderate bug because it’s unexpected behavior and confusing for users, but the "Similar" tab doesn’t seem to be used a lot.
