Implement a lookup service for GrSciColl collections #204
I've deployed a first version of the collections lookup service to DEV (it doesn't use the machine tags yet) to check whether this is what we were expecting. It returns one list of institution matches and another of collection matches. They are lists because codes are not unique, so there may be several candidates that we can't discriminate between by any other field. For each match it shows:
When there are institution matches, a collection only matches fuzzily if it belongs to one of the matched institutions. Exact collection matches are always returned. You can check these examples to see how the service works:
Am I missing something, @MortenHofft @timrobertson100?
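The collection filtering rule from this comment (exact matches always returned, fuzzy matches only within the matched institutions) could be sketched roughly as below. This is only an illustration in Python, not the service's code: `CollectionMatch`, its fields and `filter_collection_matches` are hypothetical names, and keeping fuzzy matches when there are no institution matches at all is an assumption about the intended behaviour.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class CollectionMatch:
    # Hypothetical shape of a collection candidate; field names are assumptions.
    collection_key: str
    institution_key: str
    exact: bool  # True for an exact match, False for a fuzzy one


def filter_collection_matches(
    candidates: List[CollectionMatch], matched_institution_keys: Set[str]
) -> List[CollectionMatch]:
    """Keep exact matches always; keep a fuzzy match only if it belongs to
    one of the matched institutions (when there are institution matches)."""
    kept = []
    for candidate in candidates:
        if candidate.exact:
            kept.append(candidate)
        elif not matched_institution_keys:
            # Assumption: with no institution matches, fuzzy matches are not filtered.
            kept.append(candidate)
        elif candidate.institution_key in matched_institution_keys:
            kept.append(candidate)
    return kept
```

With institution `i1` matched, a fuzzy candidate under `i2` would be dropped, while an exact candidate under any institution would still be returned.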
I've added the machine tag check, for now following the format already used in pipelines:
The value of the tags should follow the pattern
It looks good - I'm very curious to see it applied to actual data.
verbose or not
If we imagine anyone but staff using this, it might be useful to just have a
on naming
Real data
What constitutes a match?
Country as a disambiguator
I just tried matching against a CSV extract I had from some time ago. I took every entry with more than 2500 occurrences and ran it against the service.
That isn't a bad start. I haven't evaluated the quality of the matches though.
By this you mean the non-verbose version, where we show only 1 match, right? It could be something like:
Do you think we should also provide an overall match status?
I meant when using it in pipelines for assigning GrSciColl IDs to occurrences. I had imagined that this service included the decision and all logic. How does it work for other lookup services? That reminds me, you mentioned the other day that you considered adding all the matched IDs to the occurrence index.
I'm more in favour of version 2: only adding a GrSciColl ID to an occurrence when we have 1 confident match, not an array of candidate matches. If we want more matches, we ask publishers to add better identifiers to either GrSciColl or the occurrences, or we add machine tags to the datasets where needed. If it is useful to have all candidates indexed, could we then consider a separate field for them?
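A minimal sketch of that "version 2" decision, assuming the lookup has already produced a list of confident match keys. The function name and the `grscicoll_key` / `candidate_keys` field names are hypothetical and only illustrate keeping the assigned ID separate from the candidates.

```python
from typing import Dict, List, Optional


def assign_grscicoll_key(confident_matches: List[str]) -> Dict[str, object]:
    """Assign a GrSciColl key only when there is exactly one confident match.

    All candidates are kept in a separate (hypothetical) field so they could
    be indexed without being treated as the assigned ID.
    """
    assigned: Optional[str] = (
        confident_matches[0] if len(confident_matches) == 1 else None
    )
    return {
        "grscicoll_key": assigned,
        "candidate_keys": list(confident_matches),
    }
```

With zero or several confident matches the assigned key stays empty, which is the case where publishers would need to provide better identifiers or machine tags.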
Should the service return flags, such as FUZZY COLLECTION CODE MATCH or NO COLLECTION CODE MATCH, similar to the species match service?
I can think of these flags:
EDIT: the
I like it - it is my impression that many publishers appreciate those flags and act on them. This will give them the insights to modify data and improve the matching.
* #204 implementation of the lookup service without using machine tags yet
* updated gbif-api version
* fixed ITs
* #204 added machine tags check in collections lookup
* #204 added verbose and country params + divided matches and alternative matches
* #204 check for non-existing datasets
The service now returns a response like:
The alternative matches are only shown if the
It was also added a
A match happens if any of these conditions are met:
Additionally, institutions whose owner institution is different from the institution are not considered a match. Likewise, collections whose institution doesn't match the accepted institution match are not considered a match. I haven't added the flags but a status field instead:
The rest of the flags can be inferred from the
I've extracted from our data in PROD combinations of these fields that are present in more than 1000 records:
Additionally, I took the country from the publishing organization of the dataset. Then I passed them to the lookup service in UAT. The results are in this spreadsheet
This lookup service is intended to link occurrence data with collections. It will use collections data for the lookup, but this behaviour could be overridden with dataset machine tags.
This service could receive the following parameters:
If there are machine tags in the dataset we use them and stop the lookup.
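The short-circuit described here could look roughly like the sketch below. It assumes the dataset's machine tags have already been reduced to a simple name-to-value mapping; the `"collection"` tag name, `lookup_collection` and the `codes_lookup` callback are hypothetical and do not reflect the actual tag format.

```python
from typing import Callable, Dict, Optional


def lookup_collection(
    dataset_machine_tags: Dict[str, str],
    codes_lookup: Callable[[], Optional[str]],
) -> Optional[str]:
    """Use the machine tag when present and stop; otherwise run the lookup."""
    # Hypothetical tag name; the real namespace and name are not given here.
    tagged_key = dataset_machine_tags.get("collection")
    if tagged_key is not None:
        return tagged_key  # the codes lookup is never invoked in this case
    return codes_lookup()
```

Passing the regular lookup as a callback keeps it from running at all when a machine tag decides the result.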
The service should return how good the match is (exact, fuzzy, etc.). Exact matches will only happen if codes match and IDs match or are not contradictory (e.g. present on only one side).
Anything else? Are there any other parameters that would be useful to take into account?
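One possible reading of the exact-match rule, as a sketch. The `"EXACT"` / `"FUZZY"` / `"NONE"` labels and the function names are hypothetical, and treating "only the code matches" or "only the ID matches" as fuzzy is an assumption, since the description only says "exact, fuzzy, etc.".

```python
from typing import Optional


def ids_compatible(id_a: Optional[str], id_b: Optional[str]) -> bool:
    """IDs are compatible when they are equal or when a side is missing."""
    return id_a is None or id_b is None or id_a == id_b


def match_type(
    code_in: Optional[str],
    code_entity: Optional[str],
    id_in: Optional[str],
    id_entity: Optional[str],
) -> str:
    codes_match = code_in is not None and code_in == code_entity
    ids_match = id_in is not None and id_in == id_entity
    if codes_match and ids_compatible(id_in, id_entity):
        return "EXACT"  # codes agree and IDs agree or do not contradict
    if codes_match or ids_match:
        return "FUZZY"  # assumption: partial agreement counts as fuzzy
    return "NONE"
```

For example, matching codes with an ID present on only one side would count as exact here, while matching codes with two different IDs would be downgraded to fuzzy.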