Implement a lookup service for GrSciColl collections #204

Closed
marcos-lg opened this issue Jun 17, 2020 · 11 comments
Labels
GRSciColl Issues related to institutions, collections and staff question

Comments

@marcos-lg
Contributor

This lookup service is intended to link occurrence data with collections. It will use collections data for the lookup, but this behaviour can be overridden with dataset machine tags.

This service could receive the following parameters:

  • institution code
  • institution ID
  • collection code
  • collection ID
  • dataset key
  • owner institution code ??

If there are machine tags in the dataset we use them and stop the lookup.

The service should return how good the match is (exact, fuzzy, etc.). Exact matches will only happen if the codes match and the IDs either match or are not contradictory (e.g. present on only one side).
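A minimal sketch of the exact-match rule proposed above, assuming each side is reduced to a code and an optional ID. The function names and the `EXACT`/`FUZZY`/`NONE` labels are illustrative assumptions, not the service's actual API:

```python
from typing import Optional


def ids_compatible(record_id: Optional[str], registry_id: Optional[str]) -> bool:
    """IDs are compatible when they are equal or when at least one side is missing."""
    if record_id is None or registry_id is None:
        return True
    return record_id == registry_id


def match_type(record_code: Optional[str], record_id: Optional[str],
               registry_code: str, registry_id: Optional[str]) -> str:
    """Classify one candidate (labels are hypothetical)."""
    code_ok = record_code is not None and record_code == registry_code
    id_equal = record_id is not None and record_id == registry_id
    if code_ok and ids_compatible(record_id, registry_id):
        return "EXACT"   # codes match and IDs are not contradictory
    if code_ok or id_equal:
        return "FUZZY"   # only one of the two agrees
    return "NONE"
```

For instance, a record with code `K` and no ID would be an exact match for a registry entry with code `K`, while the same record with a different ID would only be fuzzy.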

Anything else? Are there any other parameters that would be useful to take into account?

@marcos-lg
Contributor Author

I've deployed a first version of the collections lookup service to DEV (it doesn't use the machine tags yet) to see whether this is what we were expecting.

It returns one list of institution matches and another of collection matches. They are lists because codes are not unique, so there can be several candidates that we can't tell apart by any other field.

For each match it shows:

  • type: either exact or fuzzy. A match is exact only if both the code and the identifier match; otherwise it's fuzzy.
  • remarks: observations that explain how the match was made. The possible values are:
    • CODE_MATCH: case-sensitive
    • IDENTIFIER_MATCH: case-sensitive
    • ALTERNATIVE_CODE_MATCH: case-sensitive
    • NAME_MATCH: ignores case and strips accents and whitespace, but doesn't do prefix or suffix matching
    • PROBABLY_ON_LOAN: it happens when the owner institution and the institution are not the same
    • INST_COLL_MISMATCH: it happens when the institution of a collection is not present in the institutions matched.
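The NAME_MATCH behaviour listed above (ignore case, strip accents and whitespace, no prefix or suffix matching) could be sketched as follows. This is an assumption about one way to do it with Python's standard library, not the service's actual code; `normalize_name` and `name_match` are hypothetical names:

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lower-case a name and remove accents and all whitespace."""
    # NFD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # str.split() with no arguments drops every run of whitespace.
    return "".join(without_accents.split()).casefold()


def name_match(a: str, b: str) -> bool:
    """Whole-string comparison only: no prefix or suffix matching."""
    return normalize_name(a) == normalize_name(b)
```

With this normalization, "Muséum  national" and "museum National" compare equal, while "Museum" and "Museum of X" do not.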

When there are institution matches, a collection only matches fuzzily if it belongs to one of the matched institutions. Exact collection matches are always returned.
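As a rough illustration of that filtering rule, assuming each collection match is a dictionary with hypothetical `type` and `institutionKey` fields (these names are not taken from the service):

```python
def visible_collection_matches(collection_matches, matched_institution_keys):
    """Keep exact collection matches always; keep fuzzy ones only when their
    institution is among the matched institutions (if there are any)."""
    if not matched_institution_keys:
        # No institution matched, so there is nothing to filter against.
        return list(collection_matches)
    return [
        m for m in collection_matches
        if m["type"] == "exact" or m["institutionKey"] in matched_institution_keys
    ]
```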

You can check these examples to see how the service works:

Am I missing something @MortenHofft @timrobertson100 ?

@marcos-lg
Contributor Author

For now, I've added the machine tag check using the format that was already used in pipelines:

  • Namespace: processing.gbif.org
  • Names:
    • institutionCode: maps an institution code to a GrSciColl institution
    • collectionCode: maps a collection code to a GrSciColl collection
    • collectionToInstitutionCode: maps a collection code to a GrSciColl institution. I'd probably rename it to collectionCodeToInstitution (TBD).
    • institutionToCollectionCode: maps an institution code to a GrSciColl collection. I'd probably rename it to institutionCodeToCollection (TBD).

The value of the tags should follow the pattern {key}:{code}.
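A small sketch of how such a tag could be parsed, using the namespace and tag names listed above. It assumes the key itself contains no colon (so the value is split on the first one); `TagMapping` and `parse_machine_tag` are hypothetical names, not part of the pipelines code:

```python
from typing import NamedTuple, Optional

NAMESPACE = "processing.gbif.org"
TAG_NAMES = {
    "institutionCode",
    "collectionCode",
    "collectionToInstitutionCode",
    "institutionToCollectionCode",
}


class TagMapping(NamedTuple):
    name: str  # which kind of mapping the tag expresses
    key: str   # key of the GrSciColl entity
    code: str  # code found in the occurrence data


def parse_machine_tag(namespace: str, name: str, value: str) -> Optional[TagMapping]:
    """Parse a '{key}:{code}' value; return None for unrelated or malformed tags."""
    if namespace != NAMESPACE or name not in TAG_NAMES:
        return None
    # Assumption: the key has no colon, so the first colon is the separator.
    key, sep, code = value.partition(":")
    if not sep or not key or not code:
        return None
    return TagMapping(name, key, code)
```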

@MortenHofft
Member

MortenHofft commented Jul 23, 2020

It looks good - I'm very curious to see it applied to actual data.

verbose or not
Should we have a "verbose" option like in the species match?

If we imagine anyone but staff using this, it might be useful to just have a match or none option. Say Plazi using it to create links from collection codes in articles.

// plain lookup - not verbose - will at most return one institution and one collection.
{
  institutionMatch: {
    matchType: 'NONE'
  },
  collectionMatch: {
    matchType: 'FUZZY',
    reasons: ['SAME_NAME', 'SAME_CODE', 'SAME_COUNTRY'],
    entity: {
      ...
    }
  }
}

// verbose option
// could be like the one running in dev
{
  "institutionMatches": [],
  "collectionMatches": [
    ...
  ]
}

on naming
The species match API uses matchType instead of type, and the clusters use reasons instead of remarks.

Real data
Before running it on real data, I wonder if it would be worth checking the top 500 distinct combinations of institution code/ID and collection code/ID and manually assessing some of them. We might learn something (say, that we need to try flipping ID and code).

What constitutes a match?
When do you imagine this will trigger a match?

  • Exact only?
  • Fuzzy but only one result
  • Exact institution, but only one fuzzy collection?

Country as a disambiguator
Would it make sense to add country as a search param? When indexing occurrences we could add the country of the publisher to disambiguate when there are for example 2 collection matches? Or is that better done by the consumer iterating the results?

@MortenHofft
Member

I just tried matching against a CSV extract I had from some time ago.
The CSV holds distinct combinations of institutionId, institutionCode, datasetKey and publisherKey, with an occurrence count for each.

I took anything with more than 2500 occurrences and tried to match them against the service.

105,871,241 occurrences had a single match
 24,553,278 with an exact singular match
 27,113,941 occurrences had multiple matches
 34,998,066 occurrences had no match

2,374 combinations were tested against the service (several may carry the same institution ID and code, since each combination also includes dataset and publisher).

That isn't a bad start. I haven't evaluated the quality of the matches though.
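To put those counts in proportion, here is a quick calculation. It assumes the single, multiple and no-match groups are disjoint and together cover everything tested, and that the exact singular matches are a subset of the single matches; neither assumption is stated in the numbers themselves:

```python
single = 105_871_241        # occurrences with a single match
exact_single = 24_553_278   # of which the single match is exact (assumed subset)
multiple = 27_113_941       # occurrences with multiple matches
no_match = 34_998_066       # occurrences with no match

total = single + multiple + no_match


def share(count: int) -> float:
    """Percentage of all tested occurrences, rounded to one decimal."""
    return round(100 * count / total, 1)


print(f"total occurrences tested: {total:,}")
for label, count in [
    ("single match", single),
    ("exact single match", exact_single),
    ("multiple matches", multiple),
    ("no match", no_match),
]:
    print(f"{label}: {share(count)}%")
```

Under those assumptions, roughly 63% of the tested occurrences get a single match, about 16% get several, and about 21% get none.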

@marcos-lg
Contributor Author

What constitutes a match?
When do you imagine this will trigger a match?

  • Exact only?
  • Fuzzy but only one result
  • Exact institution, but only one fuzzy collection?

By this, do you mean the non-verbose version, where we show only one match?

It could be something like:

  • For institutions
    • Only one exact match
    • Only one fuzzy match
  • For collections
    • Only one exact match
    • If there was an institution match, only one fuzzy match whose institution is the same as the institution matched
    • If there wasn't an institution match, only one fuzzy match
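One possible reading of these rules as code, for the non-verbose case. It assumes exact matches take precedence over fuzzy ones and that several exact matches mean no match is chosen; the `matchType`, `key` and `institutionKey` field names are hypothetical:

```python
def only(items):
    """Return the single element of a list, or None for zero or several."""
    return items[0] if len(items) == 1 else None


def pick_institution(matches):
    exact = [m for m in matches if m["matchType"] == "EXACT"]
    if exact:
        return only(exact)  # several exact matches: no choice is made
    return only([m for m in matches if m["matchType"] == "FUZZY"])


def pick_collection(matches, institution=None):
    exact = [m for m in matches if m["matchType"] == "EXACT"]
    if exact:
        return only(exact)
    fuzzy = [m for m in matches if m["matchType"] == "FUZZY"]
    if institution is not None:
        # Fuzzy collections must belong to the institution that was matched.
        fuzzy = [m for m in fuzzy if m["institutionKey"] == institution["key"]]
    return only(fuzzy)
```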

Do you think we should also provide an overall match status?

@MortenHofft
Member

What constitutes a match?
When do you imagine this will trigger a match?

  • Exact only?
  • Fuzzy but only one result
  • Exact institution, but only one fuzzy collection?

By this, do you mean the non-verbose version, where we show only one match?

I meant when using it in pipelines for assigning GrSciColl IDs to occurrences. I had imagined that this service included the decision and all logic. How does it work for other lookup services?


That reminds me, you mentioned the other day that you considered adding all the matched IDs to the occurrence index.
I guess there are 2 possible versions:

  • Adding all possible candidates. That effectively pushes the burden to the UI or the user, and the same specimens would appear under multiple collections.
  • Only adding a link when we have one confident match. The service takes responsibility for the statement. We will be able to match fewer.

I'm more in favour of version 2: only adding a GrSciColl ID to an occurrence when we have one confident match, not an array of candidate matches. If we want more matches, we ask publishers to add better identifiers to either GrSciColl or the occurrences, or we add machine tags to the datasets where needed.

If it is useful to have all candidates indexed, could we then consider a separate field for it?

@MortenHofft
Member

Should the service return flags, e.g. FUZZY COLLECTION CODE MATCH or NO COLLECTION CODE MATCH, similar to the species match service?

@marcos-lg
Contributor Author

marcos-lg commented Jul 24, 2020

I can think of these flags:

  • AMBIGUOUS_INSTITUTION: more than 1 institution was found and we couldn't break the tie
  • AMBIGUOUS_COLLECTION: same as above but for collections
  • FUZZY_INSTITUTION_MATCH: 1 institution matched but fuzzily
  • FUZZY_COLLECTION_MATCH: same as above but for collections
  • INSTITUTION_NAME_USED: the institutionCode field contains the institution name instead of the code
  • OWNER_INSTITUTION_NAME_USED: same as above but for the owner institution
  • COLLECTION_NAME_USED: same as above but for collections
  • NO_COLLECTION_CODE_MATCH: the code provided didn't match
  • NO_INSTITUTION_CODE_MATCH: same as above
  • NO_COLLECTION_ID_MATCH: the ID provided didn't match
  • NO_INSTITUTION_ID_MATCH: same as above
  • INSTITUTION_COLLECTION_MISMATCH: the collection found doesn't belong to the institution matched

EDIT: the INSTITUTION_NAME_USED-style flags could perhaps be removed, with FUZZY_INSTITUTION_MATCH used for these cases instead. I don't know which would be more useful for publishers.

@MortenHofft
Member

I like it - it is my impression that many publishers appreciate those flags and act on them. This will give them the insights to modify data and improve the matching.

marcos-lg added a commit that referenced this issue Aug 3, 2020
* #204 implementation of the lookup service without using machine tags yet
* updated gbif-api version
* fixed ITs
* #204 added machine tags check in collections lookup
* #204 added verbose and country params + divided matches and alternative matches
* #204 check for non-existing datasets

@marcos-lg
Contributor Author

marcos-lg commented Aug 3, 2020

The service now returns a response like:

{
  "institutionMatch": {
    ...
  },
  "collectionMatch": {
    ...
  },
  "alternativeMatches": {
    "institutionMatches": [],
    "collectionMatches": []
  }
}

The alternative matches are only shown if the verbose parameter is set to true. The fuzzy matches are limited to 20 results for performance reasons.

A country parameter was also added to break ties: http://api.gbif-dev.org/v1/grscicoll/lookup?institutionCode=BR&country=BE&verbose=true
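For reference, a request URL like the one above can be assembled with the standard library. Only the three query parameters shown in that URL are used here; `lookup_url` is a hypothetical helper and nothing is actually sent over the network:

```python
from urllib.parse import urlencode

BASE_URL = "http://api.gbif-dev.org/v1/grscicoll/lookup"


def lookup_url(**params) -> str:
    """Build a lookup URL, skipping unset parameters and lower-casing booleans."""
    query = {
        name: (str(value).lower() if isinstance(value, bool) else value)
        for name, value in params.items()
        if value is not None
    }
    return f"{BASE_URL}?{urlencode(query)}"


print(lookup_url(institutionCode="BR", country="BE", verbose=True))
```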

A match happens if any of these conditions are met:

  • There's only 1 machine tag match
  • There's only 1 exact match
  • There are multiple exact matches but only one matches the country parameter received
  • There's only 1 fuzzy match
  • There are multiple fuzzy matches but only one matches at least the code or the id and one more field (name or alternative code)
  • There are multiple fuzzy matches but only one matches the country parameter received

Additionally, institutions are not considered a match when the owner institution differs from the institution. Likewise, collections whose institution differs from the accepted institution match are not considered a match.
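The conditions listed above could be sketched as a tiered selection. This is only one interpretation: it assumes the tiers are tried in the order machine tag, exact, fuzzy, and that the first non-empty tier decides the outcome. The candidate fields (`type`, `country`, `reasons`) are hypothetical, and the owner-institution and institution-mismatch exclusions are left out:

```python
def only(items):
    """Return the single element of a list, or None for zero or several."""
    return items[0] if len(items) == 1 else None


def strong_fuzzy(candidate) -> bool:
    """Code or ID matched, plus one more field (name or alternative code)."""
    reasons = set(candidate["reasons"])
    return bool(reasons & {"CODE_MATCH", "IDENTIFIER_MATCH"}) and bool(
        reasons & {"NAME_MATCH", "ALTERNATIVE_CODE_MATCH"}
    )


def choose(candidates, country=None):
    """Return the accepted candidate, or None when the tie can't be broken."""
    for tier in ("MACHINE_TAG", "EXACT", "FUZZY"):
        group = [c for c in candidates if c["type"] == tier]
        if not group:
            continue
        if len(group) == 1:
            return group[0]
        if tier == "MACHINE_TAG":
            return None  # several machine tag matches: ambiguous
        if tier == "FUZZY":
            winner = only([c for c in group if strong_fuzzy(c)])
            if winner is not None:
                return winner
        if country is not None:
            return only([c for c in group if c.get("country") == country])
        return None
    return None
```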

I haven't added the flags but a status field instead:

  • ACCEPTED: accepted match
  • AMBIGUOUS: more than 1 result was found and we couldn't break the tie
  • AMBIGUOUS_MACHINE_TAGS: same as above but with machine tag matches
  • AMBIGUOUS_OWNER: there are results, but they don't match the owner institution, so we skip them to avoid linking collections on loan
  • AMBIGUOUS_INSTITUTION_MISMATCH: there are fuzzy matches, but they don't belong to the matched institution
  • DOUBTFUL: the match found is fuzzy

The rest of the flags can be inferred from the reasons field of the match. Issues can be set from this field in pipelines.
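A hedged sketch of how the status values above might be derived. The enum members come from the list above; the `derive_status` function, its parameters and the order in which the ambiguous cases are checked are assumptions, and the case with no candidates at all returns `None` because no status is named for it here:

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    ACCEPTED = "ACCEPTED"
    AMBIGUOUS = "AMBIGUOUS"
    AMBIGUOUS_MACHINE_TAGS = "AMBIGUOUS_MACHINE_TAGS"
    AMBIGUOUS_OWNER = "AMBIGUOUS_OWNER"
    AMBIGUOUS_INSTITUTION_MISMATCH = "AMBIGUOUS_INSTITUTION_MISMATCH"
    DOUBTFUL = "DOUBTFUL"


def derive_status(accepted_type: Optional[str], n_candidates: int,
                  n_machine_tag: int = 0, n_owner_skipped: int = 0,
                  n_institution_skipped: int = 0) -> Optional[Status]:
    """accepted_type is the type of the chosen match, or None if none was chosen."""
    if accepted_type is not None:
        # A fuzzy match is accepted but reported as doubtful.
        return Status.DOUBTFUL if accepted_type == "FUZZY" else Status.ACCEPTED
    if n_machine_tag > 1:
        return Status.AMBIGUOUS_MACHINE_TAGS
    if n_owner_skipped > 0:
        return Status.AMBIGUOUS_OWNER
    if n_institution_skipped > 0:
        return Status.AMBIGUOUS_INSTITUTION_MISMATCH
    if n_candidates > 1:
        return Status.AMBIGUOUS
    return None  # assumption: nothing was found, so there is no status to report
```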

@marcos-lg
Contributor Author

marcos-lg commented Aug 12, 2020

From our PROD data, I've extracted the combinations of these fields that appear in more than 1,000 records:

  • v_institutionid
  • v_institutioncode
  • v_ownerinstitutioncode
  • v_collectioncode
  • v_collectionid
  • datasetkey

Additionally, I took the country from the publishing organization of the dataset.
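The extraction described above amounts to a group-and-count over the listed fields. A minimal in-memory sketch, assuming the records are available as dictionaries (the real extraction was presumably a query against the occurrence store, which is not shown in this thread); `frequent_combinations` is a hypothetical name:

```python
from collections import Counter

FIELDS = (
    "v_institutionid",
    "v_institutioncode",
    "v_ownerinstitutioncode",
    "v_collectioncode",
    "v_collectionid",
    "datasetkey",
)


def frequent_combinations(records, min_count=1000):
    """Count each distinct combination of FIELDS and keep those that
    occur in more than min_count records."""
    counts = Counter(tuple(record.get(field) for field in FIELDS) for record in records)
    return {combo: n for combo, n in counts.items() if n > min_count}
```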

Then I passed them to the lookup service in UAT. The results are in this spreadsheet.
