Import iDigBio collections into GrSciColl #169

Closed
5 tasks done
ManonGros opened this issue Feb 5, 2020 · 12 comments
Labels
GRSciColl Issues related to institutions, collections and staff

Comments

@ManonGros
Contributor

ManonGros commented Feb 5, 2020

Goal(s)

What needs to happen before the actual import

We could do these in a different order of course.

1. Link iDigBio and GrSciColl entries

Since iDigBio describes collections, we should probably:

  1. Match the iDigBio entries to the GrSciColl collections (based on title, code, etc.)
  2. If no match can be found in collections, we should try to find out if the corresponding iDigBio institution is available in GrSciColl.
  3. If we cannot find any match in the GrSciColl collections and institutions, I think we should create both an institution and a collection attached to it (similar to what we talked about in the case of Index Herbariorum: Synchronize with Index Herbariorum - Collections and institutions #167). Does it make sense?

Once we have a list of matches, we could add identifiers to the GrSciColl entries to work on the import (similar to what we do in the case of IH).
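As a rough illustration of the three steps above, here is a minimal sketch of the decision for one iDigBio entry. The key names (`collection`, `collection_code`, `institution`, `institution_code`, `name`, `code`, `key`) and the rule "same name or same code" are assumptions made for the example, not the agreed matching logic.

```python
def norm(text):
    """Lower-case and collapse whitespace so titles and codes compare loosely."""
    return " ".join(str(text or "").lower().split())


def find(entries, name, code):
    """Return the key of the first entry sharing the name or the code, else None."""
    for entry in entries:
        if name and norm(entry.get("name")) == name:
            return entry["key"]
        if code and norm(entry.get("code")) == code:
            return entry["key"]
    return None


def plan_import(idigbio_entry, collections, institutions):
    """Decide what to do with one iDigBio entry (hypothetical field names)."""
    # Step 1: try the GrSciColl collections.
    key = find(collections,
               norm(idigbio_entry.get("collection")),
               norm(idigbio_entry.get("collection_code")))
    if key:
        return ("collection_found", key)
    # Step 2: fall back to the GrSciColl institutions.
    key = find(institutions,
               norm(idigbio_entry.get("institution")),
               norm(idigbio_entry.get("institution_code")))
    if key:
        return ("institution_found", key)
    # Step 3: nothing matched, so both entities would be created.
    return ("create_both", None)
```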

Who should do the matching: iDigBio or GBIF?

Everyone probably has an idea on how to proceed but for the sake of tracking what is happening, I am writing here the steps of the matching process:

Now who will do what?

2. Agree on the mapping of iDigBio and GrSciColl fields

The models between iDigBio and GrSciColl seem pretty similar. Here is how we propose to map the fields. Could you go over this and let us know if you have any comment?

| iDigBio | GrSciColl |
| --- | --- |
| Institution | Mapped to "Institution" in the Collection entity, and to "Name" if used to create an institution |
| Collection | Name in Coll |
| Recordsets | Set as a MachineTag (since it is for internal use) in Coll |
| RecordsetQuery | MachineTag in Coll |
| Institution Code | Mapped to "Code" in Institution |
| Collection Code | Mapped to "Code" in Collection |
| Collection Uuid | Added as an identifier |
| Collection Lsid | Added as an identifier |
| Collection Url | Homepage in Coll |
| Collection Catalog Url | Catalogue URL in Coll |
| Description | Description in Coll |
| DescriptionForSpecialists | Concatenated to Description in Coll (or new field?) |
| CataloguedSpecimens | Number of Specimens in Coll |
| KnownToContainTypes | Discard? (the field is used fewer than 100 times) Is it necessary for internal use? In that case, we can add it as a MachineTag. |
| TaxonCoverage | Taxonomic coverage in Coll |
| Geographic Range | Geographic coverage in Coll |
| CollectionExtent | Discard? (it seems like in most cases it contains a string with the same value as CataloguedSpecimens) |
| Contact | Mapped to Staff Name |
| Contact Role | Mapped to Staff Position |
| Contact Email | Mapped to Staff Email |
| Mailing Address | Mailing Address in Coll |
| Mailing City | Mailing City in Coll |
| Mailing State | Mailing State in Coll |
| Mailing Zip | Mailing Postal Code in Coll |
| Physical Address | Physical Address in Coll |
| Physical City | Physical City in Coll |
| Physical State | Physical State in Coll |
| Physical Zip | Physical Postal Code in Coll |
| UniqueNameUUID | Added as identifier in Inst |
| AttributionLogoURL | New field? |
| ProviderManagedID | Added as identifier |
| DerivedFrom | Added as MachineTag if it is for internal use? |
| SameAs | Added as identifier |
| Flags | Added as MachineTag |
| PortalDisplay | Added as MachineTag |
| Lat | Latitude in Institution |
| Lon | Longitude in Institution |
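To make the proposal easier to review, here is a minimal sketch of how part of this mapping could be applied to one record. The JSON key names on the iDigBio side, the target field names on the GrSciColl side, and the `idigbio.org` machine tag namespace are all assumptions for illustration; only a subset of the fields is covered.

```python
# Plain one-to-one fields (iDigBio key -> assumed GrSciColl collection field).
COLLECTION_FIELDS = {
    "collection": "name",
    "collection_code": "code",
    "collection_url": "homepage",
    "collection_catalog_url": "catalogUrl",
    "description": "description",
    "taxonCoverage": "taxonomicCoverage",
    "geographicRange": "geography",
}
# Fields proposed as identifiers and as machine tags.
IDENTIFIER_FIELDS = ("collection_uuid", "collection_lsid", "sameAs")
MACHINE_TAG_FIELDS = ("recordsets", "recordsetQuery", "flags", "portalDisplay")


def to_grscicoll(entry):
    """Convert one iDigBio record into a GrSciColl-like collection dict."""
    collection = {
        target: entry[source]
        for source, target in COLLECTION_FIELDS.items()
        if entry.get(source)
    }
    # DescriptionForSpecialists is concatenated to the description.
    extra = entry.get("descriptionForSpecialists")
    if extra:
        parts = [collection.get("description"), extra]
        collection["description"] = " ".join(p for p in parts if p)
    collection["identifiers"] = [
        {"type": field, "value": entry[field]}
        for field in IDENTIFIER_FIELDS if entry.get(field)
    ]
    collection["machineTags"] = [
        {"namespace": "idigbio.org", "name": field, "value": str(entry[field])}
        for field in MACHINE_TAG_FIELDS if entry.get(field)
    ]
    return collection
```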

3. Decide what to do when there is an overlap between IH and iDigBio

As mentioned earlier, we are working on synchronising Index Herbariorum and GrSciColl (#167). There is a partial overlap between iDigBio and IH.

What should we do in these cases?
I suggest overwriting the fields provided by IH (the IH value overwrites the iDigBio or GrSciColl value) and keeping the fields that come from iDigBio only.
If the iDigBio record is the more up to date of the two, we would create a GitHub issue and then send the latest update to IH.
Would that be ok?
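To make the proposed precedence concrete, here is a minimal sketch. The function name, the flat dict representation and the ISO date arguments are assumptions. What happens when GrSciColl and iDigBio both have a value and IH has none is not settled above; the sketch keeps the existing GrSciColl value in that case.

```python
from datetime import date


def merge_overlap(grscicoll, idigbio, ih, idigbio_modified, ih_modified):
    """Merge one record present in GrSciColl, iDigBio and IH.

    Returns the merged dict and a flag telling whether IH should be
    notified because the iDigBio record is the more recent one.
    """
    merged = dict(grscicoll)
    # Keep fields that only iDigBio provides (assumption: never overwrite
    # a non-empty GrSciColl value with an iDigBio one).
    for field, value in idigbio.items():
        if merged.get(field) in (None, "") and value not in (None, ""):
            merged[field] = value
    # IH wins for every field it provides.
    for field, value in ih.items():
        if value not in (None, ""):
            merged[field] = value
    notify_ih = date.fromisoformat(idigbio_modified) > date.fromisoformat(ih_modified)
    return merged, notify_ih
```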

@roncanepa

regarding part 1:

As far as who performs the work, I respectfully think it would be best and most expedient if GBIF is able to devote the time to this. iDigBio/ACIS IT is still short by 1 team member and, despite our feelings that the resulting product will work much better for everyone, I don't think we could guarantee that we'd be able to commit to it anytime soon.

Here are some other notes for section 1 of this issue:

  • 1-3 on your list make sense, including the proposed solution in 3 for when no match can be found

  • for matching, it might be possible to match from GBIF's institution code to collections.json institution code

  • based on existing documentation of collections.json (in the repo readme), the institution_lsid is mapped to a "GRBio LSID or coolURI for the institution LSID" if found; otherwise it is blank

  • other matches will likely need to be string-based match algorithms. A potentially helpful note for matching/verification purposes is that the recordset uuid in collections.json will match the recordset uuid served from our API.

@nrejackufl

Part 2:
The individual records in iDigBio’s collections.json are Institution-Collection records. GBIF properly breaks Institution and Collection out into separate entities. See attached diagram for intended hierarchy.

[Attached diagram: intended Institution and Collection hierarchy]

Note: there are field definitions in the readme of: https://github.com/iDigBio/idb-us-collections

Comments on individual mappings:

“UniqueNameUUID Added as identifier” - this appears to be intended as an "institution" UUID in a hierarchy of iDigBio records but does not seem to have been implemented. Keep as identifier in GBIF system.

recordsetQuery: This generates a link to the iDigBio recordset (e.g., https://www.idigbio.org/portal/recordsets/ea12da76-1b2e-4944-8709-1de3af1c65e2). This field can be discarded if you are generating links to the recordset another way.

Recordsets - Reminder: this is our parent object for individual records in our system

KnownToContainTypes: this seems okay to discard.

Collectionextent: can be copied into CatalogedSpecimens where the CatalogedSpecimens is blank, but not required to keep as a separate field (discard).
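For the CollectionExtent suggestion, a minimal sketch of the fallback could look like this. The key names `catalogedSpecimens` and `collectionExtent` and the helper names are assumptions; extents that are not plain numbers are simply ignored here.

```python
import re


def _to_int(value):
    """Parse values such as 12500, "12500" or "12,500"; return None otherwise."""
    digits = re.sub(r"[,\s]", "", str(value))
    return int(digits) if digits.isdigit() else None


def specimen_count(entry):
    """Use the catalogued specimen count, falling back to the collection extent."""
    count = _to_int(entry.get("catalogedSpecimens"))
    if count is None:
        # Only consulted when the catalogued specimen count is blank or unusable.
        count = _to_int(entry.get("collectionExtent"))
    return count
```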

“attributionLogoURL, providerManagedID, derivedFrom” - note that these are Audubon Core terms

@roncanepa

Regarding part 3:

We are okay with the proposed method of integrating IH and iDigBio data. To help determine which record is the most recent, IH or iDigBio, you can use the commit date for an individual file in the iDigBio repo as an added/modified date.

The way that repository works is that a human creates/updates a chunk of json named ./collections/{collection_uuid}.json and commits. The software workflow then runs tests and aggregates that json chunk into the full collections.json. An example individual json file would be:

https://github.com/iDigBio/idb-us-collections/blob/master/collections/001c5234-048b-11e5-b0ee-002315492bbc
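One possible way to read that commit date is the GitHub REST API "list commits" endpoint filtered by `path`. The sketch below is only an illustration: the function names are made up, the file path is passed in by the caller, and unauthenticated requests are subject to GitHub rate limits.

```python
import json
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://api.github.com/repos/iDigBio/idb-us-collections/commits"


def commits_url(path, branch="master"):
    """Build the URL listing the latest commit that touched `path`."""
    return API + "?" + urlencode({"path": path, "sha": branch, "per_page": 1})


def latest_commit_date(payload):
    """Extract the committer date from the decoded JSON list of commits."""
    if not payload:
        return None
    stamp = payload[0]["commit"]["committer"]["date"]
    # GitHub returns ISO 8601 with a trailing "Z"; make it parseable everywhere.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def fetch_last_modified(path):
    """Network call: return the last-modified date of one collection file."""
    with urlopen(commits_url(path)) as response:
        return latest_commit_date(json.load(response))
```

The resulting date could then be compared with the modification date of the corresponding IH record.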

@roncanepa

Important Note: The collections.json file that actually gets loaded and used is served from the json-index or gh-pages branch (it gets pushed to both) and not the master branch. For instance:

https://raw.githubusercontent.com/iDigBio/idb-us-collections/json-index/collections.json

or

http://idigbio.github.io/idb-us-collections/collections.json

I hope that all of this helps. Please feel free to @ us for additional questions or clarification.

@marcos-lg
Contributor

marcos-lg commented Mar 13, 2020

@roncanepa @nrejack I was checking the mappings and it looks like AttributionLogoURL is the only iDigBio field we're missing in our registry. But I checked the collections.json file and noticed that this field is always empty. Should we still add it to our registry, or can we discard it too?

@nrejackufl

@asturcon We picked this field up from Audubon Core, but we agreed that you can discard the field since we are not doing anything with it.

@ManonGros
Contributor Author

Many thanks for your replies @roncanepa and @nrejack !
In that case, we will get started on [1. Link iDigBio and GrSciColl entries]. We will do as much as possible automatically and send you and Cat some things that might need manual checking, is that ok with you?

@CatChapman

Fine with me, send away! Thanks so much, everyone!!

@ManonGros
Contributor Author

Hey @CatChapman, Morten has been working on matching iDigBio and GrSciColl entries: #187
It turns out that it makes more sense to match everything to GrSciColl institutions first, because these are the entries for which we have a lot more details and identifiers. Then, once we have the institution matches, we can take a look at the collections and match these as well.

Morten described his whole matching process and results on the issue linked above but here are the highlights:

  1. Match the iDigBio entries based on the IRN
  2. Match the remaining iDigBio entries based on other identifiers
  3. Match the remaining iDigBio entries based on title and code (note that the titles were processed to facilitate the matching)
  4. Match the remaining iDigBio entries based on city and code
  5. Match the remaining iDigBio entries based on title alone when there is no iDigBio institution code
  6. Match the remaining iDigBio entries based on title (despite conflicting codes)
  7. Match the remaining iDigBio entries manually
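For readers who prefer code, the cascade above can be pictured roughly as follows. This is not Morten's implementation (see #187 for that); it is a simplified sketch that assumes both sides were already reduced to shared, hypothetical keys (`irn`, `title`, `code`, `city`), that iDigBio entries carry an `id` and institutions a `key`, and it folds steps 5 and 6 into a single title-only pass.

```python
def norm(value):
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(str(value or "").lower().split())


def by_fields(*fields):
    """Build a matcher requiring all the given fields to be present and equal."""
    def matcher(entry, institutions):
        wanted = [norm(entry.get(f)) for f in fields]
        if not all(wanted):
            return None  # the entry lacks one of the fields; skip this pass
        for inst in institutions:
            if [norm(inst.get(f)) for f in fields] == wanted:
                return inst["key"]
        return None
    return matcher


# Ordered from the most to the least reliable criterion.
PASSES = [
    ("irn", by_fields("irn")),
    ("title+code", by_fields("title", "code")),
    ("city+code", by_fields("city", "code")),
    ("title", by_fields("title")),
]


def run_cascade(entries, institutions, passes=PASSES):
    """Apply each pass only to the entries left unmatched by earlier passes."""
    matches, remaining = {}, list(entries)
    for name, matcher in passes:
        leftover = []
        for entry in remaining:
            hit = matcher(entry, institutions)
            if hit is None:
                leftover.append(entry)
            else:
                matches[entry["id"]] = (name, hit)
        remaining = leftover
    return matches, remaining  # `remaining` would go to manual matching
```

Recording which pass produced each match makes it easier to review the weaker matches first.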

This leaves 235 iDigBio entries unmatched for which we would create new entries in GrSciColl.
Now we need your help to check the matching! Could you go over #187 and take a look at the matching result? (We can also provide you with a spreadsheet if it is more convenient).

Note that we might have some duplicate collections at the beginning as some collection titles can be a bit vague in GrSciColl and we don't always have reliable codes. No worries, we expect to iron them out a bit later.

Morten also documented how we expect to do the merging itself here: #188

@CatChapman

@ManonGros WOW! This is great. You guys rock, so much.

A spreadsheet would be fantastic - I just emailed you, so feel free to send it there, or link to it (if it's a Google Sheet, etc) in here.

Will take a peek at #188 now.

@ManonGros
Contributor Author

ManonGros commented Apr 6, 2020

Great! I am adding the tab-separated (TSV) file for the matching:
iDigBio_GrSciColl_matches_march2020.tsv.zip

It would be great to get your check back in a machine-readable format. We suggest adding a column to this file with true/false for each match, along with a potential "correction" column containing the match you believe to be correct.
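As an example of how such a reviewed file could be read back, here is a minimal sketch. The column names (`idigbio_uuid`, `grscicoll_key`, `match_ok`, `correction`) are placeholders, not the actual headers of the attached file.

```python
import csv


def read_review(lines):
    """Split reviewed matches into confirmed, corrected and rejected.

    `lines` is any iterable of text lines, for example an open file.
    """
    confirmed, corrections, rejected = {}, {}, []
    for row in csv.DictReader(lines, delimiter="\t"):
        uuid = row["idigbio_uuid"]
        correction = (row.get("correction") or "").strip()
        if (row.get("match_ok") or "").strip().lower() == "true":
            confirmed[uuid] = row["grscicoll_key"]
        elif correction:
            # The reviewer rejected the match and proposed another one.
            corrections[uuid] = correction
        else:
            rejected.append(uuid)
    return confirmed, corrections, rejected
```

With a real file this would be called as `read_review(open(path, newline="", encoding="utf-8"))`, where `path` points to the reviewed TSV.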

@ManonGros
Contributor Author

Morten's JSON file, updated with input from Cat:
iDigBio_Morten_matches_AND_Cat_addition.json.zip

@marcos-lg marcos-lg added the GRSciColl Issues related to institutions, collections and staff label Apr 29, 2020
5 participants