Import iDigBio collections into GrSciColl #169
Comments
Regarding part 1: As for who performs the work, I respectfully think it would be best and most expedient if GBIF is able to devote the time to this. iDigBio/ACIS IT is still short one team member and, although we feel that the resulting product will work much better for everyone, I don't think we could guarantee that we'd be able to commit to it anytime soon. Here are some other notes for section 1 of this issue:
Part 2: Note: there are field definitions in the readme of: https://github.com/iDigBio/idb-us-collections
Comments on individual mappings:
“UniqueNameUUID Added as identifier”: this appears to be intended as an "institution" UUID in a hierarchy of iDigBio records but does not seem to have been implemented. Keep as identifier in GBIF system.
recordsetQuery: This generates a link to the iDigBio recordset (e.g., https://www.idigbio.org/portal/recordsets/ea12da76-1b2e-4944-8709-1de3af1c65e2). This field can be discarded if you are generating links to the recordset another way.
Recordsets: Reminder: this is our parent object for individual records in our system.
KnownToContainTypes: this seems okay to discard.
Collectionextent: can be copied into CatalogedSpecimens where CatalogedSpecimens is blank, but it is not required to keep it as a separate field (discard).
“attributionLogoURL, providerManagedID, derivedFrom”: note that these are Audubon Core terms.
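The notes above could be applied to a single record roughly as follows. This is only a sketch: the key names are copied from this comment and may not match the exact spelling or case of the keys in the iDigBio JSON, and `apply_mapping_notes` is a hypothetical helper, not part of any existing tool.

```python
# Key names follow the spelling used in this comment (assumption); the real
# iDigBio JSON keys may differ in case or spelling.
# recordsetQuery is discarded here on the assumption that recordset links
# are generated another way.
DISCARD = {"recordsetQuery", "KnownToContainTypes", "Collectionextent"}


def apply_mapping_notes(idigbio_record):
    """Return (fields, identifiers) after applying the notes above."""
    fields = dict(idigbio_record)  # work on a copy, leave the input untouched
    identifiers = []

    # UniqueNameUUID is kept, but as an identifier rather than a plain field.
    unique_name_uuid = fields.pop("UniqueNameUUID", None)
    if unique_name_uuid:
        identifiers.append(unique_name_uuid)

    # Copy Collectionextent into CatalogedSpecimens only when the latter is blank.
    if fields.get("CatalogedSpecimens") in (None, "") and fields.get("Collectionextent"):
        fields["CatalogedSpecimens"] = fields["Collectionextent"]

    # Drop the fields that do not need to be kept separately.
    for name in DISCARD:
        fields.pop(name, None)
    return fields, identifiers
```

The Audubon Core terms (attributionLogoURL, providerManagedID, derivedFrom) are left untouched in this sketch, since the comment only notes their origin.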
Regarding part 3: We are okay with the proposed method of integrating IH and iDigBio data. To help determine which record is the most recent, IH or iDigBio, you can use the commit date for an individual file in the iDigBio repo as an added/modified date. The way that repository works is that a human creates/updates a chunk of JSON named ./collections/{collection_uuid}.json and commits. The software workflow then runs tests and aggregates that JSON chunk into the full collections.json. An example individual JSON file would be:
Important Note: The https://raw.githubusercontent.com/iDigBio/idb-us-collections/json-index/collections.json or http://idigbio.github.io/idb-us-collections/collections.json
I hope that all of this helps. Please feel free to @ us for additional questions or clarification.
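For the commit-date idea mentioned for part 3, one possible way to read the date is the GitHub REST API, which can list the commits that touched a given path. The sketch below assumes the per-collection files live under `collections/` on the repository's default branch; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime

REPO = "iDigBio/idb-us-collections"


def commits_url(collection_uuid):
    """Build the GitHub REST API URL listing commits that touched one file."""
    path = f"collections/{collection_uuid}.json"
    query = urllib.parse.urlencode({"path": path, "per_page": 1})
    return f"https://api.github.com/repos/{REPO}/commits?{query}"


def last_modified(commits):
    """Return the newest committer date in a parsed commit list, or None."""
    dates = [
        # GitHub returns ISO 8601 timestamps ending in "Z" (UTC).
        datetime.fromisoformat(c["commit"]["committer"]["date"].replace("Z", "+00:00"))
        for c in commits
    ]
    return max(dates, default=None)


def fetch_last_modified(collection_uuid):
    """Fetch the commit list over the network and return the newest date."""
    with urllib.request.urlopen(commits_url(collection_uuid)) as response:
        return last_modified(json.load(response))
```

Unauthenticated API calls are rate limited, so for a full pass over all collections a local clone and `git log` may be more practical.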
@roncanepa @nrejack I was checking the mappings and looks like |
@asturcon We picked this field up from Audubon Core, but we agreed that you can discard the field since we are not doing anything with it. |
Many thanks for your replies @roncanepa and @nrejack ! |
Fine with me, send away! Thanks so much, everyone!! |
Hey @CatChapman, Morten has been working on matching iDigBio and GrSciColl entries: #187. Morten described his whole matching process and results in the issue linked above, but here are the highlights:
This leaves 235 unmatched iDigBio entries, for which we would create new entries in GrSciColl. Note that we might have some duplicate collections at the beginning, as some collection titles can be a bit vague in GrSciColl and we don't always have reliable codes. No worries, we expect to iron them out a bit later. Morten also documented how we expect to do the merging itself here: #188
@ManonGros WOW! This is great. You guys rock, so much. A spreadsheet would be fantastic - I just emailed you, so feel free to send it there, or link to it (if it's a Google Sheet, etc) in here. Will take a peek at #188 now. |
Great! I am adding the tab-separated CSV file for the matching: It would be great to get your check back in a machine-readable format. We suggest adding a column to this file with true/false for each match, along with a "correction" column containing the match you believe to be the right one.
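As a small illustration of the suggested review format, the two extra columns could be appended to the tab-separated file like this. The column names `match_ok` and `correction` are placeholders chosen for this sketch, not names agreed on in this thread.

```python
import csv
import io


def add_review_columns(tsv_text):
    """Append empty 'match_ok' and 'correction' columns to tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames) + ["match_ok", "correction"]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["match_ok"] = ""    # reviewer fills in true/false
        row["correction"] = ""  # reviewer fills in the correct match, if any
        writer.writerow(row)
    return out.getvalue()
```

Keeping the original columns untouched and only appending new ones means the reviewed file can be joined back to the matching results without extra bookkeeping.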
Morten's JSON file updated with input from CAT: |
Goal(s)
What needs to happen before the actual import
We could do these in a different order of course.
1. Link iDigBio and GrSciColl entries
Since iDigBio describes collections, we should probably:
Once we have a list of matches, we could add identifiers to the GrSciColl entries to work on the import (similar to what we do in the case of IH).
Who should do the matching: iDigBio or GBIF?
Everyone probably has an idea on how to proceed but for the sake of tracking what is happening, I am writing here the steps of the matching process:
Now who will do what?
2. Agree on the mapping of iDigBio and GrSciColl fields
The iDigBio and GrSciColl models seem pretty similar. Here is how we propose to map the fields. Could you go over this and let us know if you have any comments?
3. Decide what to do when there is an overlap between IH and iDigBio
As mentioned earlier, we are working on synchronising Index Herbariorum and GrSciColl (#167). There is a partial overlap between iDigBio and IH.
What should we do in these cases?
I suggest overwriting the information for the fields provided by IH (the IH value overwrites the iDigBio or GrSciColl value) and keeping the fields that come from iDigBio only.
If the iDigBio record is the most up to date, we would create a GitHub issue and then send the latest update to IH.
Would that be ok?
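To make the proposal concrete, here is a minimal sketch of the overwrite rule and the "most up to date" check. It assumes records are flat dictionaries that already use GrSciColl field names, and that any field present in the IH record wins even if its value is empty; both function names are hypothetical.

```python
def merge_overlap(ih_fields, idigbio_fields):
    """Combine two records: IH values win, iDigBio-only fields are kept."""
    merged = dict(idigbio_fields)
    merged.update(ih_fields)  # any field provided by IH overwrites the other value
    return merged


def idigbio_is_newer(ih_modified, idigbio_modified):
    """True when the iDigBio record changed after the IH record.

    In that case the proposal is to open a GitHub issue and send the
    latest update to IH.
    """
    return idigbio_modified > ih_modified
```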