Import iDigBio collections into GrSciColl #169

ManonGros · 2020-02-05T10:18:55Z

Goal(s)

We want to migrate the content of the iDigBio collections: https://www.idigbio.org/portal/collections to GrSciColl.
The content will then be curated directly in our registry console.
iDigBio will use the GBIF collection API to access this content.

What needs to happen before the actual import

We could do these in a different order of course.

1. Link iDigBio and GrSciColl entries

Since iDigBio describes collections, we should probably:

Match the iDigBio entries to the GrSciColl collections (based on title, code, etc.)
If no match can be found in collections, we should try to find out if the corresponding iDigBio institution is available in GrSciColl.
If we cannot find any match in the GrSciColl collections and institutions, I think we should create both an institution and a collection attached to it (similar to what we talked about in the case of Index Herbariorum: Synchronize with Index Herbariorum - Collections and institutions #167). Does it make sense?

Once we have a list of matches, we could add identifiers to the GrSciColl entries to work on the import (similar to what we do in the case of IH).

Who should do the matching: iDigBio or GBIF?

Everyone probably has an idea on how to proceed but for the sake of tracking what is happening, I am writing here the steps of the matching process:

Getting the data from iDigBio (from here: http://idigbio.github.io/idb-us-collections/collections.json)
Getting the data from GrSciColl (most likely with the collection API)
Clean up the data (using OpenRefine for example)
Use your favorite algorithm to match the data with the relevant fields.
Manually check the fuzzy/suspicious matches.
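The last two steps above could be sketched roughly as follows. This is only an illustration and not the matching that was eventually run: the iDigBio keys (`collection`, `collection_code`), the GrSciColl keys (`name`, `code`), the scoring bonus and both thresholds are assumptions picked for the example.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", (text or "").lower())
    return " ".join(text.split())


def best_match(idb, candidates, review_below=0.9, reject_below=0.6):
    """Return (candidate, score, status) for one iDigBio record.

    status is "matched", "needs_review" (fuzzy, check manually) or "unmatched".
    All field names and thresholds are hypothetical.
    """
    title = normalize(idb.get("collection"))
    code = normalize(idb.get("collection_code"))
    if not title:
        return None, 0.0, "unmatched"
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, title, normalize(cand.get("name"))).ratio()
        # Small bonus when the codes agree as well (arbitrary weight).
        if code and code == normalize(cand.get("code")):
            score = min(1.0, score + 0.1)
        if score > best_score:
            best, best_score = cand, score
    if best is None or best_score < reject_below:
        return None, best_score, "unmatched"
    status = "matched" if best_score >= review_below else "needs_review"
    return best, best_score, status
```

Anything returned as `needs_review` would go to the manual check rather than being accepted automatically.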

Now who will do what?

2. Agree on the mapping of iDigBio and GrSciColl fields

The models between iDigBio and GrSciColl seem pretty similar. Here is how we propose to map the fields. Could you go over this and let us know if you have any comment?

iDigBio	GrSciColl
Institution	Mapped to "Institution" in the Collection entity, and to "Name" if used to create an institution
Collection	Name in Coll
Recordsets	Set as a MachineTag (since it is for internal use) in coll
RecordsetQuery	MachineTag in coll
Institution Code	Mapped to "Code" in Institution
Collection Code	Mapped to "Code" in Collection
Collection Uuid	Added as an identifier
Collection Lsid	Added as an identifier
Collection Url	Homepage in Coll
Collection Catalog Url	Catalogue URL in Coll
Description	Description in Coll
DescriptionForSpecialists	Concatenated to Description in Coll (or new field?)
CataloguedSpecimens	Number of Specimen in Coll
KnownToContainTypes	Discard? (the field is used less than 100 times) Is it necessary for internal use? In that case, we can add it as a machineTag.
TaxonCoverage	Taxonomic coverage in Coll
Geographic Range	Geographic coverage in Coll
CollectionExtent	Discard? (it seems like in most cases it contains a string with the same value as cataloguedSpecimens)
Contact	Mapped to Staff Name
Contact Role	Mapped to Staff Position
Contact Email	Mapped to Staff Email
Mailing Address	Mailing Address in Coll
Mailing City	Mailing City in Coll
Mailing State	Mailing State in Coll
Mailing Zip	Mailing Postal Code in Coll
Physical Address	Physical Address in Coll
Physical City	Physical City in Coll
Physical State	Physical State in Coll
Physical Zip	Physical Postal Code in Coll
UniqueNameUUID	Added as identifier in inst
AttributionLogoURL	New field?
ProviderManagedID	Added as identifier
DerivedFrom	Added as MachineTag if it is for internal use?
SameAs	Added as identifier
Flags	Added as MachineTag
PortalDisplay	Added as MachineTag
Lat	Latitude in Institution
Lon	Longitude in Institution
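To make the proposal easier to review, here is a minimal sketch of the collection part of the table as a lookup. The source keys follow my reading of the collections.json readme, and the target keys (`name`, `catalogUrl`, `numberSpecimens`, the shape of `identifiers`, the `idigbio.org` namespace, etc.) are assumptions for illustration, not a confirmed registry schema. Institution, contact and address fields are left out.

```python
# Hypothetical source key -> hypothetical GrSciColl collection key.
FIELD_MAP = {
    "collection": "name",
    "collection_code": "code",
    "collection_url": "homepage",
    "collection_catalog_url": "catalogUrl",
    "description": "description",
    "cataloguedSpecimens": "numberSpecimens",
    "taxonCoverage": "taxonomicCoverage",
    "geographic_range": "geography",
}
IDENTIFIER_FIELDS = ["collection_uuid", "collection_lsid", "providerManagedID", "sameAs"]
MACHINE_TAG_FIELDS = ["recordsets", "recordsetQuery", "derivedFrom", "flags", "portalDisplay"]


def to_grscicoll_collection(idb):
    """Build a collection-like dict from one iDigBio record (sketch only)."""
    coll = {"identifiers": [], "machineTags": []}
    for src, dst in FIELD_MAP.items():
        value = idb.get(src)
        if value not in (None, ""):
            coll[dst] = value
    # DescriptionForSpecialists is concatenated to Description, as proposed.
    specialists = idb.get("descriptionForSpecialists")
    if specialists:
        parts = [coll.get("description"), specialists]
        coll["description"] = "\n\n".join(p for p in parts if p)
    for field in IDENTIFIER_FIELDS:
        if idb.get(field):
            coll["identifiers"].append({"source_field": field, "identifier": idb[field]})
    for field in MACHINE_TAG_FIELDS:
        if idb.get(field):
            coll["machineTags"].append(
                {"namespace": "idigbio.org", "name": field, "value": str(idb[field])}
            )
    return coll
```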

3. Decide what to do when there is an overlap between IH and iDigBio

As mentioned earlier, we are working on synchronising Index Herbariorum and GrSciColl (#167). There is a partial overlap between iDigBio and IH.

What should we do in these cases?
I suggest overwriting the information for the fields provided by IH (the IH value overwrites the iDigBio or GrSciColl value) and keeping the fields that come from iDigBio only.
If the iDigBio record is the most up to date, we would create a GitHub issue and then send the latest update to IH.
Would that be ok?
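As a rough sketch of that precedence (flat dicts with made-up field names, not the registry model): start from the GrSciColl entry, let iDigBio fill in fields GrSciColl does not have, then let every non-blank IH value win. Two points here are my assumptions and are open for discussion: iDigBio does not overwrite an existing GrSciColl value, and a blank IH value does not erase anything.

```python
def is_blank(value):
    """Treat None and empty strings as 'no value provided'."""
    return value is None or value == ""


def merge_entry(grscicoll, idigbio, ih):
    """Merge three flat dicts describing the same entity (sketch only)."""
    merged = dict(grscicoll)
    # iDigBio only fills gaps (assumption).
    for field, value in idigbio.items():
        if not is_blank(value) and is_blank(merged.get(field)):
            merged[field] = value
    # IH overwrites whatever is there for the fields it provides.
    for field, value in ih.items():
        if not is_blank(value):
            merged[field] = value
    return merged
```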

roncanepa · 2020-03-12T19:02:13Z

regarding part 1:

As far as who performs the work, I respectfully think it would be best and most expedient if GBIF is able to devote the time to this. iDigBio/ACIS IT is still short by 1 team member and, despite our feelings that the resulting product will work much better for everyone, I don't think we could guarantee that we'd be able to commit to it anytime soon.

Here are some other notes for section 1 of this issue:

1-3 on your list make sense, including the proposed solution in 3 for if no matches can be found
for matching, it might be possible to match from GBIF's institution code to collections.json institution code
based on existing documentation of collections.json (in the repo readme), the institution_lsid is mapped to a "GRBio LSID or coolURI for the institution LSID" if found, otherwise is blank
other matches will likely need to be string-based match algorithms. A potentially helpful note for matching/verification purposes is that the recordset uuid in collections.json will match the recordset uuid served from our API.

nrejackufl · 2020-03-12T19:06:14Z

Part 2:
The individual records in iDigBio’s collections.json are Institution-Collection records. GBIF properly breaks Institution and Collection out into separate entities. See attached diagram for intended hierarchy.

Note: there are field definitions in the readme of: https://github.com/iDigBio/idb-us-collections

Comments on individual mappings:

“UniqueNameUUID Added as identifier” - this appears to be intended as an "institution" UUID in a hierarchy of iDigBio records but does not seem to have been implemented. Keep as identifier in GBIF system.

recordsetQuery: This generates a link to the iDigBio recordset (e.g., https://www.idigbio.org/portal/recordsets/ea12da76-1b2e-4944-8709-1de3af1c65e2). This field can be discarded if you are generating links to the recordset another way.

Recordsets - Reminder: this is our parent object for individual records in our system

KnownToContainTypes: this seems okay to discard.

CollectionExtent: can be copied into CatalogedSpecimens where CatalogedSpecimens is blank, but it does not need to be kept as a separate field (discard).

“attributionLogoURL, providerManagedID, derivedFrom” - note that these are Audubon Core terms
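For the CollectionExtent note above, a minimal sketch of the fallback could look like this. The key names `cataloguedSpecimens` and `collectionExtent` are assumed from the field list, and parsing the extent string into an integer (dropping thousands separators) is my own assumption about what "copied" would mean for a numeric target field.

```python
def to_int(value):
    """Return an int for values like 1200 or "12,500", else None."""
    if value is None:
        return None
    text = str(value).replace(",", "").strip()
    return int(text) if text.isdecimal() else None


def specimen_count(idb):
    """Use cataloguedSpecimens, falling back to collectionExtent when blank."""
    count = to_int(idb.get("cataloguedSpecimens"))
    if count is None:
        count = to_int(idb.get("collectionExtent"))
    return count
```

Free-text extents that are not plain numbers come back as `None` and would need a manual look.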

roncanepa · 2020-03-12T19:09:20Z

Regarding part 3:

We are okay with the proposed method of integrating IH and iDigBio data. To help determine which record is the most recent, IH or iDigBio, you can use the commit date for an individual file in the iDigBio repo as an added/modified date.

The way that repository works is that a human creates/updates a chunk of json named ./collections/{collection_uuid}.json and commits. The software workflow then runs tests and aggregates that json chunk into the full collections.json. An example individual json file would be:

https://github.com/iDigBio/idb-us-collections/blob/master/collections/001c5234-048b-11e5-b0ee-002315492bbc
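One way to read that commit date programmatically is GitHub's REST "list commits" endpoint, filtered by `path`. This is a sketch, not something we run today: the helper names are made up, the repo-relative file path is left to the caller, and unauthenticated requests are rate-limited by GitHub.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode
from urllib.request import urlopen

COMMITS_API = "https://api.github.com/repos/iDigBio/idb-us-collections/commits"


def commits_url(path, branch="master"):
    """URL listing the newest commit that touched `path` on `branch`."""
    query = urlencode({"path": path, "sha": branch, "per_page": 1})
    return f"{COMMITS_API}?{query}"


def latest_commit_date(commits):
    """Extract the committer date from the parsed response (newest first)."""
    if not commits:
        return None
    stamp = commits[0]["commit"]["committer"]["date"]
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def fetch_last_modified(path):
    """Network call: last-modified date of one collection file."""
    with urlopen(commits_url(path)) as resp:
        return latest_commit_date(json.load(resp))
```

The returned date could then be compared with whatever modification date IH exposes for the same record.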

roncanepa · 2020-03-12T19:13:21Z

Important Note: The collections.json file that actually gets loaded and used is served from the json-index or gh-pages branch (it gets pushed to both) and not the master branch. For instance:

https://raw.githubusercontent.com/iDigBio/idb-us-collections/json-index/collections.json

or

http://idigbio.github.io/idb-us-collections/collections.json
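A small loader along these lines would make sure the served file is read rather than the master copy. It assumes collections.json is a JSON array of objects that each carry a `collection_uuid` key; the function names are only illustrative.

```python
import json
from urllib.request import urlopen

# The two served copies mentioned above (not the master branch).
COLLECTIONS_URLS = {
    "json-index": "https://raw.githubusercontent.com/iDigBio/idb-us-collections/json-index/collections.json",
    "gh-pages": "http://idigbio.github.io/idb-us-collections/collections.json",
}


def index_by_uuid(records):
    """Map collection_uuid -> record, skipping records without a uuid."""
    return {r["collection_uuid"]: r for r in records if r.get("collection_uuid")}


def load_collections(source="json-index"):
    """Network call: download and index the aggregated collections file."""
    with urlopen(COLLECTIONS_URLS[source]) as resp:
        return index_by_uuid(json.load(resp))
```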

I hope that all of this helps. Please feel free to @ us for additional questions or clarification.

marcos-lg · 2020-03-13T15:12:17Z

@roncanepa @nrejack I was checking the mappings and it looks like AttributionLogoURL is the only iDigBio field we're missing in our registry. But I checked the collections.json file and noticed that this field is always empty. Should we still add it to our registry, or can we discard it too?

nrejackufl · 2020-03-13T15:20:52Z

@asturcon We picked this field up from Audubon Core, but we agreed that you can discard the field since we are not doing anything with it.

ManonGros · 2020-03-13T16:48:39Z

Many thanks for your replies @roncanepa and @nrejack !
In that case, we will get started on [1. Link iDigBio and GrSciColl entries]. We will do as much as possible automatically and send you and Cat some things that might need manual checking, is that ok with you?

CatChapman · 2020-03-13T18:04:19Z

Fine with me, send away! Thanks so much, everyone!!

ManonGros · 2020-04-03T08:49:44Z

Hey @CatChapman, Morten has been working on matching iDigBio and GrSciColl entries: #187
It turns out that it makes more sense to first match everything to GrSciColl institutions because these are the entries for which we have a lot more details and identifiers. Then, once we have the matches for institutions, we can take a look at the collections and match these as well.

Morten described his whole matching process and results on the issue linked above but here are the highlights:

Match the iDigBio entries based on the IRN
Match the remaining iDigBio entries based on other identifiers
Match the remaining iDigBio entries based on title and code (note that the titles were processed to facilitate the matching)
Match the remaining iDigBio entries based on city and code
Match the remaining iDigBio entries based on title alone when there is no iDigBio institution code
Match the remaining iDigBio entries based on title (despite conflicting codes)
Match the remaining iDigBio entries manually
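The general shape of such a cascade, where each step only looks at what the previous steps left over, can be sketched as below. This is not Morten's code (see #187 for the real process): the field names are hypothetical, only two of the steps are shown, and ambiguous hits are simply passed on to the next step.

```python
def fields(*names):
    """Key function: tuple of cleaned values, or None if any is missing."""
    def key(record):
        values = tuple(str(record.get(n) or "").strip().lower() for n in names)
        return values if all(values) else None
    return key


# (step name, key for iDigBio records, key for GrSciColl institutions)
STRATEGIES = [
    ("title+code", fields("institution", "institution_code"), fields("name", "code")),
    ("city+code", fields("physical_city", "institution_code"), fields("city", "code")),
]


def cascade_match(idb_records, institutions, strategies=STRATEGIES):
    """Return ({collection_uuid: (institution key, step name)}, unmatched)."""
    matches, remaining = {}, list(idb_records)
    for step, idb_key, inst_key in strategies:
        index = {}
        for inst in institutions:
            k = inst_key(inst)
            if k:
                index.setdefault(k, []).append(inst)
        still_unmatched = []
        for rec in remaining:
            k = idb_key(rec)
            hits = index.get(k, []) if k else []
            if len(hits) == 1:  # accept unambiguous hits only
                matches[rec["collection_uuid"]] = (hits[0]["key"], step)
            else:
                still_unmatched.append(rec)
        remaining = still_unmatched
    return matches, remaining
```

Whatever is still in the second return value after the last step is what goes to manual matching.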

This leaves 235 iDigBio entries unmatched for which we would create new entries in GrSciColl.
Now we need your help to check the matching! Could you go over #187 and take a look at the matching result? (We can also provide you with a spreadsheet if it is more convenient).

Note that we might have some duplicate collections at the beginning as some collection titles can be a bit vague in GrSciColl and we don't always have reliable codes. No worries, we expect to iron them out a bit later.

Morten also documented how we expect to do the merging itself here: #188

CatChapman · 2020-04-03T15:02:02Z

@ManonGros WOW! This is great. You guys rock, so much.

A spreadsheet would be fantastic - I just emailed you, so feel free to send it there, or link to it (if it's a Google Sheet, etc) in here.

Will take a peek at #188 now.

ManonGros · 2020-04-06T09:45:51Z

Great! I am adding the tab-separated (TSV) file for the matching:
iDigBio_GrSciColl_matches_march2020.tsv.zip

It would be great to get your check back in a machine-readable format. We suggest adding a column to this file with true/false for each match, along with a "correction" column giving the match you believe to be correct.
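In case it helps, this is roughly how the extra columns could be added and read back with Python's `csv` module. The column names `match_ok` and `correction` are suggestions only, and the example column names in the usage are made up rather than taken from the attached file.

```python
import csv
import io

REVIEW_COLUMNS = ["match_ok", "correction"]


def add_review_columns(tsv_text):
    """Append empty review columns to every row of a TSV string."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames or []) + REVIEW_COLUMNS
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row.update({column: "" for column in REVIEW_COLUMNS})
        writer.writerow(row)
    return out.getvalue()


def split_review(tsv_text):
    """Return (confirmed, rejected) rows; unreviewed rows are in neither."""
    confirmed, rejected = [], []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        verdict = (row.get("match_ok") or "").strip().lower()
        if verdict == "true":
            confirmed.append(row)
        elif verdict == "false":
            rejected.append(row)
    return confirmed, rejected
```

Rejected rows keep their `correction` value, so the proposed replacement match can be applied directly.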

ManonGros · 2020-04-17T10:08:38Z

Morten's JSON file, updated with input from Cat:
iDigBio_Morten_matches_AND_Cat_addition.json.zip

marcos-lg added the GRSciColl Issues related to institutions, collections and staff label Apr 29, 2020

ManonGros closed this as completed Feb 2, 2021
