Synchronize with Index Herbariorum - Collections and institutions #167

Closed
ManonGros opened this issue Jan 23, 2020 · 8 comments
Labels
GRSciColl Issues related to institutions, collections and staff

Comments

@ManonGros
Contributor

ManonGros commented Jan 23, 2020

Before we start

These are my assumptions about the GrSciColl registry:

  1. We want to use GrSciColl to link the institutions and collections to the specimens available on GBIF (mainly via the institutions and collections codes).
  2. We want to avoid duplicating the efforts of other registries. Since some of the data is already maintained by IH, we want to rely on IH to maintain that information as much as possible.
  3. We want entries in GrSciColl to be stable, with clear identifiers that citations can link to.

Option 1: Always Map IH to Institutions

Right now, entries in IH describe mostly institutions.
In the context of IH, this makes sense, since we are talking about herbaria only. The problem is that GrSciColl covers a broader context, in which the herbarium/botany part of an institution cannot always stand for the whole institution.

Example of resulting issues

Let's take an example that illustrates the problem: UWO

  • The entry in IH describes a herbarium.
  • The Institution entry in GrSciColl (which is based on IH) also describes a herbarium (for example: "Vascular plants, bryophytes, and fungi, including lichens").
  • However, I was asked to add an entomology collection of 75,000 specimens to this institution.

Another example is ANSP, which also has an arthropod collection but is described as a diatom herbarium in GrSciColl and in IH.

Another type of problem is conflicting information. We have some cases where the description of an institution on GrSciColl is more generic than the one in IH. For example, in the case of LUX, the information was rearranged on GrSciColl.

Possible solutions

With IH mapped to institutions in GrSciColl, we have two possible solutions:

  • Add the arthropod collections to the corresponding institutions, even though these institutions are described as herbaria because GrSciColl got the information from IH. In this case, the information for that institution is maintained in IH.
  • Create new institutions and add the arthropod collections to these new institutions. This also means duplicating some of the information (including the code), which would make the entries difficult to maintain and hard to map to specimens.

Option 2: Map IH entries to collections

Conceptually, it would make more sense for herbaria to be collections in GrSciColl.
In a way, they are "botany collections".
By this, I mean that each IH entry should be a collection attached to an institution. More ideas on how this could work below.

Advantages

Overall, I think it could make GrSciColl more coherent:

  • We are less likely to have conflicting information about an institution (everything herbarium-specific could be described at the collection level).
  • We wouldn't have to duplicate institution codes to accommodate the herbarium and non-herbarium parts of one institution.
  • Other non-botany collections could be added to the same institution.

How this could work

These are just some ideas to be discussed. Here is what we could try to achieve:

  1. Each IH entry will make a collection attached to an institution in GrSciColl.
  2. If the GrSciColl institution doesn't exist, create one from the information available in IH (name, code, address, etc.: everything except taxonomic coverage and other collection-specific info). When that happens, some info, such as the address, might be identical between the collection and the institution, but that is OK.
  3. The institution code and collection code can be the same unless specified otherwise.
  4. When synchronising, unless specified otherwise (with a tag or a checkbox?), the info from IH can update both collection and institution. Otherwise, only the collection is updated.
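
The four steps above can be sketched roughly as follows. This is only an illustration under assumptions: the `Institution` and `Collection` classes, the `sync_with_ih` flag (standing in for the "tag or checkbox" of step 4), and the dictionary keys of `entry` are all hypothetical and do not reflect the actual GrSciColl or IH data models.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Institution:
    code: str
    name: str
    address: str = ""
    sync_with_ih: bool = True  # hypothetical opt-out flag (step 4)


@dataclass
class Collection:
    code: str
    name: str
    institution_code: str
    address: str = ""
    taxonomic_coverage: str = ""


def apply_ih_entry(entry: dict,
                   institutions: Dict[str, Institution],
                   collections: Dict[str, Collection]) -> None:
    """Apply one IH entry to in-memory registries keyed by code."""
    code = entry["code"]
    inst = institutions.get(code)
    if inst is None:
        # Step 2: create the institution from the generic IH fields only.
        inst = Institution(code=code, name=entry["name"],
                           address=entry.get("address", ""))
        institutions[code] = inst
    elif inst.sync_with_ih:
        # Step 4: IH updates the institution unless it has been opted out.
        inst.name = entry["name"]
        inst.address = entry.get("address", "")
    # Steps 1 and 3: the IH entry always becomes a collection,
    # using the same code as the institution by default.
    collections[code] = Collection(
        code=code,
        name=entry["name"],
        institution_code=inst.code,
        address=entry.get("address", ""),
        taxonomic_coverage=entry.get("taxonomic_coverage", ""),
    )
```

With an institution whose `sync_with_ih` is `False`, a sync would only touch the herbarium collection and leave the more generic institution description alone.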

I tried to illustrate this with the ANSP example:
[Image: idea_IH_synch 001]

Obviously, this would be far from perfect, but it makes more sense to me than mapping everything to institutions. Any thoughts on this? Did I forget anything?

Issue related: #159

@ManonGros
Contributor Author

ManonGros commented Jan 23, 2020

To get an idea of how this would affect linking GBIF specimens to collections and institutions, I checked a few botanical collections on GBIF.
Here are these collections and the codes they use:

The tendency seems to be to use the same code for institution and collection (or to skip one of them).

@MortenHofft
Member

MortenHofft commented Jan 24, 2020

Just to iterate on option 2: assuming the codes match IH and are unique in GrSciColl, that would mean:

  • MO: everything good. We can still match.
  • B: we can only match the herbarium occurrences to the institution.
  • P: the opposite of B. If the collection code is unique we can still match; otherwise we cannot link occurrences to GrSciColl, because the institutionCode differs and cannot be used to differentiate.

The important part is below, I guess:

The institution code and collection code can be the same unless specified otherwise.

  • B can be fixed in several ways:
    • Change the collection code in GrSciColl (no longer in sync with IH).
    • Change the collection code in GrSciColl (keep syncing with IH, but lock the code to "Herbarium Berolinense").
    • Add a dataset machine tag stating that collection code "Herbarium Berolinense" should be interpreted as collection code "B".
  • P can be fixed by changing the institutionCode in GrSciColl and breaking that part of the link to IH.
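
To make the matching cases above concrete, here is a minimal sketch of a lookup on the pair of codes, with an alias table standing in for the dataset machine tag idea. The `REGISTRY` and `ALIASES` structures and the `match_collection` function are hypothetical, made up for illustration; the fallback to the institution code reflects the "same unless specified otherwise" assumption.

```python
from typing import Dict, Optional, Tuple

# Hypothetical registry: (institutionCode, collectionCode) -> collection key.
REGISTRY: Dict[Tuple[str, str], str] = {
    ("MO", "MO"): "collection-MO",
    ("B", "B"): "collection-B",
}

# Hypothetical stand-in for a machine tag that reinterprets a
# collection code found in occurrence records.
ALIASES: Dict[str, str] = {"Herbarium Berolinense": "B"}


def match_collection(institution_code: str,
                     collection_code: Optional[str],
                     registry: Dict[Tuple[str, str], str] = REGISTRY,
                     aliases: Dict[str, str] = ALIASES) -> Optional[str]:
    """Return the matching collection key, or None if nothing matches."""
    if collection_code:
        # Reinterpret the code if an alias is declared for it.
        coll = aliases.get(collection_code, collection_code)
    else:
        # Record skipped the collection code: assume it equals the
        # institution code.
        coll = institution_code
    return registry.get((institution_code, coll))
```

In this sketch, MO matches directly, B matches only through the alias, and a record whose institution code is not in the registry returns `None`, which corresponds to the P case.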

@ManonGros
Contributor Author

Logic for creating entities

  1. Update GrSciColl entities only if they have the IH IRN among their identifiers (otherwise skip). This means that one IH entry can update both an institution and a collection.
  2. If there are several matches of the same kind, create a GitHub issue.
  3. For an IH entity with no collection but an institution in GrSciColl, create just the collection, attach it to the existing institution, and update the institution.
  4. For an IH entity with neither a collection nor an institution in GrSciColl, create the institution (see details of fields below) and the collection, and put the IRN on both.
  5. For staff members, link or unlink staff members for institutions and collections that have an IRN.
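
As a rough illustration of rules 1 to 4, the decision for a single IH entry could look like the sketch below. It assumes that registry entities are plain dicts with an optional `irn` key and that matching is done on the IRN alone; the function name and the action strings are hypothetical, and staff handling (rule 5) is left out.

```python
from typing import List


def plan_sync(irn: str,
              institutions: List[dict],
              collections: List[dict]) -> List[str]:
    """Return the list of actions to take for one IH entry."""
    inst_matches = [i for i in institutions if i.get("irn") == irn]
    coll_matches = [c for c in collections if c.get("irn") == irn]

    # Rule 2: several matches of the same kind need a human decision.
    if len(inst_matches) > 1 or len(coll_matches) > 1:
        return ["create_github_issue"]

    actions: List[str] = []
    if inst_matches:
        # Rules 1 and 3: an institution carrying the IRN gets updated.
        actions.append("update_institution")
    if coll_matches:
        # Rule 1: a collection carrying the IRN gets updated.
        actions.append("update_collection")
    else:
        if not inst_matches:
            # Rule 4: nothing exists yet, so create the institution first.
            actions.append("create_institution")
        # Rules 3 and 4: the collection is created in both cases.
        actions.append("create_collection")
    return actions
```

Returning a plan instead of mutating the registry directly would make it easy to log or review what a sync run intends to do before applying it.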

Details of fields to update

IH fields:
Name: Inst + Coll
Herbarium Code: Inst + Coll when creating a new entry; when updating an entry, create a GitHub issue if the code differs (and skip the update in that case)
Current Status: Inst + Coll
Correspondents: see staff
Contact: Inst + Coll
Address: Inst + Coll
Coordinates: Inst
URL: Inst + Coll
Taxonomic Coverage: Coll
Geography: Coll
Notes: Coll
Number of Specimens: Coll
Date Founded: Inst
Incorporated Herbaria: Coll (create a field for it, "Incorporated collection")
Important Collectors: leave out for now; in the future, Coll, maybe with a new field?
TABLE of Specimens/collection: leave out for now; in the future, Coll, maybe with a new field?
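
The field list above could be expressed as a small lookup table, sketched below. The keys are the IH field names as listed; the `FIELD_TARGETS` table, the helper functions and their return values are hypothetical. Fields marked "leave out for now" and the staff-related "Correspondents" are deliberately omitted.

```python
from typing import Dict, List, Optional, Set

BOTH: Set[str] = {"institution", "collection"}

# IH field name -> GrSciColl entity types that receive it.
FIELD_TARGETS: Dict[str, Set[str]] = {
    "Name": BOTH,
    "Current Status": BOTH,
    "Contact": BOTH,
    "Address": BOTH,
    "Coordinates": {"institution"},
    "URL": BOTH,
    "Taxonomic Coverage": {"collection"},
    "Geography": {"collection"},
    "Notes": {"collection"},
    "Number of Specimens": {"collection"},
    "Date Founded": {"institution"},
    "Incorporated Herbaria": {"collection"},
}


def fields_for(target: str) -> List[str]:
    """List the IH fields copied to the given entity type."""
    return sorted(f for f, targets in FIELD_TARGETS.items() if target in targets)


def code_action(ih_code: str, registry_code: Optional[str]) -> str:
    """Herbarium Code rule: set on creation, flag a mismatch on update."""
    if registry_code is None:
        return "set"
    if registry_code != ih_code:
        return "create_github_issue"
    return "unchanged"
```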

Identifiers

Keep the identifiers where they are.

When will we decide to use DOIs?

@marcos-lg
Contributor

The following fields will be added to the GrSciColl Collection entity in order to map some of the fields from IH:

  • taxonomicCoverage
  • geography
  • notes
  • incorporatedCollections
  • importantCollectors
  • collectionSummary (it will be a list of key-value pairs)
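
For illustration only, the new fields could be grouped as in the sketch below. The class name and every type here are placeholder assumptions (free text for the first three, lists of strings for the next two, and string keys with integer values for the summary); the thread does not state the actual types.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CollectionIhFields:
    """Hypothetical container for the IH-derived collection fields."""
    taxonomic_coverage: str = ""   # assumed free text
    geography: str = ""            # assumed free text
    notes: str = ""                # assumed free text
    incorporated_collections: List[str] = field(default_factory=list)
    important_collectors: List[str] = field(default_factory=list)
    # Key-value pairs; the value type is an assumption.
    collection_summary: Dict[str, int] = field(default_factory=dict)
```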

@MortenHofft
Member

@asturcon what are the types for these fields? Single-line strings, text blocks, markdown, numbers, UUIDs?

@marcos-lg
Contributor

@MortenHofft
Member

MortenHofft commented Mar 14, 2020

The same institution appears multiple times in IH and hence in GrSciColl.
E.g.:
http://sweetgum.nybg.org/science/ih/herbarium-details/?irn=126771
http://sweetgum.nybg.org/science/ih/herbarium-details/?irn=126772

This seems to be a case of IH not splitting institutions and collections, and hence having to create 2 entities with the same information, simply to have 2 codes for the 2 collections of the institution.

Now that we have decided (in agreement with IH) to always create an implicit collection, we can arguably delete one of the institutions.
After syncing with IH and before adding/syncing more data (iDigBio), perhaps we should run a deduplication on the institutions in GrSciColl? It is also possible to do this at a later stage; we might just need to merge more data at that point.
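
A deduplication pass like the one suggested above could start by simply grouping institutions that share the same normalised name and address. The sketch below is one possible approach, not an existing GrSciColl feature; the dict keys (`code`, `name`, `address`) and the function names are hypothetical, and real records would likely need fuzzier matching.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def duplicate_candidates(institutions: List[dict]) -> List[List[str]]:
    """Group institution codes that share the same name and address."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for inst in institutions:
        key = (normalise(inst["name"]), normalise(inst.get("address", "")))
        groups[key].append(inst["code"])
    # Only groups with more than one member are merge candidates.
    return [codes for codes in groups.values() if len(codes) > 1]
```

The output is a list of candidate groups for review rather than an automatic merge, since two entries with identical names may still be legitimately distinct.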

@marcos-lg marcos-lg added the GRSciColl Issues related to institutions, collections and staff label Apr 29, 2020
@marcos-lg
Contributor

In production and scheduled to run weekly.


3 participants