
Synchronize with Index Herbariorum #159

Closed
timrobertson100 opened this issue Dec 11, 2019 · 9 comments
@timrobertson100
Member

Index Herbariorum (IH) is an authoritative catalog that should be the master source for herbaria entities. Herbaria records in the registry should be kept in sync with the ongoing editing efforts of IH.

This first iteration of work is deliberately scoped to accommodate the minimal functionality needed to achieve this. Once complete, additional feature requests can be opened as new issues.

It is envisaged the general synchronization will operate as follows:

  • Retrieve all herbaria from Index Herbariorum
  • For each entity, locate the equivalent institution or collection in GRSciColl using the IH IRN
    • If the entity exists and the two differ, update GRSciColl
    • If the entity does not exist, insert it as an institution with an identifier holding the IH IRN
    • If there is a conflict (e.g. multiple candidate matches), notify the editors for resolution
  • Create, update or delete the associated staff members for the entities

A future version may allow the editing of IH entities in GRSciColl. In that scenario, when entities differ, more complex logic is required, likely notifying both GRSciColl and IH staff to resolve the differences.
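The synchronization outline above can be sketched as a simple loop. This is only a sketch; all six helper names are hypothetical placeholders, not existing GBIF APIs:

```python
# A minimal sketch of the IH -> GRSciColl sync loop described above.
# All six helpers are hypothetical placeholders injected as arguments.
def sync_ih(fetch_ih_herbaria, find_by_irn, entities_differ,
            update_entity, insert_institution, notify_editors):
    for herbarium in fetch_ih_herbaria():
        matches = find_by_irn(herbarium["irn"])
        if len(matches) > 1:
            # Conflict (multiple candidates): hand over to the editors
            notify_editors(herbarium, matches)
        elif len(matches) == 1:
            if entities_differ(herbarium, matches[0]):
                # IH is the master source in this first iteration
                update_entity(matches[0], herbarium)
        else:
            # New entity: insert as an institution, keeping the IH IRN
            insert_institution(herbarium)
```

Injecting the helpers keeps the decision logic testable independently of the IH and registry web services.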

@marcos-lg
Contributor

The institutions from IH have a contact field that holds only a phone, an email and a webUrl (http://sweetgum.nybg.org/science/api/v1/institutions/UARK). A GrSciColl institution/collection has a contacts field, but those contacts are actually person entities (http://api.gbif.org/v1/grscicoll/institution/f7068d69-cf88-42d8-a984-0c4de6ce8579, whose contact is http://api.gbif.org/v1/grscicoll/person/118b48f0-9af9-45ac-8ea9-d8221d7fa2af).

What should we do with the IH contact? Ignore it? Add it as a GrSciColl person and link it to the institution/collection? For the latter, a first name is required, so we would need to make one up.

I don't know who can answer this best @timrobertson100 @MortenHofft @ManonGros

@MortenHofft
Member

Those contact fields are not for a person; they are for the herbarium as an entity, which matters because people come and go. I am quite sure this would be considered essential from an IH standpoint, and my feeling is that it is important as well. So I would suggest we extend our model instead, but better to check with others too.

As for people/staff: IH has an endpoint for those as well. As far as I know, they are only linked by institution codes. In time we should sync those too, but we might want to discuss our goal for handling contacts of this sort (ORCID etc.) a bit more. @ManonGros, do you have a preferred approach for this?

@marcos-lg
Contributor

I like the idea of extending our model.

@ManonGros
Contributor

For the herbarium contacts, I agree with you @MortenHofft: we should extend our model to have something like what we have for the GBIF publishing organisations (see for example "email":["geir@natsam.no"],"phone":["+47 99642071"] in http://api.gbif.org/v1/organization/b670ea7c-48e7-45e4-ba66-5bf01ee4d398).
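A rough sketch of what such an extension could look like, mirroring the organization model's email/phone arrays. The field names here are illustrative assumptions, not the actual GrSciColl schema:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch: entity-level contact fields on a GrSciColl
# institution, modelled after the GBIF organization's email/phone lists.
@dataclass
class Institution:
    code: str
    name: str
    email: List[str] = field(default_factory=list)  # herbarium-level, not a person
    phone: List[str] = field(default_factory=list)
    homepage: str = ""                              # IH "webUrl" could map here
```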

For people/staff, I also agree: we should synchronise/import the people as well. Perhaps even before we synchronise the institutions? (It would seem logical to update the contacts when synchronising the IH institutions, but that requires the staff/people to be up to date.)

As far as I understand, for us a staff member can have a primary institution but be affiliated with several collections and institutions, while in IH one person is associated with a single institution code. Plus the information held is a bit different (http://api.gbif.org/v1/grscicoll/person/118b48f0-9af9-45ac-8ea9-d8221d7fa2af and http://sweetgum.nybg.org/science/ih/person-details/?irn=131429).

For synchronising people/staff, should we proceed as we do for the institutions, i.e. check matches semi-automatically first? If so, how could we link them? There are no identifiers or machine tags for people. Plus, as Morten suggested, we could use ORCIDs when available, but I doubt that most people have created one. And even for those who have, we need to find them first.

I don't know if it is possible at all, but ideally I imagine something like this:

  1. Find potential ORCIDs for all the GrSciColl staff/people (if we have confirmation that the ORCID is correct for a given person, synchronise with it in priority)
  2. Match and link the IH person list with the GrSciColl staff/people
  3. Update the GrSciColl staff entries if they are older than IH
  4. Synchronise the GrSciColl institutions with IH (based on the identifiers we use to link them after our matching/checking, e.g. what we did in UAT)

I know it is not that simple, let me know what you think.
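The ordered approach in the numbered list above could be sketched roughly as follows. Every helper name here is an assumption for illustration, not an existing GBIF API:

```python
# Hypothetical sketch of the four ordered steps above; the helpers
# (find_orcid, match_ih_person, ...) are placeholders injected as arguments.
def sync_in_order(staff, find_orcid, match_ih_person, ih_newer,
                  update_from_ih, sync_institutions):
    for person in staff:
        # Step 1: find a potential ORCID when one is not already recorded
        person["orcid"] = person.get("orcid") or find_orcid(person)
        # Step 2: match and link the IH person list to GrSciColl staff
        ih_person = match_ih_person(person)
        # Step 3: update the GrSciColl entry only if IH is more recent
        if ih_person and ih_newer(person, ih_person):
            update_from_ih(person, ih_person)
    # Step 4: only then synchronise the institutions themselves
    sync_institutions()
```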

@marcos-lg
Contributor

marcos-lg commented Jan 7, 2020

Syncing the staff is already part of this task's description, so I was planning to sync them in this process. I don't think we need to do anything manually.

EDIT: when I said we don't need to do anything manually, I meant I will try to match them using the name, email or any other representative field (I did something similar in the last DB migration, although the matching is not perfect because a lot of staff are duplicated with just a different address or phone), and if I can't match a person to any existing one, I will create a new one. Still, this matching won't be perfect, as mentioned; if we want it to be more accurate, we need some manual editing.
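The match-or-create behaviour described above could look roughly like this. A sketch only: the record shapes and helper are assumptions, and real matching would need fuzzier rules than exact equality:

```python
# Hypothetical sketch: match an IH person against existing GrSciColl staff
# on representative fields (email, name); create a new record if no match.
def sync_person(ih_person, grscicoll_staff, create_person):
    for staff in grscicoll_staff:
        same_email = ih_person.get("email") and staff.get("email") == ih_person["email"]
        same_name = ih_person.get("name") and staff.get("name") == ih_person["name"]
        if same_email or same_name:
            return staff                 # reuse the existing entry (may still be a duplicate)
    return create_person(ih_person)      # no candidate found: create a new staff record
```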

@kcopas
Member

kcopas commented Jan 8, 2020

Plus as Morten suggested, we could use the ORCiDs when available but I doubt that most people have created one.

As of Dec 2017, there were 454,000 users in the biological sciences who had created ORCID iDs, one of the three highest adoption rates of any discipline (see the Study of ORCID Adoption Across Disciplines and Locations). To be honest, we should commit to this, use the existing infrastructure (including becoming an ORCID member, in my opinion) and encourage members of the community to sign up, the promise being that we can provide value-for-service if they do.

Note that Bloodhound is already using ORCID iDs to pull both past and present institutional affiliations, e.g. https://bloodhound-tracker.net/organization/Q1122595. You will all know better how that works, but we could also consider this as (part of?) our approach…

@timrobertson100
Member Author

timrobertson100 commented Jan 8, 2020 via email

@ManonGros
Contributor

Something else to take into account for the synchronisation:
In the long term, we want IH records to be edited directly in IH and then synchronised with GrSciColl.
But right now we have a handful of editors who have already been editing their GrSciColl records, which means that GrSciColl, not IH, contains the most up-to-date information about those collections/institutions.
See this example:

These are only a few cases, but it would be nice not to overwrite these entries. For now, we should check the modified dates before synchronising and notify IH if the GrSciColl version is more up to date.
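That safeguard could be as simple as a timestamp comparison before applying an IH update. A sketch under the assumption that both systems expose a reliable modification date; the function name and the "notify-ih" action are illustrative:

```python
from datetime import datetime

# Hypothetical sketch: decide whether an IH record may overwrite the
# GrSciColl one, based on which was modified more recently.
def resolve(grscicoll_modified: datetime, ih_modified: datetime) -> str:
    if ih_modified > grscicoll_modified:
        return "overwrite"   # IH holds the newer edits: safe to sync
    return "notify-ih"       # GrSciColl is more up to date: don't overwrite, tell IH
```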

marcos-lg added a commit that referenced this issue Jan 24, 2020
* [maven-release-plugin] prepare for next development iteration

* Update gbif-doi to version 2.7

* [maven-release-plugin] prepare release registry-2.120

* [maven-release-plugin] prepare for next development iteration

* Upgrade to API with fixed notification_addresses key.

gbif/portal-feedback#2046

* Changes to endorsement email.

Requested in gbif/portal-feedback#2126

* Omit repeated Download objects from DatasetOccurrenceDownloadUsage responses.

Resolves #134.

* Update download-query-tools to support huge downloads with many taxa.

* Released versions.

* [maven-release-plugin] prepare release registry-2.121

* [maven-release-plugin] prepare for next development iteration

* Hack XML validation test to pass, avoiding redirect to HTTPS for the DC schema.

* [maven-release-plugin] prepare release registry-2.122

* [maven-release-plugin] prepare for next development iteration

* Update API version, for download predicate limits/changes.

gbif/occurrence#50

* Always include ENDORSE link in new publisher emails.

* added earthCape installation type

* updated gbif-common-ws version

* updated common-mybatis version

* [maven-release-plugin] prepare release registry-2.123

* [maven-release-plugin] prepare for next development iteration

* updated gbif-api version

* updated gbif-api version

* [maven-release-plugin] prepare release registry-2.124

* [maven-release-plugin] prepare for next development iteration

* SQLDownloadRequest was replaced with SqlDownloadRequest

* Implement search by installation type.

* Allow editors to see their organization's shared token.

Resolves #121.

* Align DataCite metadata with citation guidelines.

Resolves #137.

* Allow dataset editors to edit default-term.gbif.org machine tags.

Resolves #120.

* Fix copy-paste error.

* Allow deleting default-term machine tags.

Resolves #120.

* Add missing Liquibase change.

* pipelines history tracking service migrated to the registry

* tests pipeline process ws

* adding a crawlall endpoint and supporting the platform parameter

* moving page size to constant

* cleanup

* Correction to test.

* cleanup

* added tests pipelines

* javadoc

* pipelinesModule changed not to install postal service

* Check node permissions when setting endorsement.

Resolves #140.

* Released version.

* removing datasetTilte from PipelineProcess + pipelines enums added to ws

* updated gbif-api version

* fixed enumeration resource test

* pipelines history: added tests + small fixes

* added metrics to pipelines history

* added pipelines properties to test resource

* index url for pipelines metrics

* added log

* changed metrics type handler not to store empty values in DB

* fix metrics url

* fixed url creation for pipelines metrics

* last attempt throws exception if not found

* not throwing exception when a crawl dataset fails

* updated versions of gbif-api and postal-service

* modified loops for crawl all and rerun all pipelines when dataset fails

* added logs

* cleanup

* [maven-release-plugin] prepare release registry-2.125

* [maven-release-plugin] prepare for next development iteration

* fix bug in rerun all pipelines steps

* fix loop run and crawl all pipelines

* crawAll and runAll pipelines executed async

* less verbose logs

* [maven-release-plugin] prepare release registry-2.126

* [maven-release-plugin] prepare for next development iteration

* replaced insert with upsert to create pipelines history process to avoid concurrency issues when calling from crawler

* [maven-release-plugin] prepare release registry-2.127

* [maven-release-plugin] prepare for next development iteration

* handling the TO_VERBATIM step by transforming it into a specific step for ABCD, DWCA and XML

* using latest api that has the TO_VERBATIM step

* [maven-release-plugin] prepare release registry-2.128

* [maven-release-plugin] prepare for next development iteration

* Update postal-service to 0.38

* Fix compilation error in DatasetResource, StartCrawlMessage constructor parameters

* Changed mybatis.version to the old (TIMESTAMP issue)

* Improve DatasetProcessStatusIT

* adapted dataset process status to new mybatis version

* pipelines history ordered by created date

* defensive checks for ES metrics

* [maven-release-plugin] prepare release registry-2.129

* [maven-release-plugin] prepare for next development iteration

* #152 returning json response when steps are null

* pipeline steps ordered in SQL query

* updated gbif-api version

* changed the check of input params in pipelines history

* get DOI URL decoded for citations

* Decoding DOI URL in citation

* updated gbif-api version

* [maven-release-plugin] prepare release registry-2.130

* [maven-release-plugin] prepare for next development iteration

* #156 Refactor, fix geoLocation mapping part

* Improve DatasetProcessStatusIT

* #156 refactoring and geoLocation mapping

* Reorganize classes in registry-doi

* Replace DataCiteConverter with specific ones DownloadConverter or DatasetConverter

* Fix DownloadConverter#truncateDescriptionDCM and tests

* Refactor DatasetConverter

* Refactor DatasetConverter

* Improve DatasetConverterTest, add RegistryDoiUtils

* Refactor DownloadConverter

* Refactor DownloadConverterTest

* Fix RegistryDoiUtilsTest date problem

* Fix DatasetConverterTest and DownloadConverterTest date issue

* CustomDownloadDataCiteConverter

* Improve language mapping for DatasetConverter

* [maven-release-plugin] prepare release registry-2.131

* [maven-release-plugin] prepare for next development iteration

* fix bug when running all and crawling all datasets

* [maven-release-plugin] prepare release registry-2.132

* [maven-release-plugin] prepare for next development iteration

* added checks for empty metrics from ES in pipelines history

* Fixed pipelines message order, monitoring and index prefix for doOnAll

* added number of records to pipeline process + fix new steps

* cleaned import

* added number of records in pipeline process

* added number of records in pipeline process

* added log for number of records in pipeline process

* Updated gbif-postal-service.version

* updated gbif-api version

* [maven-release-plugin] prepare release registry-2.133

* [maven-release-plugin] prepare for next development iteration

* small refactor

* Update README.md

* Refactor NodeIT

* Refactor NetworkEntityTest#testUpdate

* Refactor NetworkEntityTest

* Fix LenientAssert

* Cleanup NodeResource

* Reformat ws/security package

* changed ES metrics type handler to avoid issues with unexpected values

* run all pipelines and crawl all now include sampling event datasets too

* [maven-release-plugin] prepare release registry-2.134

* [maven-release-plugin] prepare for next development iteration

* [maven-release-plugin] prepare release registry-2.135

* [maven-release-plugin] prepare for next development iteration

* [maven-release-plugin] prepare release registry-2.136

* [maven-release-plugin] prepare for next development iteration

* [maven-release-plugin] prepare release registry-2.137

* [maven-release-plugin] prepare for next development iteration

* fixed runAll and crawAll for pipelines

* [maven-release-plugin] prepare release registry-2.138

* [maven-release-plugin] prepare for next development iteration

* added PipelineProcessView to show a custom view in the registry-console

* added checklist datasets to runAll and crawlAll + datasetTitle to process

* added checklist datasets to runAll and crawlAll + datasetTitle to process

* changed test DOIs

* added checks for number of records in pipelines process

* added checks for number of records in pipelines process

* crawling all datasets since even some METADATA only datasets are associated to occurrence records

* crawAll includes now all datasets

* [maven-release-plugin] prepare release registry-2.139

* [maven-release-plugin] prepare for next development iteration

* Reorder filters in RegistryWsServletListener, EditorFilter must be the last one

* Add additional checks to EditorAuthorizationFilter

* EditorAuthorizationFilter improve user is null case

* EditorAuthorizationFilter improvements

* EditorAuthorizationFilter change regex pattern in order to match the whole path

* EditorAuthorizationFilter check methods return void

* EditorAuthorizationFilter exclude endorsement and machine tags

* added filter to exclude some datasets in crawAll and runAll pipelines

* [maven-release-plugin] prepare release registry-2.140

* [maven-release-plugin] prepare for next development iteration

* added workaround to ignore Optional values in pipelines history

* revert workaround PipelinesAbdcMessage

* updated gbif-postal-service version

* pipelines history minor changes

* updated postal-service version

* updated postal-service version

* [maven-release-plugin] prepare release registry-2.141

* [maven-release-plugin] prepare for next development iteration

* ingestion service that merges crawl and pipelines history

* ingestion service that merges crawl and pipelines history

* ingestion service that merges crawl and pipelines history

* removed MetricsHandler and added tests

* test versions

* test versions

* fixed test

* added remarks in PipelineProcessMapper.xml + fix tests

* fix ingestion history when pipeline process doesn't exist

* updated cloudera version

* changes type of steps to run to be text instead of enum

* minor improvements pipelines history

* #159 Skeleton code for Index Herbariorum synchronization

* improved response of run pipeline attempt + improved order of pipeline history

* taking basicRecordsCountAttempted as number of records for verbatimToInterpreted step

* updated versions to release (including cdh 5.12.0)

* [maven-release-plugin] prepare release registry-2.142

* [maven-release-plugin] prepare for next development iteration

* fix sorting in pipelines history

* fix sorting in pipelines history

* [maven-release-plugin] prepare release registry-2.143

* [maven-release-plugin] prepare for next development iteration

* optimized method to get ingestion history to do less queries since this method is used very often by the UI

* fix case when there is no dataset process statues in ingestion history

* fix case when there is no dataset process statues in ingestion history

* fix case when there is no dataset process statues in ingestion history

* adapted classes for the http calls + entity converter + github client + extended grscicoll model

* sync staff + refactor to make it easier to test

* sync staff + refactor to make it easier to test

* added tests

* added tests

* added cliSyncApp skeleton

* removed lombok builders in entities used in WS because they need public constructor

* CliSyncApp + tests

* github issues assignees externalized to properties + fixes format diff file

* added failed actions + improvements

* fix test

* added links to entities in GH issues + mapping IH countries to our enum

* rollback test

* mapping countries from IH to our enum + gh issues links + tests

* improved country mapping

* minor fixes

* issues moved out from diff finder + issues for fails + using map for matches

* config file for tests

* changed config test

* check for duplicate codes in grscicoll + added search by code and name

* code unique

* made GrSciColl entities machine taggable

* adding identifiers manually to person in IH-sync

* removed files pushed by mistake

* removed check for duplicate codes + added numberSpecimens to collections

* removed TODO

Co-authored-by: GBIF Jenkins Bot <dev@gbif.org>
Co-authored-by: Matt Blissett <matt@blissett.me.uk>
Co-authored-by: Mikhail Podolskiy <mike.podolskiy90@gmail.com>
Co-authored-by: Federico Mendez <federicomh@gmail.com>
Co-authored-by: Nikolay Volik <nikolay.volik@hotmail.com>
Co-authored-by: Tim Robertson <timrobertson100@gmail.com>
@marcos-lg marcos-lg added the GRSciColl Issues related to institutions, collections and staff label Apr 29, 2020
@marcos-lg
Contributor

In production and scheduled to run weekly.
