
Synchronize with Index Herbariorum #159

Closed
timrobertson100 opened this issue Dec 11, 2019 · 9 comments
@timrobertson100
Member

Index Herbariorum (IH) is an authoritative catalog that should be the master source for herbaria entities. Herbaria records in the registry should be kept in sync with the ongoing editing efforts of IH.

This first iteration of work is deliberately scoped to accommodate the minimal functionality needed to achieve this. Once complete, additional feature requests can be opened as new issues.

It is envisaged the general synchronization will operate as follows:

  • Retrieve all herbaria from Index Herbariorum
  • For each entity, locate the equivalent institution or collection in GRSciColl using the IH IRN
    • If the entity exists and the two differ, update GRSciColl
    • If the entity does not exist, insert it as an institution with an identifier holding the IH IRN
    • If there is a conflict (e.g. multiple candidate matches), notify the editors for resolution
  • Create, update or delete the associated staff members for the entities

A future version may allow the editing of IH entities in GRSciColl. In that scenario, when entities differ, more complex logic is required, likely notifying both GRSciColl and IH staff to resolve the differences.
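The synchronization outline above can be sketched as a simple loop. This is only a sketch; all six helper names are hypothetical placeholders, not existing GBIF APIs:

```python
# A minimal sketch of the IH -> GRSciColl sync loop described above.
# All six helpers are hypothetical placeholders injected as arguments.
def sync_ih(fetch_ih_herbaria, find_by_irn, entities_differ,
            update_entity, insert_institution, notify_editors):
    for herbarium in fetch_ih_herbaria():
        matches = find_by_irn(herbarium["irn"])
        if len(matches) > 1:
            # Conflict (multiple candidates): hand over to the editors
            notify_editors(herbarium, matches)
        elif len(matches) == 1:
            if entities_differ(herbarium, matches[0]):
                # IH is the master source in this first iteration
                update_entity(matches[0], herbarium)
        else:
            # New entity: insert as an institution, keeping the IH IRN
            insert_institution(herbarium)
```

Injecting the helpers keeps the decision logic testable independently of the IH and registry web services.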

@marcos-lg
Contributor

The institutions from IH have a contact field that holds only a phone, an email and a webUrl (http://sweetgum.nybg.org/science/api/v1/institutions/UARK). A GrSciColl institution/collection has a contacts field, but those contacts are actually person entities (http://api.gbif.org/v1/grscicoll/institution/f7068d69-cf88-42d8-a984-0c4de6ce8579, whose contact is http://api.gbif.org/v1/grscicoll/person/118b48f0-9af9-45ac-8ea9-d8221d7fa2af).

What should we do with the IH contact? Ignore it? Add it as a GrSciColl person and link it to the institution/collection? For the latter, a first name is required, so we would need to make one up.

I don't know who can answer this best @timrobertson100 @MortenHofft @ManonGros

@MortenHofft
Member

Those contact fields are not for a person; they are for the herbarium as an entity, which matters because people come and go. I am quite sure this would be considered essential from an IH standpoint, and my feeling is that it is important as well. So I would suggest we extend our model instead, but better to check with others too.

As for people/staff: IH has an endpoint for those as well. As far as I know, they are only linked by institution codes. In time we should sync those too, but we might want to discuss our goal for handling contacts of this sort (ORCID etc.) a bit more. @ManonGros, do you have a preferred approach for this?

@marcos-lg
Contributor

I like the idea of extending our model.

@ManonGros
Contributor

For the herbarium contacts, I agree with you @MortenHofft: we should extend our model to have something like what we have for the GBIF publishing organisations (see for example "email":["geir@natsam.no"],"phone":["+47 99642071"] in http://api.gbif.org/v1/organization/b670ea7c-48e7-45e4-ba66-5bf01ee4d398).
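A rough sketch of what such an extension could look like, mirroring the organization model's email/phone arrays. The field names here are illustrative assumptions, not the actual GrSciColl schema:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch: entity-level contact fields on a GrSciColl
# institution, modelled after the GBIF organization's email/phone lists.
@dataclass
class Institution:
    code: str
    name: str
    email: List[str] = field(default_factory=list)  # herbarium-level, not a person
    phone: List[str] = field(default_factory=list)
    homepage: str = ""                              # IH "webUrl" could map here
```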

For people/staff, I also agree: we should synchronise/import the people as well. Perhaps even before we synchronise the institutions? (It would seem logical to update the contacts when synchronising the IH institutions, but that requires the staff/people to be up to date.)

As far as I understand, for us a staff member can have a primary institution but be affiliated with several collections and institutions, while in IH one person is associated with a single institution code. Plus the information held is a bit different (http://api.gbif.org/v1/grscicoll/person/118b48f0-9af9-45ac-8ea9-d8221d7fa2af and http://sweetgum.nybg.org/science/ih/person-details/?irn=131429).

For synchronising people/staff, should we proceed as we do for the institutions, i.e. check matches semi-automatically first? If so, how could we link them? There are no identifiers or machine tags for people. Plus, as Morten suggested, we could use ORCIDs when available, but I doubt that most people have created one. And even for those who have, we need to find them first.

I don't know if it is possible at all, but ideally I imagine something like this:

  1. Find potential ORCIDs for all the GrSciColl staff/people (if we have confirmation that the ORCID is correct for a given person, synchronise with it in priority)
  2. Match and link the IH person list with the GrSciColl staff/people
  3. Update the GrSciColl staff entries if they are older than IH
  4. Synchronise the GrSciColl institutions with IH (based on the identifiers we use to link them after our matching/checking, e.g. what we did in UAT)

I know it is not that simple, let me know what you think.
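The ordered approach in the numbered list above could be sketched roughly as follows. Every helper name here is an assumption for illustration, not an existing GBIF API:

```python
# Hypothetical sketch of the four ordered steps above; the helpers
# (find_orcid, match_ih_person, ...) are placeholders injected as arguments.
def sync_in_order(staff, find_orcid, match_ih_person, ih_newer,
                  update_from_ih, sync_institutions):
    for person in staff:
        # Step 1: find a potential ORCID when one is not already recorded
        person["orcid"] = person.get("orcid") or find_orcid(person)
        # Step 2: match and link the IH person list to GrSciColl staff
        ih_person = match_ih_person(person)
        # Step 3: update the GrSciColl entry only if IH is more recent
        if ih_person and ih_newer(person, ih_person):
            update_from_ih(person, ih_person)
    # Step 4: only then synchronise the institutions themselves
    sync_institutions()
```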

@marcos-lg
Contributor

marcos-lg commented Jan 7, 2020

Syncing the staff is already part of this task's description, so I was planning to sync them in this process. I don't think we need to do anything manually.

EDIT: when I said we don't need to do anything manually, I meant I will try to match them using the name, email or any other representative field (I did something similar in the last DB migration, although the matching is not perfect because a lot of staff are duplicated with just a different address or phone), and if I can't match a person to any existing one, I will create a new one. Still, this matching won't be perfect, as mentioned; if we want it to be more accurate, we need some manual editing.
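The match-or-create behaviour described above could look roughly like this. A sketch only: the record shapes and helper are assumptions, and real matching would need fuzzier rules than exact equality:

```python
# Hypothetical sketch: match an IH person against existing GrSciColl staff
# on representative fields (email, name); create a new record if no match.
def sync_person(ih_person, grscicoll_staff, create_person):
    for staff in grscicoll_staff:
        same_email = ih_person.get("email") and staff.get("email") == ih_person["email"]
        same_name = ih_person.get("name") and staff.get("name") == ih_person["name"]
        if same_email or same_name:
            return staff                 # reuse the existing entry (may still be a duplicate)
    return create_person(ih_person)      # no candidate found: create a new staff record
```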

@kcopas
Member

kcopas commented Jan 8, 2020

Plus as Morten suggested, we could use the ORCiDs when available but I doubt that most people have created one.

As of Dec 2017, there were 454,000 users in the biological sciences who had created ORCID iDs, one of the three highest adoption rates of any discipline (see the Study of ORCID Adoption Across Disciplines and Locations). To be honest, we should commit to this, use the existing infrastructure (including becoming an ORCID member, in my opinion) and encourage members of the community to sign up, the promise being that we can provide value-for-service if they do.

Note that Bloodhound is already using ORCID iDs to pull both past and present institutional affiliations, e.g. https://bloodhound-tracker.net/organization/Q1122595. You will all know better how that works, but we could also consider this as (part of?) our approach…

@timrobertson100
Member Author

timrobertson100 commented Jan 8, 2020 via email

@ManonGros
Contributor

Something else to take into account for the synchronisation:
In the long term, we want IH records to be edited directly in IH and then synchronised with GrSciColl.
But right now we have a handful of editors who have already been editing their GrSciColl records, which means that GrSciColl, not IH, contains the most up-to-date information about those collections/institutions.
See this example:

These are only a few cases, but it would be nice not to overwrite these entries. For now, we should check the modified dates before synchronising and notify IH if the GrSciColl version is more up to date.
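That safeguard could be as simple as a timestamp comparison before applying an IH update. A sketch under the assumption that both systems expose a reliable modification date; the function name and the "notify-ih" action are illustrative:

```python
from datetime import datetime

# Hypothetical sketch: decide whether an IH record may overwrite the
# GrSciColl one, based on which was modified more recently.
def resolve(grscicoll_modified: datetime, ih_modified: datetime) -> str:
    if ih_modified > grscicoll_modified:
        return "overwrite"   # IH holds the newer edits: safe to sync
    return "notify-ih"       # GrSciColl is more up to date: don't overwrite, tell IH
```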

marcos-lg added a commit that referenced this issue Jan 24, 2020
* [maven-release-plugin] prepare for next development iteration

* Update gbif-doi to version 2.7

* [maven-release-plugin] prepare release registry-2.120

* [maven-release-plugin] prepare for next development iteration

* Upgrade to API with fixed notification_addresses key.

gbif/portal-feedback#2046

* Changes to endorsement email.

Requested in gbif/portal-feedback#2126

* Omit repeated Download objects from DatasetOccurrenceDownloadUsage responses.

Resolves #134.

* Update download-query-tools to support huge downloads with many taxa.

* Released versions.

* [maven-release-plugin] prepare release registry-2.121

* [maven-release-plugin] prepare for next development iteration

* Hack XML validation test to pass, avoiding redirect to HTTPS for the DC schema.

* [maven-release-plugin] prepare release registry-2.122

* [maven-release-plugin] prepare for next development iteration

* Update API version, for download predicate limits/changes.

gbif/occurrence#50

* Always include ENDORSE link in new publisher emails.

* added earthCape installation type

* updated gbif-common-ws version

* updated common-mybatis version

* [maven-release-plugin] prepare release registry-2.123

* [maven-release-plugin] prepare for next development iteration

* updated gbif-api version

* updated gbif-api version

* [maven-release-plugin] prepare release registry-2.124

* [maven-release-plugin] prepare for next development iteration

* SQLDownloadRequest was replaced with SqlDownloadRequest

* Implement search by installation type.

* Allow editors to see their organization's shared token.

Resolves #121.

* Align DataCite metadata with citation guidelines.

Resolves #137.

* Allow dataset editors to edit default-term.gbif.org machine tags.

Resolves #120.

* Fix copy-paste error.

* Allow deleting default-term machine tags.

Resolves #120.

* Add missing Liquibase change.

* pipelines history tracking service migrated to the registry

* tests pipeline process ws

* adding a crawlall endpoint and supporting the platform parameter

* moving page size to constant

* cleanup

* Correction to test.

* cleanup

* added tests pipelines

* javadoc

* pipelinesModule changed not to install postal service

* Check node permissions when setting endorsement.

Resolves #140.

* Released version.

* removing datasetTilte from PipelineProcess + pipelines enums added to ws

* updated gbif-api version

* fixed enumeration resource test

* pipelines history: added tests + small fixes

* added metrics to pipelines history

* added pipelines properties to test resource

* index url for pipelines metrics

* added log

* changed metrics type handler not to store empty values in DB

* fix metrics url

* fixed url creation for pipelines metrics

* last attempt throws exception if not found

* not throwing exception when a crawl dataset fails

* updated versions of gbif-api and postal-service

* modified loops for crawl all and rerun all pipelines when dataset fails

* added logs

* cleanup

* [maven-release-plugin] prepare release registry-2.125

* [maven-release-plugin] prepare for next development iteration

* fix bug in rerun all pipelines steps

* fix loop run and crawl all pipelines

* crawAll and runAll pipelines executed async

* less verbose logs

* [maven-release-plugin] prepare release registry-2.126

* [maven-release-plugin] prepare for next development iteration

* replaced insert with upsert to create pipelines history process to avoid concurrency issues when calling from crawler

* [maven-release-plugin] prepare release registry-2.127

* [maven-release-plugin] prepare for next development iteration

* handling the TO_VERBATIM step by transforming it into a specific step for ABCD, DWCA and XML

* using latest api that has the TO_VERBATIM step

* [maven-release-plugin] prepare release registry-2.128

* [maven-release-plugin] prepare for next development iteration

* Update postal-service to 0.38

* Fix compilation error in DatasetResource, StartCrawlMessage constructor parameters

* Changed mybatis.version to the old (TIMESTAMP issue)

* Improve DatasetProcessStatusIT

* adapted dataset process status to new mybatis version

* pipelines history ordered by created date

* defensive checks for ES metrics

* [maven-release-plugin] prepare release registry-2.129

* [maven-release-plugin] prepare for next development iteration

* #152 returning json response when steps are null

* pipeline steps ordered in SQL query

* updated gbif-api version

* changed the check of input params in pipelines history

* get DOI URL decoded for citations

* Decoding DOI URL in citation

* updated gbif-api version

* [maven-release-plugin] prepare release registry-2.130

* [maven-release-plugin] prepare for next development iteration

* #156 Refactor, fix geoLocation mapping part

* Improve DatasetProcessStatusIT

* #156 refactoring and geoLocation mapping

* Reorganize classes in registry-doi

* Replace DataCiteConverter with specific ones DownloadConverter or DatasetConverter

* Fix DownloadConverter#truncateDescriptionDCM and tests

* Refactor DatasetConverter

* Refactor DatasetConverter

* Improve DatasetConverterTest, add RegistryDoiUtils

* Refactor DownloadConverter

* Refactor DownloadConverterTest

* Fix RegistryDoiUtilsTest date problem

* Fix DatasetConverterTest and DownloadConverterTest date issue

* CustomDownloadDataCiteConverter

* Improve language mapping for DatasetConverter

* [maven-release-plugin] prepare release registry-2.131

* [maven-release-plugin] prepare for next development iteration

* fix bug when running all and crawling all datasets

* [maven-release-plugin] prepare release registry-2.132

* [maven-release-plugin] prepare for next development iteration

* added checks for empty metrics from ES in pipelines history

* Fixed pipelines message order, monitoring and index prefix for doOnAll

* added number of records to pipeline process + fix new steps

* cleaned import

* added number of records in pipeline process

* added number of records in pipeline process

* added log for number of records in pipeline process

* Updated gbif-postal-service.version

* updated gbif-api version

* [maven-release-plugin] prepare release registry-2.133

* [maven-release-plugin] prepare for next development iteration

* small refactor

* Update README.md

* Refactor NodeIT

* Refactor NetworkEntityTest#testUpdate

* Refactor NetworkEntityTest

* Fix LenientAssert

* Cleanup NodeResource

* Reformat ws/security package

* changed ES metrics type handler to avoid issues with unexpected values

* run all pipelines and crawl all now include sampling event datasets too

* [maven-release-plugin] prepare release registry-2.134

* [maven-release-plugin] prepare for next development iteration

* [maven-release-plugin] prepare release registry-2.135

* [maven-release-plugin] prepare for next development iteration

* [maven-release-plugin] prepare release registry-2.136

* [maven-release-plugin] prepare for next development iteration

* [maven-release-plugin] prepare release registry-2.137

* [maven-release-plugin] prepare for next development iteration

* fixed runAll and crawAll for pipelines

* [maven-release-plugin] prepare release registry-2.138

* [maven-release-plugin] prepare for next development iteration

* added PipelineProcessView to show a custom view in the registry-console

* added checklist datasets to runAll and crawlAll + datasetTitle to process

* added checklist datasets to runAll and crawlAll + datasetTitle to process

* changed test DOIs

* added checks for number of records in pipelines process

* added checks for number of records in pipelines process

* crawling all datasets since even some METADATA only datasets are associated to occurrence records

* crawAll includes now all datasets

* [maven-release-plugin] prepare release registry-2.139

* [maven-release-plugin] prepare for next development iteration

* Reorder filters in RegistryWsServletListener, EditorFilter must be the last one

* Add additional checks to EditorAuthorizationFilter

* EditorAuthorizationFilter improve user is null case

* EditorAuthorizationFilter improvements

* EditorAuthorizationFilter change regex pattern in order to match the whole path

* EditorAuthorizationFilter check methods return void

* EditorAuthorizationFilter exclude endorsement and machine tags

* added filter to exclude some datasets in crawAll and runAll pipelines

* [maven-release-plugin] prepare release registry-2.140

* [maven-release-plugin] prepare for next development iteration

* added workaround to ignore Optional values in pipelines history

* revert workaround PipelinesAbdcMessage

* updated gbif-postal-service version

* pipelines history minor changes

* updated postal-service version

* updated postal-service version

* [maven-release-plugin] prepare release registry-2.141

* [maven-release-plugin] prepare for next development iteration

* ingestion service that merges crawl and pipelines history

* ingestion service that merges crawl and pipelines history

* ingestion service that merges crawl and pipelines history

* removed MetricsHandler and added tests

* test versions

* test versions

* fixed test

* added remarks in PipelineProcessMapper.xml + fix tests

* fix ingestion history when pipeline process doesn't exist

* updated cloudera version

* changes type of steps to run to be text instead of enum

* minor improvements pipelines history

* #159 Skeleton code for Index Herbariorum synchronization

* improved response of run pipeline attempt + improved order of pipeline history

* taking basicRecordsCountAttempted as number of records for verbatimToInterpreted step

* updated versions to release (including cdh 5.12.0)

* [maven-release-plugin] prepare release registry-2.142

* [maven-release-plugin] prepare for next development iteration

* fix sorting in pipelines history

* fix sorting in pipelines history

* [maven-release-plugin] prepare release registry-2.143

* [maven-release-plugin] prepare for next development iteration

* optimized method to get ingestion history to do less queries since this method is used very often by the UI

* fix case when there is no dataset process statues in ingestion history

* fix case when there is no dataset process statues in ingestion history

* fix case when there is no dataset process statues in ingestion history

* adapted classes for the http calls + entity converter + github client + extended grscicoll model

* sync staff + refactor to make it easier to test

* sync staff + refactor to make it easier to test

* added tests

* added tests

* added cliSyncApp skeleton

* removed lombok builders in entities used in WS because they need public constructor

* CliSyncApp + tests

* github issues assignees externalized to properties + fixes format diff file

* added failed actions + improvements

* fix test

* added links to entities in GH issues + mapping IH countries to our enum

* rollback test

* mapping countries from IH to our enum + gh issues links + tests

* improved country mapping

* minor fixes

* issues moved out from diff finder + issues for fails + using map for matches

* config file for tests

* changed config test

* check for duplicate codes in grscicoll + added search by code and name

* code unique

* made GrSciColl entities machine taggable

* adding identifiers manually to person in IH-sync

* removed files pushed by mistake

* removed check for duplicate codes + added numberSpecimens to collections

* removed TODO

Co-authored-by: GBIF Jenkins Bot <dev@gbif.org>
Co-authored-by: Matt Blissett <matt@blissett.me.uk>
Co-authored-by: Mikhail Podolskiy <mike.podolskiy90@gmail.com>
Co-authored-by: Federico Mendez <federicomh@gmail.com>
Co-authored-by: Nikolay Volik <nikolay.volik@hotmail.com>
Co-authored-by: Tim Robertson <timrobertson100@gmail.com>
@marcos-lg marcos-lg added the GRSciColl Issues related to institutions, collections and staff label Apr 29, 2020
@marcos-lg
Contributor

In production and scheduled to run weekly.
