Collections catalogue (#168)
* [maven-release-plugin] prepare for next development iteration

* Update gbif-doi to version 2.7

* [maven-release-plugin] prepare release registry-2.120

* [maven-release-plugin] prepare for next development iteration

* Upgrade to API with fixed notification_addresses key.

gbif/portal-feedback#2046

* Changes to endorsement email.

Requested in gbif/portal-feedback#2126

* Omit repeated Download objects from DatasetOccurrenceDownloadUsage responses.

Resolves #134.

* Update download-query-tools to support huge downloads with many taxa.

* Released versions.

* [maven-release-plugin] prepare release registry-2.121

* [maven-release-plugin] prepare for next development iteration

* Hack XML validation test to pass, avoiding redirect to HTTPS for the DC schema.

* [maven-release-plugin] prepare release registry-2.122

* [maven-release-plugin] prepare for next development iteration

* Update API version, for download predicate limits/changes.

gbif/occurrence#50

* Always include ENDORSE link in new publisher emails.

* added earthCape installation type

* updated gbif-common-ws version

* updated common-mybatis version

* [maven-release-plugin] prepare release registry-2.123

* [maven-release-plugin] prepare for next development iteration

* updated gbif-api version

* updated gbif-api version

* [maven-release-plugin] prepare release registry-2.124

* [maven-release-plugin] prepare for next development iteration

* SQLDownloadRequest was replaced with SqlDownloadRequest

* Implement search by installation type.

* Allow editors to see their organization's shared token.

Resolves #121.

* Align DataCite metadata with citation guidelines.

Resolves #137.

* Allow dataset editors to edit default-term.gbif.org machine tags.

Resolves #120.

* Fix copy-paste error.

* Allow deleting default-term machine tags.

Resolves #120.

* Add missing Liquibase change.

* pipelines history tracking service migrated to the registry

* tests pipeline process ws

* adding a crawlAll endpoint and supporting the platform parameter

* moving page size to constant

* cleanup

* Correction to test.

* cleanup

* added tests pipelines

* javadoc

* pipelinesModule changed not to install postal service

* Check node permissions when setting endorsement.

Resolves #140.

* Released version.

* removing datasetTitle from PipelineProcess + pipelines enums added to ws

* updated gbif-api version

* fixed enumeration resource test

* pipelines history: added tests + small fixes

* added metrics to pipelines history

* added pipelines properties to test resource

* index url for pipelines metrics

* added log

* changed metrics type handler not to store empty values in DB

* fix metrics url

* fixed url creation for pipelines metrics

* last attempt throws exception if not found

* not throwing exception when a crawl dataset fails

* updated versions of gbif-api and postal-service

* modified loops for crawl all and rerun all pipelines when dataset fails

* added logs

* cleanup

* [maven-release-plugin] prepare release registry-2.125

* [maven-release-plugin] prepare for next development iteration

* fix bug in rerun all pipelines steps

* fix loop run and crawl all pipelines

* crawlAll and runAll pipelines executed async

* less verbose logs

* [maven-release-plugin] prepare release registry-2.126

* [maven-release-plugin] prepare for next development iteration

* replaced insert with upsert to create pipelines history process to avoid concurrency issues when calling from crawler

* [maven-release-plugin] prepare release registry-2.127

* [maven-release-plugin] prepare for next development iteration

* handling the TO_VERBATIM step by transforming it into a specific step for ABCD, DWCA and XML

* using latest api that has the TO_VERBATIM step

* [maven-release-plugin] prepare release registry-2.128

* [maven-release-plugin] prepare for next development iteration

* Update postal-service to 0.38

* Fix compilation error in DatasetResource, StartCrawlMessage constructor parameters

* Changed mybatis.version to the old (TIMESTAMP issue)

* Improve DatasetProcessStatusIT

* adapted dataset process status to new mybatis version

* pipelines history ordered by created date

* defensive checks for ES metrics

* [maven-release-plugin] prepare release registry-2.129

* [maven-release-plugin] prepare for next development iteration

* #152 returning json response when steps are null

* pipeline steps ordered in SQL query

* updated gbif-api version

* changed the check of input params in pipelines history

* get DOI URL decoded for citations

* Decoding DOI URL in citation

* updated gbif-api version

* [maven-release-plugin] prepare release registry-2.130

* [maven-release-plugin] prepare for next development iteration

* #156 Refactor, fix geoLocation mapping part

* Improve DatasetProcessStatusIT

* #156 refactoring and geoLocation mapping

* Reorganize classes in registry-doi

* Replace DataCiteConverter with specific ones DownloadConverter or DatasetConverter

* Fix DownloadConverter#truncateDescriptionDCM and tests

* Refactor DatasetConverter

* Refactor DatasetConverter

* Improve DatasetConverterTest, add RegistryDoiUtils

* Refactor DownloadConverter

* Refactor DownloadConverterTest

* Fix RegistryDoiUtilsTest date problem

* Fix DatasetConverterTest and DownloadConverterTest date issue

* CustomDownloadDataCiteConverter

* Improve language mapping for DatasetConverter

* [maven-release-plugin] prepare release registry-2.131

* [maven-release-plugin] prepare for next development iteration

* fix bug when running all and crawling all datasets

* [maven-release-plugin] prepare release registry-2.132

* [maven-release-plugin] prepare for next development iteration

* added checks for empty metrics from ES in pipelines history

* Fixed pipelines message order, monitoring and index prefix for doOnAll

* added number of records to pipeline process + fix new steps

* cleaned import

* added number of records in pipeline process

* added number of records in pipeline process

* added log for number of records in pipeline process

* Updated gbif-postal-service.version

* updated gbif-api version

* [maven-release-plugin] prepare release registry-2.133

* [maven-release-plugin] prepare for next development iteration

* small refactor

* Update README.md

* Refactor NodeIT

* Refactor NetworkEntityTest#testUpdate

* Refactor NetworkEntityTest

* Fix LenientAssert

* Cleanup NodeResource

* Reformat ws/security package

* changed ES metrics type handler to avoid issues with unexpected values

* run all pipelines and crawl all now include sampling event datasets too

* [maven-release-plugin] prepare release registry-2.134

* [maven-release-plugin] prepare for next development iteration

* [maven-release-plugin] prepare release registry-2.135

* [maven-release-plugin] prepare for next development iteration

* [maven-release-plugin] prepare release registry-2.136

* [maven-release-plugin] prepare for next development iteration

* [maven-release-plugin] prepare release registry-2.137

* [maven-release-plugin] prepare for next development iteration

* fixed runAll and crawlAll for pipelines

* [maven-release-plugin] prepare release registry-2.138

* [maven-release-plugin] prepare for next development iteration

* added PipelineProcessView to show a custom view in the registry-console

* added checklist datasets to runAll and crawlAll + datasetTitle to process

* added checklist datasets to runAll and crawlAll + datasetTitle to process

* changed test DOIs

* added checks for number of records in pipelines process

* added checks for number of records in pipelines process

* crawling all datasets since even some METADATA only datasets are associated to occurrence records

* crawlAll now includes all datasets

* [maven-release-plugin] prepare release registry-2.139

* [maven-release-plugin] prepare for next development iteration

* Reorder filters in RegistryWsServletListener, EditorFilter must be the last one

* Add additional checks to EditorAuthorizationFilter

* EditorAuthorizationFilter improve user is null case

* EditorAuthorizationFilter improvements

* EditorAuthorizationFilter change regex pattern in order to match the whole path

* EditorAuthorizationFilter check methods return void

* EditorAuthorizationFilter exclude endorsement and machine tags

* added filter to exclude some datasets in crawlAll and runAll pipelines

* [maven-release-plugin] prepare release registry-2.140

* [maven-release-plugin] prepare for next development iteration

* added workaround to ignore Optional values in pipelines history

* revert workaround PipelinesAbdcMessage

* updated gbif-postal-service version

* pipelines history minor changes

* updated postal-service version

* updated postal-service version

* [maven-release-plugin] prepare release registry-2.141

* [maven-release-plugin] prepare for next development iteration

* ingestion service that merges crawl and pipelines history

* ingestion service that merges crawl and pipelines history

* ingestion service that merges crawl and pipelines history

* removed MetricsHandler and added tests

* test versions

* test versions

* fixed test

* added remarks in PipelineProcessMapper.xml + fix tests

* fix ingestion history when pipeline process doesn't exist

* updated cloudera version

* changes type of steps to run to be text instead of enum

* minor improvements pipelines history

* #159 Skeleton code for Index Herbariorum synchronization

* improved response of run pipeline attempt + improved order of pipeline history

* taking basicRecordsCountAttempted as number of records for verbatimToInterpreted step

* updated versions to release (including cdh 5.12.0)

* [maven-release-plugin] prepare release registry-2.142

* [maven-release-plugin] prepare for next development iteration

* fix sorting in pipelines history

* fix sorting in pipelines history

* [maven-release-plugin] prepare release registry-2.143

* [maven-release-plugin] prepare for next development iteration

* optimized method to get ingestion history to do less queries since this method is used very often by the UI

* fix case when there are no dataset process statuses in ingestion history

* fix case when there are no dataset process statuses in ingestion history

* fix case when there are no dataset process statuses in ingestion history

* adapted classes for the http calls + entity converter + github client + extended grscicoll model

* sync staff + refactor to make it easier to test

* sync staff + refactor to make it easier to test

* added tests

* added tests

* added cliSyncApp skeleton

* removed lombok builders in entities used in WS because they need public constructor

* CliSyncApp + tests

* github issues assignees externalized to properties + fixes format diff file

* added failed actions + improvements

* fix test

* added links to entities in GH issues + mapping IH countries to our enum

* rollback test

* mapping countries from IH to our enum + gh issues links + tests

* improved country mapping

* minor fixes

* issues moved out from diff finder + issues for fails + using map for matches

* config file for tests

* changed config test

* check for duplicate codes in grscicoll + added search by code and name

* code unique

* made GrSciColl entities machine taggable

* adding identifiers manually to person in IH-sync

* removed files pushed by mistake

* removed check for duplicate codes + added numberSpecimens to collections

* removed TODO

Co-authored-by: GBIF Jenkins Bot <dev@gbif.org>
Co-authored-by: Matt Blissett <matt@blissett.me.uk>
Co-authored-by: Mikhail Podolskiy <mike.podolskiy90@gmail.com>
Co-authored-by: Federico Mendez <federicomh@gmail.com>
Co-authored-by: Nikolay Volik <nikolay.volik@hotmail.com>
Co-authored-by: Tim Robertson <timrobertson100@gmail.com>
7 people committed Jan 24, 2020
1 parent ece587c commit ed714b4
Showing 62 changed files with 5,376 additions and 577 deletions.
1 change: 1 addition & 0 deletions pom.xml
@@ -60,6 +60,7 @@
<module>registry-surety</module>
<module>registry-ws</module>
<module>registry-ws-client</module>
<module>registry-collections-sync</module>
</modules>

<!--
3 changes: 3 additions & 0 deletions registry-collections-sync/README.md
@@ -0,0 +1,3 @@
# Registry Collections Synchronisation

Provides synchronisation utilities for key repositories such as Index Herbariorum.
119 changes: 119 additions & 0 deletions registry-collections-sync/pom.xml
@@ -0,0 +1,119 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>

<parent>
<groupId>org.gbif.registry</groupId>
<artifactId>registry-motherpom</artifactId>
<version>2.144-SNAPSHOT</version>
</parent>

<artifactId>registry-collections-sync</artifactId>
<packaging>jar</packaging>

<name>Registry Collections Sync</name>

<distributionManagement>
<site>
<id>gh-pages</id>
<url>http://gbif.github.io/registry/${project.artifactId}/</url>
</site>
</distributionManagement>

<properties>
<slf4j.version>1.7.29</slf4j.version>
<jackson.version>2.10.2</jackson.version>
<retrofit.version>2.7.1</retrofit.version>
<okhttp.version>4.3.1</okhttp.version>
<okio.version>2.4.3</okio.version>
<lombok.version>1.18.10</lombok.version>
<jcommander.version>1.78</jcommander.version>
</properties>

<repositories>
<repository>
<id>gbif-all</id>
<url>http://repository.gbif.org/content/groups/gbif</url>
</repository>
<repository>
<id>gbif-thirdparty</id>
<url>http://repository.gbif.org/content/repositories/thirdparty/</url>
</repository>
</repositories>

<dependencies>
<dependency>
<groupId>org.gbif</groupId>
<artifactId>gbif-api</artifactId>
<exclusions>
<exclusion>
<groupId>org.codehaus.jackson</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-annotations</artifactId>
<version>${jackson.version}</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.dataformat</groupId>
<artifactId>jackson-dataformat-yaml</artifactId>
<version>${jackson.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>com.squareup.retrofit2</groupId>
<artifactId>retrofit</artifactId>
<version>${retrofit.version}</version>
</dependency>
<dependency>
<groupId>com.squareup.retrofit2</groupId>
<artifactId>converter-jackson</artifactId>
<version>${retrofit.version}</version>
</dependency>
<dependency>
<groupId>com.squareup.okhttp3</groupId>
<artifactId>okhttp</artifactId>
<version>${okhttp.version}</version>
</dependency>
<dependency>
<groupId>com.squareup.okio</groupId>
<artifactId>okio</artifactId>
<version>${okio.version}</version>
</dependency>
<dependency>
<groupId>commons-beanutils</groupId>
<artifactId>commons-beanutils</artifactId>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>${lombok.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.beust</groupId>
<artifactId>jcommander</artifactId>
<version>${jcommander.version}</version>
</dependency>


<!-- test -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<scope>test</scope>
</dependency>

</dependencies>
</project>
@@ -0,0 +1,107 @@
package org.gbif.registry.collections.sync;

import org.gbif.api.model.collections.Collection;
import org.gbif.api.model.collections.Institution;
import org.gbif.api.model.collections.Person;
import org.gbif.registry.collections.sync.diff.*;
import org.gbif.registry.collections.sync.grscicoll.GrSciCollHttpClient;
import org.gbif.registry.collections.sync.ih.IHHttpClient;
import org.gbif.registry.collections.sync.ih.IHInstitution;

import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.CompletableFuture;

import com.beust.jcommander.JCommander;
import com.beust.jcommander.Parameter;
import lombok.extern.slf4j.Slf4j;

import static org.gbif.registry.collections.sync.diff.DiffResult.FailedAction;

@Slf4j
public class CliSyncApp {

public static void main(String[] args) {
// parse args
CliArgs cliArgs = new CliArgs();
JCommander.newBuilder().addObject(cliArgs).build().parse(args);

SyncConfig config =
SyncConfig.fromFileName(cliArgs.confPath)
.orElseThrow(() -> new IllegalArgumentException("No valid config provided"));

// load the data from the WS
log.info("Loading IH");
IHHttpClient ihHttpClient = IHHttpClient.create(config.getIhWsUrl());
CompletableFuture<List<IHInstitution>> ihInstitutionsFuture =
CompletableFuture.supplyAsync(ihHttpClient::getInstitutions);

GrSciCollHttpClient grSciCollHttpClient = GrSciCollHttpClient.create(config);
log.info("Loading Institutions");
CompletableFuture<List<Institution>> institutionsFuture =
CompletableFuture.supplyAsync(grSciCollHttpClient::getInstitutions);

log.info("Loading Collections");
CompletableFuture<List<Collection>> collectionsFuture =
CompletableFuture.supplyAsync(grSciCollHttpClient::getCollections);

log.info("Loading Persons");
CompletableFuture<List<Person>> personsFuture =
CompletableFuture.supplyAsync(grSciCollHttpClient::getPersons);

CompletableFuture.allOf(
ihInstitutionsFuture, institutionsFuture, collectionsFuture, personsFuture)
.join();

List<IHInstitution> ihInstitutions = ihInstitutionsFuture.join();
List<Institution> institutions = institutionsFuture.join();
List<Collection> collections = collectionsFuture.join();
List<Person> persons = personsFuture.join();

// create an entity converter to use in the diff finder process
EntityConverter entityConverter =
EntityConverter.builder()
.countries(ihHttpClient.getCountries())
.creationUser(config.getRegistryWsUser())
.build();

// look for differences
log.info("Looking for differences");
DiffResult diffResult =
IndexHerbariorumDiffFinder.builder()
.ihInstitutions(ihInstitutions)
.ihStaffFetcher(ihHttpClient::getStaffByInstitution)
.institutions(institutions)
.collections(collections)
.persons(persons)
.entityConverter(entityConverter)
.build()
.find();

// handle results
List<FailedAction> fails =
DiffResultHandler.builder()
.diffResult(diffResult)
.config(config)
.grSciCollHttpClient(grSciCollHttpClient)
.build()
.handle();

// add fails to result
log.info("{} operations failed updating the registry", fails.size());
diffResult.setFailedActions(fails);

log.info("Diff result: {}", diffResult);

// save results to a file
if (config.isSaveResultsToFile()) {
DiffResultExporter.exportResultsToFile(
diffResult, Paths.get("ih_sync_result_" + System.currentTimeMillis()));
}
}

private static class CliArgs {
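// path to the YAML file that is read by SyncConfig.fromFileName
@Parameter(names = {"--config", "-c"}, description = "Path to the YAML config file")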
private String confPath;
}
}
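
The loading step in `CliSyncApp` starts four independent HTTP fetches with `CompletableFuture.supplyAsync` and waits for all of them with `allOf(...).join()` before reading the results. The following is a minimal sketch of that pattern only, not the module's code: `ParallelLoadSketch`, `Loaded` and the two `Supplier` parameters are hypothetical stand-ins for the HTTP clients, and plain strings replace the entity classes.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical stand-in for the parallel loading step; not part of the commit.
class ParallelLoadSketch {

  // Simple holder for the results once both loads have finished.
  static final class Loaded {
    final List<String> ihInstitutions;
    final List<String> institutions;

    Loaded(List<String> ihInstitutions, List<String> institutions) {
      this.ihInstitutions = ihInstitutions;
      this.institutions = institutions;
    }
  }

  static Loaded loadAll(
      Supplier<List<String>> ihLoader, Supplier<List<String>> registryLoader) {
    // Both suppliers start running on the common fork-join pool right away.
    CompletableFuture<List<String>> ihFuture = CompletableFuture.supplyAsync(ihLoader);
    CompletableFuture<List<String>> registryFuture =
        CompletableFuture.supplyAsync(registryLoader);

    // Blocks until both futures are done; if either loader threw,
    // join() rethrows the failure wrapped in a CompletionException.
    CompletableFuture.allOf(ihFuture, registryFuture).join();

    // The futures are already complete here, so these joins return immediately.
    return new Loaded(ihFuture.join(), registryFuture.join());
  }
}
```

With this shape the total wait is roughly that of the slowest fetch rather than the sum of all of them. A failure in any loader aborts the run at the `allOf(...).join()` call.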
@@ -0,0 +1,100 @@
package org.gbif.registry.collections.sync;

import java.io.File;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.List;
import java.util.Optional;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.ObjectReader;
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory;
import com.google.common.base.Strings;
import lombok.Getter;
import lombok.Setter;
import lombok.extern.slf4j.Slf4j;

@Getter
@Setter
@Slf4j
public class SyncConfig {

private static final ObjectMapper YAML_MAPPER = new ObjectMapper(new YAMLFactory());
private static final ObjectReader YAML_READER = YAML_MAPPER.readerFor(SyncConfig.class);

private String registryWsUrl;
private String registryWsUser;
private String registryWsPassword;
private String ihWsUrl;
private NotificationConfig notification;
private boolean saveResultsToFile;
private boolean dryRun;
private boolean sendNotifications;

@Getter
@Setter
public static class NotificationConfig {
private String githubWsUrl;
private String githubUser;
private String githubPassword;
private String ihPortalUrl;
private String registryPortalUrl;
private List<String> ghIssuesAssignees;
}

public static Optional<SyncConfig> fromFileName(String configFileName) {
if (Strings.isNullOrEmpty(configFileName)) {
log.error("No config file provided");
return Optional.empty();
}

File configFile = Paths.get(configFileName).toFile();
SyncConfig config;
try {
config = YAML_READER.readValue(configFile);
} catch (IOException e) {
log.error("Couldn't load config from file {}", configFileName, e);
return Optional.empty();
}

if (config == null) {
return Optional.empty();
}

// do some checks for required fields
if (Strings.isNullOrEmpty(config.getRegistryWsUrl())
|| Strings.isNullOrEmpty(config.getIhWsUrl())) {
throw new IllegalArgumentException("Registry and IH WS URLs are required");
}

if (!config.isDryRun()
&& (Strings.isNullOrEmpty(config.getRegistryWsUser())
|| Strings.isNullOrEmpty(config.getRegistryWsPassword()))) {
throw new IllegalArgumentException(
"Registry WS credentials are required if we are not doing a dry run");
}

if (config.isSendNotifications()) {
if (config.getNotification() == null) {
throw new IllegalArgumentException("Notification config is required");
}

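String githubWsUrl = config.getNotification().getGithubWsUrl();
// check for null first so a missing URL gives a clear error instead of a NullPointerException
if (Strings.isNullOrEmpty(githubWsUrl) || !githubWsUrl.endsWith("/")) {
throw new IllegalArgumentException("Github API URL is required and must finish with a /.");
}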

if (Strings.isNullOrEmpty(config.getNotification().getGithubUser())
|| Strings.isNullOrEmpty(config.getNotification().getGithubPassword())) {
throw new IllegalArgumentException(
"Github credentials are required if we are sending notifications.");
}

if (Strings.isNullOrEmpty(config.getNotification().getRegistryPortalUrl())
|| Strings.isNullOrEmpty(config.getNotification().getIhPortalUrl())) {
throw new IllegalArgumentException("Portal URLs are required");
}
}

return Optional.of(config);
}
}
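
The checks in `SyncConfig.fromFileName` depend on two flags: registry credentials are only required when `dryRun` is false, and the GitHub settings are only inspected when `sendNotifications` is true. The dependency-free sketch below restates a subset of those rules so they can be tried in isolation. `ConfigCheckSketch` and `firstProblem` are hypothetical names, the URLs in the usage note are placeholders, and treating a missing GitHub URL the same as one without a trailing slash is an assumption of the sketch.

```java
import java.util.Optional;

// Hypothetical restatement of some SyncConfig checks; not part of the commit.
class ConfigCheckSketch {

  static boolean isNullOrEmpty(String s) {
    return s == null || s.isEmpty();
  }

  // Returns the first validation problem found, or empty if the settings pass.
  static Optional<String> firstProblem(
      String registryWsUrl,
      String ihWsUrl,
      boolean dryRun,
      String registryWsUser,
      String registryWsPassword,
      boolean sendNotifications,
      String githubWsUrl) {
    // Both web service URLs are always required.
    if (isNullOrEmpty(registryWsUrl) || isNullOrEmpty(ihWsUrl)) {
      return Optional.of("Registry and IH WS URLs are required");
    }
    // Credentials only matter when the run will write to the registry.
    if (!dryRun && (isNullOrEmpty(registryWsUser) || isNullOrEmpty(registryWsPassword))) {
      return Optional.of("Registry WS credentials are required if we are not doing a dry run");
    }
    // The GitHub URL is only inspected when notifications are enabled.
    if (sendNotifications && (isNullOrEmpty(githubWsUrl) || !githubWsUrl.endsWith("/"))) {
      return Optional.of("Github API URL must finish with a /.");
    }
    return Optional.empty();
  }
}
```

For example, `firstProblem("https://registry.example.org/", "https://ih.example.org/", true, null, null, false, null)` returns an empty `Optional`, because a dry run without notifications needs neither credentials nor GitHub settings.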
