Query API service for Event Data. Ingests from another source (e.g. another Query API, Event Bus or archive) and allows a number of queries to be run. Backed by ElasticSearch.
This codebase is used internally in Crossref Event Data, but you can easily run it yourself to replicate Event Data, or a subset of it, into your own database.
Provided as a Docker image for deployment. Docker Compose is used for testing.
- `from-occurred-date` - as YYYY-MM-DD
- `until-occurred-date` - as YYYY-MM-DD
- `from-collected-date` - as YYYY-MM-DD
- `until-collected-date` - as YYYY-MM-DD
- `subj-id` - quoted URL or a DOI
- `obj-id` - quoted URL or a DOI
- `subj-id.prefix` - DOI prefix like 10.5555
- `obj-id.prefix` - DOI prefix like 10.5555
- `subj-id.domain` - domain of the subj_id, e.g. en.wikipedia.org
- `obj-id.domain` - domain of the obj_id, e.g. en.wikipedia.org
- `subj.url` - quoted full URL
- `obj.url` - quoted full URL
- `subj.url.domain` - domain of the optional subj.url, if present, e.g. en.wikipedia.org
- `obj.url.domain` - domain of the optional obj.url, if present, e.g. en.wikipedia.org
- `subj.alternative-id` - optional subj.alternative-id
- `obj.alternative-id` - optional obj.alternative-id
- `relation` - relation type ID
- `source` - source ID
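Filters are combined into a single comma-separated `filter` query parameter of `name:value` pairs, the same form used by the default replication URLs later in this document. A sketch of building such a query URL (the filter values here are illustrative, not special):

```python
from urllib.parse import urlencode

# Hypothetical query: Events collected since 2018-01-01 whose subject is a
# page on en.wikipedia.org. Filter names come from the list above.
filters = {
    "from-collected-date": "2018-01-01",
    "subj-id.domain": "en.wikipedia.org",
}

# Combine filters into one comma-separated "filter" parameter.
filter_param = ",".join(f"{name}:{value}" for name, value in filters.items())

url = "https://query.eventdata.crossref.org/events?" + urlencode(
    {"filter": filter_param, "rows": 100}
)
print(url)
```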
Anyone can run this as a replica, against the Crossref Query API or against another replica.
- `lein run server` - run the server
- `lein run replicate-continuous` - run automatic continuous replication from another Query API instance, from now onward
- `lein run replicate-backfill-days «days»` - backfill from a number of days in the past
- `lein run add-indexes` - one-off; ensure that all indexes are present
If you want to run a standard setup you should run `server` and `replicate-continuous`, which will each run and keep running. The first time you run `replicate-continuous`, or if there has been an outage, you should run `replicate-backfill-days` to catch up.
Replication mode makes two types of queries to the upstream Query API: one for getting newly occurring Events, and one for getting newly updated Events (which may have originally occurred at any point in time). These queries are supplied in the form of a templated URL. If you want to replicate all of the available data you can leave the defaults. If you only want to replicate a subset of the data, e.g. for a given DOI prefix or source ID, you can supply a custom URL with a filter.
Replication occurs according to an internal schedule at 5am every day, UTC.
The default values are (note the `%1$s` and `%2$s` string substitutions):
REPLICA_COLLECTED_URL=https://query.eventdata.crossref.org/events?filter=from-collected-date:%1$s&cursor=%2$s&rows=10000
REPLICA_UPDATED_URL=https://query.eventdata.crossref.org/events?filter=from-updated-date:%1$s&cursor=%2$s&rows=10000
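The `%1$s` and `%2$s` placeholders use Java-style positional format syntax (argument index, then `$s` for a string). A minimal sketch of how a replica might expand such a template, using plain string replacement for illustration:

```python
def expand_template(template: str, start_date: str, cursor: str) -> str:
    # %1$s = start date, %2$s = cursor (Java String.format positional syntax).
    # Plain replacement is enough here because each placeholder appears once.
    return template.replace("%1$s", start_date).replace("%2$s", cursor)


REPLICA_COLLECTED_URL = (
    "https://query.eventdata.crossref.org/events"
    "?filter=from-collected-date:%1$s&cursor=%2$s&rows=10000"
)
print(expand_template(REPLICA_COLLECTED_URL, "2018-01-01", ""))
```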
The following methods are only for Crossref internal use as they depend on access-controlled internal resources.
- `lein run server` - run the server
- `lein run queue-continuous` - run automatic continuous replication via a Kafka queue
- `lein run bus-backfill-days «days»` - backfill from a number of days in the past from the Event Bus archive
- `lein run bus-backfill-days-from «date» «days»` - backfill from a number of days in the past from the Event Bus archive, from and including the given date
- `lein run add-indexes` - one-off; ensure that all indexes are present
If the mappings (i.e. fields that ElasticSearch indexes) change, you need to run:
- `lein run update-mappings` - update the ElasticSearch mappings
Because we may receive data for more sources than we wish to store, a whitelist can be provided. This should be the name of a Crossref Artifact, e.g. `crossref-sourcelist`. The whitelist is applied on ingestion, so data must be backfilled if it was discarded due to a previous whitelist value.
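In effect the whitelist is a set-membership test applied to each incoming Event's source at ingestion time. A sketch under assumed names (`source_id` mirrors the Event Data Event field of that name; in the real service the allowed set would be fetched from the named Crossref Artifact):

```python
def apply_whitelist(events: list, allowed_sources: set) -> list:
    # Keep only Events whose source is whitelisted; everything else is
    # discarded at ingestion time and never reaches the index.
    return [e for e in events if e.get("source_id") in allowed_sources]


events = [{"source_id": "wikipedia"}, {"source_id": "unknown-source"}]
print(apply_whitelist(events, {"wikipedia", "twitter"}))
```

Because rejected Events are never stored, widening the whitelist later requires a backfill to recover previously discarded data, as noted above.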
Server
docker-compose -f docker-compose.yml run -w /usr/src/app --service-ports test lein run server
REPL
docker-compose -f docker-compose.yml run -w /usr/src/app --service-ports test /bin/bash -c "stty sane && lein repl"
To run tests
docker-compose -f docker-compose.yml run -w /usr/src/app test lein test
In all cases:
Environment variable | Description |
---|---|
`QUERY_DEPLOYMENT` | Optional. Prefix for ElasticSearch indexes, to allow for multiple index instances per ES cluster. |
`QUERY_ELASTIC_URI` | Connection URI for ElasticSearch, e.g. `http://127.0.0.1:9200` |
Running server:
Environment variable | Description |
---|---|
`QUERY_PORT` | Port to listen on |
Running as a replica:
Environment variable | Description |
---|---|
`QUERY_REPLICA_COLLECTED_URL` | Templated URL, described above. `%1$s` is the start collection date, `%2$s` is the cursor. Optional, with default. |
`QUERY_REPLICA_UPDATED_URL` | Templated URL, described above. `%1$s` is the start update date, `%2$s` is the cursor. Optional, with default. |
Running within Crossref:
Environment variable | Description |
---|---|
`QUERY_WHITELIST_ARTIFACT_NAME` | Name of Artifact used for the source whitelist. Optional. |
`QUERY_PREFIX_WHITELIST_ARTIFACT_NAME` | Name of Artifact used for the DOI prefix whitelist. Optional. |
`QUERY_EVENT_BUS_BASE` | Event Bus URL base for re-fill. Optional. |
`GLOBAL_ARTIFACT_URL_BASE` | Public URL of the Artifact registry. Optional. |
`QUERY_JWT` | JWT token for authenticating with the Bus. Optional. |
`QUERY_TERMS_URL` | A Terms URL to be associated with each Event. Optional. |
`GLOBAL_KAFKA_BOOTSTRAP_SERVERS` | Kafka bootstrap servers |
`GLOBAL_BUS_OUTPUT_TOPIC` | Topic on which Events coming out of the Bus appear. |
A heartbeat URL is exposed at `/heartbeat/recent`, which takes the optional query parameter `since-ms-ago`, or uses a default value. It queries for Events with timestamps since the given period of time ago, specified in milliseconds. It returns 200 if there is at least one Event in the time range, or 404 if there are none. This is a simple way to check that the Query API is continually ingesting new data.
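A monitoring probe only needs the status code. A sketch using the standard library (the base URL and the one-hour window are placeholders, not defaults of the service):

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def heartbeat_url(base_url: str, since_ms_ago: int = 3_600_000) -> str:
    # since-ms-ago is the look-back window in milliseconds (here: one hour).
    return f"{base_url}/heartbeat/recent?since-ms-ago={since_ms_ago}"


def is_ingesting(base_url: str, since_ms_ago: int = 3_600_000) -> bool:
    # 200 -> at least one Event seen in the window.
    # 404 -> no Events, so ingestion may have stalled.
    try:
        return urlopen(heartbeat_url(base_url, since_ms_ago)).status == 200
    except HTTPError as err:
        if err.code == 404:
            return False
        raise
```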
Copyright © 2017 Crossref
Distributed under the MIT License (MIT).