Electronic Babylonian Library API



Setup

Requirements:

pip install poetry
poetry install

If the libcst installation fails (because no binary wheel is available for Linux/Windows + PyPy), you may need to install the Rust compiler to solve it.

The following are needed to run the application:

Auth0

An API and an Application have to be set up in Auth0, and the API needs to have the scopes listed below.

The API Identifier, Application Domain (or the custom domain if one is used), and Application Signing Certificate are needed for the environment variables (see below). The whole certificate (everything in the field, or the downloaded PEM file) has to be base64 encoded before being added to the environment variable.
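The encoding can be done, e.g., with Python's standard library (the PEM content below is a placeholder, not a real certificate):

```python
import base64

# The whole certificate, including the BEGIN/END lines, must be encoded.
pem = b"-----BEGIN CERTIFICATE-----\nMIIB...\n-----END CERTIFICATE-----\n"
encoded = base64.b64encode(pem).decode("ascii")
print(encoded)  # Paste this value into the AUTH0_PEM environment variable.
```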

Scopes

Corpus: write:texts, create:texts,

Fragmentarium: lemmatize:fragments, transliterate:fragments, annotate:fragments,

Bibliography: write:bibliography,

Dictionary: write:words,

Legacy (currently unused) scopes

access:beta, read:texts, read:fragments, read:bibliography, read:words,

Folio scopes have the following format:

read:<Folio name>-folios

Fragments have additional scopes in the following format:

read:<Fragment group>-fragments
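A minimal sketch of how such scope names can be composed and checked (the folio and group names here are invented for illustration):

```python
def folio_scope(folio_name: str) -> str:
    # Folio scopes have the format read:<Folio name>-folios.
    return f"read:{folio_name}-folios"

def fragment_scope(group: str) -> str:
    # Fragment group scopes have the format read:<Fragment group>-fragments.
    return f"read:{group}-fragments"

# A hypothetical user with a folio scope and a fragment permission.
user_scopes = {folio_scope("WGL"), "transliterate:fragments"}
print(folio_scope("WGL") in user_scopes)  # True
```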

Rules

"Add permissions to the user object" must be set in the Authorization Extension and the rule published. The following rules should be added to the Auth Pipeline.

eBL name:

function (user, context, callback) {
  const namespace = 'https://ebabylon.org/';
  context.idToken[namespace + 'eblName'] = user.user_metadata.eblName;
  callback(null, user, context);
}

Access token scopes (must be after the Authorization Extension's rule):

function (user, context, callback) {
  const permissions = user.permissions || [];
  const requestedScopes = context.request.body.scope || context.request.query.scope;
  context.accessToken.scope = requestedScopes
    .split(' ')
    .filter(scope => scope.indexOf(':') < 0)
    .concat(permissions)
    .join(' ');

  callback(null, user, context);
}

Users

The users should have an eblName property in the user_metadata, e.g.:

{
  "eblName": "Surname"
}

Sentry

An organization and a project need to be set up in Sentry. The DSN under Client Keys is needed for the environment variables (see below).

Development

The project comes with a Gitpod configuration including select extensions and a local MongoDB. Click the button below, configure the environment variables (and import the data if you wish to use the local DB), and you are good to go.

Open in Gitpod

Running the tests

task format  # Format all files.
task format -- --check  # Check file formatting.
task lint  # Run linter.
task type  # Run type check.
task test  # Run tests.
task test -- -n auto  # Run tests in parallel.
task test -- --cov=ebl --cov-report term --cov-report xml  # Run tests with coverage (slow in PyPy).
task test-all  # Run format, lint and type checks, and tests.

See the pytest-xdist documentation for more information on parallel tests. To avoid race conditions when running the tests in parallel, run poetry run python -m ebl.tests.downloader.

⚠️ Sometimes test results may differ for PyPy and non-PyPy Python (the latter is used for some automatic checks in this repository). If tests fail with non-PyPy Python alone, make sure to install and use the same Python version for debugging.

Custom Git Shortcut

task cp --- commit-message  # Runs black, flake8, and pyre-check, then git add, commit, and push.

Codestyle

Use Black codestyle and PEP8 naming conventions. Line length is 88, and bugbear B950 is used instead of E501. PEP8 checks should be enabled in PyCharm, but E501, E203, and E231 should be disabled.
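A flake8 configuration matching these rules might look like the sketch below (an illustration; the repository's actual configuration may differ):

```ini
[flake8]
max-line-length = 88
# bugbear's B950 replaces E501; E203 and E231 conflict with Black.
extend-select = B950
extend-ignore = E501, E203, E231
```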

Use type hints in new code and add them to old code when making changes.

Package dependencies

  • Avoid directed package dependency cycles.
  • Domain packages should depend only on other domain packages.
  • Application packages should depend only on application and domain packages.
  • Web, infrastructure, etc. should depend only on application and domain packages.
  • All packages can depend on common modules in the top-level ebl package.
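These rules can be illustrated with a minimal sketch (all module responsibilities, names, and functions here are invented for illustration):

```python
# Domain: pure business logic, depends only on other domain code.
def normalize_number(number: str) -> str:
    return number.strip().upper()

# Application: orchestrates use cases, may depend on domain.
def find_fragment(number: str, repository) -> dict:
    return repository.query(normalize_number(number))

# Web/infrastructure: adapts the outside world; may depend on
# application and domain, but never the other way around.
class InMemoryRepository:
    def __init__(self, fragments):
        self._fragments = fragments

    def query(self, number: str) -> dict:
        return self._fragments[number]

repository = InMemoryRepository({"K.1": {"number": "K.1"}})
print(find_fragment(" k.1 ", repository))  # {'number': 'K.1'}
```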

Dependencies can be analyzed with pydepgraph:

pydepgraph -p . -e tests -g 2 | dot -Tpng -o graph.png

Database

See dictionary-parser, proper-name-importer, fragmentarium-parser, and sign-list-parser about generating the initial data. There have been changes to the database structure since the scripts were initially used, and they most likely require updates to work with the latest version of the API.

The pull-db.sh script can be used to pull a database from another MongoDB instance to your development MongoDB. It uses mongodump and mongorestore to get all data except the changelog collection and the photos and folios buckets.

To make use less tedious, the script reads defaults from the following environment variables:

PULL_DB_DEFAULT_SOURCE_HOST=<source MongoDB host>
PULL_DB_DEFAULT_SOURCE_USER=<source MongoDB user>
PULL_DB_DEFAULT_SOURCE_PASSWORD=<source MongoDB password>

The tests use pymongo_inmemory. Depending on your OS, it might be necessary to configure it to get the correct version of MongoDB. E.g. for Ubuntu, add the following environment variables:

PYMONGOIM__MONGO_VERSION=4.4
PYMONGOIM__OPERATING_SYSTEM=ubuntu

Caching

Falcon-Caching middleware can be used for caching. See the documentation for more information. Configuration is read from CACHE_CONFIG environment variable.

CACHE_CONFIG='{"CACHE_TYPE": "simple"}' poetry run waitress-serve --port=8000 --call ebl.app:get_app

Falcon-Caching v1.0.1 does not cache media; resp.text must be used instead.

@cache.cached(timeout=DEFAULT_TIMEOUT)
def on_get(self, req, resp):
    resp.text = ...

cache-control decorator can be used to add Cache-Control header to responses.

@cache_control(['public', 'max-age=600'])
def on_get(self, req, resp):
    ...

A method to control when the header is added can be passed as the second argument.

@cache_control(['public', 'max-age=600'], lambda req, resp: req.auth is None)
def on_get(self, req, resp):
    ...

Authentication and Authorization

Auth0 and falcon-auth are used for authentication and authorization.

An endpoint can be protected using the @falcon.before decorator in three ways:

  • @falcon.before(require_scope, "your scope name here"): A simple check whether the user is allowed to use the endpoint. Dynamic checks based on the fetched data are not possible.
  • @falcon.before(require_folio_scope): Dynamically checks if the user can read folios based on the folio name from the URL.
  • @falcon.before(require_fragment_read_scope): Dynamically checks if the user can read individual fragments by comparing the authorized_scopes from the fragment with the user scopes.

For example:

import falcon
from ebl.users.web.require_scope import require_scope, require_fragment_read_scope

@falcon.before(require_fragment_read_scope)
def on_get(self, req, resp):
    ...

@falcon.before(require_scope, "write:texts")
def on_post(self, req, resp):
    ...

Running the application

The application reads the configuration from the following environment variables:

AUTH0_AUDIENCE=<the Identifier from Auth0 API Settings>
AUTH0_ISSUER=<the Domain from the Auth0 Application Settings, or the custom domain from Branding>
AUTH0_PEM=<Signing Certificate (PEM) from the Auth0 Application Advanced Settings. The whole certificate needs to be base64 encoded before being added to the environment.>
MONGODB_URI=<MongoDB connection URI with database>
MONGODB_DB=<MongoDB database. Optional, authentication database will be used as default.>
EBL_AI_API=<AI API URL. If you do not have access to the AI API and do not need it, use a safe dummy value.>
SENTRY_DSN=<Sentry DSN>
SENTRY_ENVIRONMENT=<development or production>
CACHE_CONFIG=<Falcon-Caching configuration. Optional, Null backend will be used as default.>
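A sketch of how the optional variables can be read with defaults (the helper function is an illustration, not the application's actual startup code; the "null" backend name follows Flask-Caching conventions and is an assumption):

```python
import json
import os

def read_cache_config() -> dict:
    # CACHE_CONFIG is optional; fall back to a no-op cache backend.
    raw = os.environ.get("CACHE_CONFIG")
    return json.loads(raw) if raw else {"CACHE_TYPE": "null"}

os.environ["CACHE_CONFIG"] = '{"CACHE_TYPE": "simple"}'
print(read_cache_config())  # {'CACHE_TYPE': 'simple'}
```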

Poetry does not support .env files. The environment variables need to be configured in the shell, unless the application is run via Task. Alternatively, an external program can be used to handle the file, e.g. direnv or Set-PsEnv.

Locally

task start
# or
poetry run waitress-serve --port=8000 --call ebl.app:get_app

Docker image

Build and run the docker image:

docker build -t ebl/api .
docker run -p 8000:8000 --rm -it --env-file=FILE --name ebl-api ebl/api

If you need to run custom operations inside Docker you can start the shell:

docker run --rm -it --env-file=.env --name ebl-shell --mount type=bind,source="$(pwd)",target=/usr/src/ebl ebl/api bash

Docker Compose

Build the images:

docker-compose build

Run only the API:

docker-compose -f ./docker-compose-api-only.yml up

Run the full backend including the database and admin interface:

docker-compose up

⚠️ You must create a script creating the MongoDB user in ./docker-entrypoint-initdb.d/create-users.js before the database is started for the first time.

db.createUser(
  {
    user: "ebl-api",
    pwd: "<password>",
    roles: [
       { role: "readWrite", db: "ebl" }
    ]
  }
)

In addition to the variables specified above, the following environment variables are needed:

MONGODB_URI=mongodb://ebl-api:<password>@mongo:27017/ebl
MONGO_INITDB_ROOT_USERNAME=<Mongo root user>
MONGO_INITDB_ROOT_PASSWORD=<Mongo root user password>
MONGOEXPRESS_LOGIN=<Mongo Express login username>
MONGOEXPRESS_PASSWORD=<Mongo Express login password>

Updating data

Changes to the schemas or parsers can cause the data in the database to become obsolete. Below are instructions on how to migrate the Fragmentarium and Corpus to the latest state.

Fragmentarium

Improving the parser can cause existing transliterations to contain obsolete tokens or become invalid. The signs are calculated when a fragment is saved, but if the sign list is updated, the fragments are not updated automatically.

The ebl.fragmentarium.update_fragments module can be used to recreate transliteration and signs in all fragments. A list of invalid fragments is saved to invalid_fragments.tsv.

The script can be run locally:

poetry run python -m ebl.fragmentarium.update_fragments

as a standalone container:

docker build -t ebl/api .
docker run --rm -it --env-file=.env --name ebl-updater ebl/api poetry run python -m ebl.fragmentarium.update_fragments

or with docker-compose:

docker-compose -f ./docker-compose-updater.yml up

Corpus

The ebl.corpus.update_texts module can be used to save the texts with the latest schema. A list of invalid texts is saved to invalid_texts.tsv. The script saves the texts as is. Transliterations are not reparsed.

The script can be run locally:

poetry run python -m ebl.corpus.update_texts

as a standalone container:

docker build -t ebl/api .
docker run --rm -it --env-file=.env --name ebl-corpus-updater ebl/api poetry run python -m ebl.corpus.update_texts

Alignment

The ebl.alignment.align_fragmentarium module can be used to align all fragments in the Fragmentarium with the Corpus. The script accepts the following arguments:

-h, --help                     show this help message and exit
-s SKIP, --skip SKIP           Number of fragments to skip.
-l LIMIT, --limit LIMIT        Number of fragments to align.
--minScore MIN_SCORE           Minimum score to show in the results.
--maxLines MAX_LINES           Maximum size of fragment to align.
-o OUTPUT, --output OUTPUT     Filename for saving the results.
-w WORKERS, --workers WORKERS  Number of parallel workers.
-t, --threads                  Use threads instead of processes for workers.
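The options above can be mirrored with a minimal argparse sketch (an illustration of how the arguments parse, not the module's actual parser; the argument types are assumptions):

```python
import argparse

# A parser mirroring the documented align_fragmentarium options.
parser = argparse.ArgumentParser()
parser.add_argument("-s", "--skip", type=int, default=0,
                    help="Number of fragments to skip.")
parser.add_argument("-l", "--limit", type=int,
                    help="Number of fragments to align.")
parser.add_argument("--minScore", type=int, dest="min_score",
                    help="Minimum score to show in the results.")
parser.add_argument("--maxLines", type=int, dest="max_lines",
                    help="Maximum size of fragment to align.")
parser.add_argument("-o", "--output",
                    help="Filename for saving the results.")
parser.add_argument("-w", "--workers", type=int,
                    help="Number of parallel workers.")
parser.add_argument("-t", "--threads", action="store_true",
                    help="Use threads instead of processes for workers.")

args = parser.parse_args(
    ["-s", "100", "-l", "500", "--minScore", "10", "-o", "results.tsv"]
)
print(args.skip, args.limit, args.min_score, args.output)  # 100 500 10 results.tsv
```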

The script can be run locally:

poetry run python -m ebl.alignment.align_fragmentarium

or as a standalone container:

docker build -t ebl/api .
docker run --rm -it --env-file=.env --name ebl-corpus-updater ebl/api poetry run python -m ebl.alignment.align_fragmentarium

Steps to update the production database

  1. Implement the new functionality.
  2. Implement fallback to handle old data, if the new model is incompatible.
  3. Test that fragments are updated correctly in the development database.
  4. Deploy to production.
  5. Run the migration script. Do not start the script until the deployment has been successfully completed.
  6. Fix invalid fragments.
  7. Remove fallback logic.
  8. Deploy to production.

Importing .atf files

Imports and converts external .atf files, encoded according to the Oracc and C-ATF standards, to the eBL-ATF standard.

To run use:

poetry run python -m ebl.atf_importer.application.atf_importer [-h] -i INPUT -g GLOSSARY -l LOGDIR [-a] [-s]

Command line options

  • -h : shows the help message and exits the script.

  • -i INPUT, --input INPUT : Path of the input directory (required).

  • -l LOGDIR, --logdir LOGDIR : Path of the log files directory (required).

  • -g GLOSSARY, --glossary GLOSSARY : Path to the glossary file (required).

  • -a AUTHOR, --author AUTHOR : Name of the author of the imported fragments. If not specified, a name needs to be entered manually for every fragment (optional).

  • -s STYLE, --style STYLE : Import style, one of Oracc ATF, Oracc C-ATF, or CDLI. If omitted, defaults to Oracc ATF (optional).

  • The importer always tries to import all .atf files from the given input folder (-i). For every imported folder a glossary file must be specified via -g. The import style can be set via the optional -s option. An author can be assigned to all fragments imported in one run via the -a option; if -a is omitted, the importer will ask for an author for each imported fragment.

Example calls:

poetry run python -m ebl.atf_importer.application.atf_importer -i "ebl/atf_importer/input/" -l "ebl/atf_importer/logs/" -g  "ebl/atf_importer/glossary/akk-x-stdbab.glo" -a "atf_importer"
poetry run python -m ebl.atf_importer.application.atf_importer -i "ebl/atf_importer/input_cdli_atf/" -l "ebl/atf_importer/logs/" -g  "ebl/atf_importer/glossary/akk-x-stdbab.glo" -a "test" -s "CDLI"
poetry run python -m ebl.atf_importer.application.atf_importer -i "ebl/atf_importer/input_c_atf/" -l "ebl/atf_importer/logs/" -g  "ebl/atf_importer/glossary/akk-x-stdbab.glo" -a "test" -s "Oracc C-ATF"

Troubleshooting

If a fragment cannot be imported, check the console output for errors. Also check the specified log folder (error_lines.txt, unparseable_lines_[fragment_file].txt, not_imported.txt) to see which lines could not be parsed. If lines are faulty, fix them manually and retry the import. If tokens are not lemmatized correctly, check the log file not_lemmatized.txt.

Acknowledgements

CSL-JSON schema is based on citation-style-language/schema Copyright (c) 2007-2018 Citation Style Language and contributors. Licensed under MIT License.
