Learning-from-our-past/kaira-core

Kaira-core

Main module containing logic for data extraction and command line interface.

Dependencies

  • Python 3

Setup

Nix

If you use Nix, you can install most dependencies with nix-direnv. After that, you only need to follow the venv/pip installation steps below.

No Nix

The codebase has been formatted with black and brought into compliance with PEP 8. The reformatting resulted in two commits that changed many lines, which can make it unnecessarily hard to use git blame (and blame integrations in IDEs) to look into the project's history. To work around this, the hashes of the reformatting commits are listed in .git-blame-ignore-revs. To configure git to use that file when running git blame: git config blame.ignoreRevsFile .git-blame-ignore-revs

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp scripts/pre-commit .git/hooks

If you want to chunk the HTML files with duplicate filtering, you will also need ssdeep. The Python package is installed through pip, but it also requires ssdeep and its development libraries on your system, which can be installed with apt:

sudo apt-get install ssdeep libfuzzy-dev libffi-dev python3-dev

More on ssdeep installation can be found here

If you need to generate the XML files with the CoNLL-U/NLP data, you will need to perform the nlp-setup step.

NB: The NLP setup is very outdated as of 2022-08 and is due to be redone/updated. This notice will be removed when it is.

inv nlp-setup  # NOTE: you need to have Java (e.g. openjdk) installed for this to work

Note that the ssdeep pip package seems to be difficult to install on macOS, since according to its documentation it was tested only on Linux systems. On macOS, skip that dependency and install the other packages from requirements.txt. Everything other than the chunking and duplicate-filtering code will work, and the affected tests are skipped when ssdeep is not available.
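The optional-dependency behaviour described above can be sketched as follows. This is a minimal illustration of the general pattern, not code taken from kaira-core: the names fuzzy_hash and DuplicateFilterTest are hypothetical, and the project's own guard may look different. The only ssdeep call used is ssdeep.hash from the python-ssdeep package.

```python
import unittest

# Import ssdeep if it is installed; otherwise fall back to None so the rest
# of the module still loads (e.g. on macOS where the package was skipped).
try:
    import ssdeep
except ImportError:
    ssdeep = None


def fuzzy_hash(text):
    """Hypothetical helper: return an ssdeep fuzzy hash, or None without ssdeep."""
    if ssdeep is None:
        return None
    return ssdeep.hash(text)


class DuplicateFilterTest(unittest.TestCase):
    """Hypothetical test case that depends on ssdeep."""

    @unittest.skipUnless(ssdeep is not None, "ssdeep is not installed")
    def test_hash_is_produced(self):
        # Runs only when ssdeep is available; reported as skipped otherwise.
        self.assertIsInstance(fuzzy_hash("some text " * 50), str)
```

With this shape, running the test suite on a machine without ssdeep reports the test as skipped rather than failed, which matches the behaviour the note describes.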

Attribution

Please cite if you use this software or datasets generated by it in your research:

T. Salmi, L. Kallioniemi, J. Loehr. Kaira-core [computer software]. Lammi Biological Station, 2022. Available at https://github.com/Tumetsu/Kaira

About

Data extraction from Finnish person catalogues