parapipeline

In development, all of this can change.

Parapipeline is a pipeline for POS tagging of texts in multiple languages, sentence alignment, word alignment, and transliteration.

Installation

Make sure you have following programs installed

Python 3.8 or later (not tested on Python 3.7 and earlier)
wget
All prerequisites for Hunalign
Polyglot for transliteration requires python-numpy libicu-dev. (apt-get python-numpy libicu-dev)
git-lfs

Run git lfs install. Run make to install necessary packages, compile taggers, aligners, download models, ...

The installation process is not thoroughly tested on various systems (it should work on Ubuntu 18.04), if you encounter an error it's likely caused by a missing prerequisite.

Usage

There are scripts tag, transliterate, align, wordalign and run

All scripts have the same arguments as run.

Input

All scripts expect line delimited sentences in utf-8 encoded files.

The name of these files is NAME_LANG[_ID][.ext], where NAME is arbitrary text not containing _, LANG is iso-639-3 language code, optional ID distinguished between more variants of the same text (e.g. different translations), .ext is also optional.

Inputs have to be case insensitive (if you have file NAME and Name, it will cause errors).

See files in examples folder for some example input files.

Output

run script outputs tagged texts in XML files.

And when possible also sentence and word alignment files in XML.

See examples/outputs for example output of this pipeline.

Sentence alignment

Aligned sentences are represented by link tag.

Attribute type denotes the number of sentences from source and target text.
Attribute xtargets is the alignment itself: 6 7;5 meaning sentences 6,7 from source are aligned with sentence 5 from target.

Word alignment

Each link represents an "aligned block" - aligned sentences.

Attribute xtargets contains is a space-separated list of alignments... 1:2;3:4 means that word 2 from sentence 1 in source text is aligned with word 4 in sentence 3 in the target text.

Help

usage: run.py [-h] [-o OUTPUT_DIR] N [N ...]

Run pipeline.

positional arguments:
  N                     List of files to be processed. Format: NAME_LANG[_ID][.ext], for example Hobbit_eng.txt

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Directory to which to write the output files.

Languages

Done	Language	Done	Language	Done	Language
✅	Afrikaans	✅	French	✅	Norwegian
✅	Albanian	✅	Georgian	✅	Polish
✅	Armenian	✅	German	✅	Portuguese
✅	Belarusian	✅	Hebrew	✅	Romanian
❌	Bosnian	✅	Hungarian	✅	Russian
✅	Bulgarian	✅	Italian	✅	Serbian
✅	Catalan	✅	Japanese	✅	Slovak
✅	Chinese	❌	Kashubian	✅	Slovenian
✅	Croatian	✅	Korean	✅	Spanish
✅	Czech	✅	Latvian	✅	Swedish
✅	Danish	✅	Lithuanian	✅	Turkish
✅	Dutch	❌	Lower Sorbian	✅	Ukrainian
✅	English	✅	Macedonian	✅	Upper Sorbian
✅	Estonian	✅	Modern Greek	❌	Yiddish
✅	Finnish	❌	Molise Slavic

Adding new languages

Edit .config/config.json, follow the structure of the other languages in the file to add a new one.
- For treetagger, .par file has to be in ./pipeline/taggers/treetagger/lib/.
- For UDPipe, .udpipe file has to be in ./pipeline/taggers/udpipe/models/. Note that this uses UDPipe version 1, UDPipe version 2 models will not work.

Upgrading

This section is about upgrading models

UDPipe

In order to update UDPipe models, change ./pipeline/taggers/Makefile, section models to download desired models (and extract them ...). Then change config/config.json so that each language which uses UDPipe points to correct filename.

Treetagger

Change ./pipeline/taggers/treetagger/Makefile to download version of treetagger you wish to use. You can also add scripts to download more models and so on.

Acknoledgements

UDPipe: Straka Milan, Hajič Jan, Straková Jana. UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May 2016
Treetagger: Helmut Schmid (1994): Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of International Conference on New Methods in Language Processing, Manchester, UK
Hunalign: D. Varga, L. Németh, P. Halácsy, A. Kornai, V. Trón, V. Nagy (2005). Parallel corpora for medium density languages In Proceedings of the RANLP 2005, pages 590-596.
BTagger: https://github.com/agesmundo/BTagger
Georgian Treetagger model comes from here: http://corpus.leeds.ac.uk/serge/mocky/ka.par
Eflomal: https://github.com/robertostling/eflomal

License

CC BY-NC-SA

This work mainly depends on trained UDPipe models which are licesed under CC BY-NC-SA.

Name		Name	Last commit message	Last commit date
Latest commit History 115 Commits
.github/workflows		.github/workflows
config		config
examples		examples
pipeline		pipeline
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE.md		LICENSE.md
Makefile		Makefile
README.md		README.md
align.py		align.py
run.py		run.py
tag.py		tag.py
transliterate.py		transliterate.py
wordalign.py		wordalign.py

License

pixelneo/parapipeline

Folders and files

Latest commit

History

Repository files navigation

parapipeline

Installation

Usage

Input

Output

Sentence alignment

Word alignment

Help

Languages

Adding new languages

Upgrading

UDPipe

Treetagger

Acknoledgements

License

About

Topics

Resources

License

Stars

Watchers

Forks

Languages