stheid/SetPOS

Set-Valued Part-of-Speech Tagging

This package provides set-valued POS taggers. The code relies on existing probabilistic taggers such as CoreNLP and the TreeTagger, and additionally provides two simple taggers. Information about the baseline can be found in my thesis.

Overview

  • data contains the corpus data
  • examples and scripts contain usage examples
  • setpos contains the implementation

Installation

Disclaimer: The code probably doesn't run without modifications on Windows. It should work on any standard Linux distribution.

Simple

  • Install Python package:

    $ pip install .

Complete

  • Download TreeTagger and place the binaries tree-tagger and train-tree-tagger in the setpos/tagger/treetagger folder. Make sure the executable flag is set. This code is tested with version 3.2.2.
  • Install Java version 11 (for CoreNLP)
  • Install swig-3 (for hyperopt)
  • Install Python package:

    $ pip install .[extra]
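
Before running the complete installation, the prerequisites listed above can be checked with a small script. The following is only a sketch and not part of the package: the function name `missing_prerequisites` is hypothetical, and it assumes that Java and swig are reachable on the PATH as `java` and `swig` (your distribution may name the swig executable differently). It does not verify the Java or TreeTagger version.

```python
import os
import shutil
from pathlib import Path


def missing_prerequisites(treetagger_dir, tools=("java", "swig")):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    # Binary names and folder are taken from the installation steps above.
    for name in ("tree-tagger", "train-tree-tagger"):
        binary = Path(treetagger_dir) / name
        if not binary.is_file():
            problems.append(f"missing binary: {binary}")
        elif not os.access(binary, os.X_OK):
            problems.append(f"not executable (try chmod +x): {binary}")
    # Assumption: the tools are installed under exactly these command names.
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"not found on PATH: {tool}")
    return problems


if __name__ == "__main__":
    # Run from the repository root so the relative path resolves.
    for problem in missing_prerequisites("setpos/tagger/treetagger"):
        print(problem)
```

If the script prints nothing, both binaries exist with the executable flag set and both tools were found.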

CoreNLP

The CoreNLP tagger is provided as a patched version. The patch and the packed jar are in setpos/tagger/corenlp; the patch is applied to this version.

The patch changes the following:
  • CoreNLP writes the posterior probabilities into debug files (needed for POS tagging)
  • An additional command line option allows modifying the deterministic tag expansion [thesis, 5.5.3]
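
To illustrate why the posterior probabilities are needed: a set-valued tagger works on a distribution over tags per token, not just the single best tag. The sketch below shows one simple way to turn such a distribution into a set, by keeping the most probable tags until a probability threshold is covered. This rule, the function name `smallest_covering_set`, and the example numbers are illustrative assumptions; the method actually used by this package is described in the thesis and may differ.

```python
def smallest_covering_set(posterior, threshold=0.95):
    """Return the most probable tags whose probabilities sum to at least `threshold`."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    selected, mass = [], 0.0
    # Visit tags from most to least probable and stop once enough mass is covered.
    ranked = sorted(posterior.items(), key=lambda item: item[1], reverse=True)
    for tag, prob in ranked:
        selected.append(tag)
        mass += prob
        if mass >= threshold:
            break
    # If the distribution is truncated and never reaches the threshold,
    # all available tags are returned.
    return selected


# Made-up distribution for a single token (not taken from the corpus).
example = {"NE": 0.80, "NA": 0.17, "ADJD": 0.03}
print(smallest_covering_set(example, threshold=0.95))  # ['NE', 'NA']
print(smallest_covering_set(example, threshold=0.50))  # ['NE']
```

A higher threshold yields larger, more cautious tag sets; a lower one approaches ordinary single-tag prediction.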

Data

The data stems from Intergramm, which in turn includes texts that originally stem from the ReN project and have been adapted to the Intergramm tagging guidelines. The corpus consists of historic Middle Low German texts. The versions provided here have slight modifications such as orthographic unification.

Usage

import logging

from sklearn.model_selection import LeaveOneGroupOut
import pandas as pd

from setpos.tagger import MostFrequentTag, CoreNLPTagger, TreeTagger
from setpos.data.split import load

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)

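    # load tokens, gold tags and the group label of each token
    toks, tags, groups = load()
    # hold out one group for testing and train on all others
    train, test = next(LeaveOneGroupOut().split(toks, tags, groups))

    clf = TreeTagger()
    clf.fit(toks[train], tags[train])
    # set-valued predictions for the first 20 test tokens, shown next to the tokens
    result = pd.DataFrame([toks[test][:20, 1].tolist(), clf.setpredict(toks[test][:20])], index=['token', 'tag']).T

    print(result)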
          token                                                tag
0    stadtrecht                                        {"FM": 1.0}
1  braunschweig  {"NE": 0.946357, "NA": 0.025582, "ADJD": 0.011...
2          1227                     {"OA": 0.5348, "XY": 0.458823}
3     blankline                                   {"$.": 0.995565}
4       SWelich  {"OA": 0.839456, "DIA": 0.087804, "ADJA": 0.03...
5      vo+eghet  {"NA": 0.636112, "VVFIN.*": 0.182379, "NE": 0....
6          enen             {"DIART": 0.934728, "CARDA": 0.062113}
7        richte                                        {"NA": 1.0}
...
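
Each prediction in the output above maps candidate tags to probabilities. The following sketch shows how such a prediction could be reduced to its most probable tag and the number of candidates. It assumes the predictions are either JSON strings (as the double quotes in the printed output suggest) or already-parsed dicts; the helper name `summarize` is hypothetical and not part of the package.

```python
import json


def summarize(prediction):
    """Return (most probable tag, number of candidate tags) for one prediction."""
    # Accept either a JSON string or an already-parsed dict.
    dist = json.loads(prediction) if isinstance(prediction, str) else prediction
    best = max(dist, key=dist.get)
    return best, len(dist)


# Values copied from rows 0 and 2 of the output above.
print(summarize('{"FM": 1.0}'))                     # ('FM', 1)
print(summarize('{"OA": 0.5348, "XY": 0.458823}'))  # ('OA', 2)
```

A candidate count of 1 means the tagger committed to a single tag; larger counts mark tokens where it remained uncertain.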

Citation

@article{heid2020reliable,
    title={Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction},
    author={Stefan Heid and Marcel Wever and Eyke Hüllermeier},
    year={2020},
    eprint={2008.01377},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Acknowledgement

I want to thank my supervisors and co-authors Marcel Wever and Prof. Eyke Hüllermeier for their helpful feedback during the thesis.