Set-Valued Part-of-Speech Tagging

This package provides set-valued POS taggers. The code builds on existing probabilistic taggers such as CoreNLP and the TreeTagger, and it also ships two simple taggers of its own. Information about the baseline can be found in my thesis.
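To make the term "set-valued" concrete: a probabilistic tagger yields a posterior distribution over tags for each token, and a set-valued tagger returns a subset of candidate tags instead of a single tag. The following is a minimal sketch that assumes a simple cumulative-probability threshold. It is not necessarily the rule this package uses (the thesis and the paper describe the actual method), and the function name `tag_set` and the parameter `mass` are hypothetical.

```python
def tag_set(posterior, mass=0.95):
    """Return the smallest list of tags whose probabilities sum to at least `mass`.

    `posterior` maps each tag to its probability for one token.
    Illustrative rule only, not the package's own selection rule.
    """
    chosen, total = [], 0.0
    # Visit tags from most to least probable and stop once enough mass is covered.
    for tag, prob in sorted(posterior.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(tag)
        total += prob
        if total >= mass:
            break
    return chosen


if __name__ == "__main__":
    print(tag_set({"DIART": 0.934728, "CARDA": 0.062113}))  # two candidates needed
    print(tag_set({"FM": 1.0}))                             # a single confident tag
```

A confident posterior yields a single tag, while an ambiguous one yields several candidates, which is the behaviour a set-valued tagger is meant to expose.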

Overview

data contains the corpus data
examples and scripts contain usage examples
setpos contains the implementation

Installation

Disclaimer: The code probably doesn't run without modifications on Windows. It should work on any standard Linux distribution.

Simple

Install the Python package:
```
$ pip install .
```

Complete

Download TreeTagger and place the binaries tree-tagger and train-tree-tagger in the setpos/tagger/treetagger folder. Make sure the executable flag is set on both. The code was tested with TreeTagger version 3.2.2.
Install Java 11 (for CoreNLP).
Install swig-3 (for hyperopt).
Install the Python package:
```
$ pip install .[extra]
```
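The prerequisites above can be checked before running any tagger. The following is a minimal sketch, not part of the package: the function name `check_setup` is hypothetical, the default path assumes the script is run from the repository root, and the Java check only tests that a `java` executable is on the PATH (it does not verify version 11).

```python
import os
import shutil
from pathlib import Path


def check_setup(treetagger_dir="setpos/tagger/treetagger"):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # Both TreeTagger binaries must exist and carry the executable flag.
    for name in ("tree-tagger", "train-tree-tagger"):
        binary = Path(treetagger_dir) / name
        if not binary.is_file():
            problems.append(f"missing binary: {binary}")
        elif not os.access(binary, os.X_OK):
            problems.append(f"not executable: {binary}")
    # CoreNLP needs a Java runtime; only its presence is checked here.
    if shutil.which("java") is None:
        problems.append("java not found on PATH")
    return problems


if __name__ == "__main__":
    for line in check_setup() or ["all checks passed"]:
        print(line)
```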

CoreNLP

The CoreNLP tagger is provided as a patched version. The patch and the packaged jar are in setpos/tagger/corenlp; the patch is applied to this version.

The patch changes the following:

CoreNLP writes the posterior probabilities to debug files (needed for POS tagging).
An additional command-line option modifies the deterministic tag expansion [thesis, 5.5.3].

Data

The data stems from the Intergramm corpus, which in turn includes texts that originate from the ReN project and have been adapted to the Intergramm tagging guidelines. The corpus consists of historic Middle Low German texts. The versions provided here have slight modifications such as orthographic unification.

Usage

import logging

from sklearn.model_selection import LeaveOneGroupOut
import pandas as pd

from setpos.tagger import MostFrequentTag, CoreNLPTagger, TreeTagger
from setpos.data.split import load

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)

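    # load the tokens, their tags, and the group label of each token
    toks, tags, groups = load()
    # take the first leave-one-group-out fold: one group is held out for testing
    train, test = next(LeaveOneGroupOut().split(toks, tags, groups))

    clf = TreeTagger()
    clf.fit(toks[train], tags[train])
    # predict tag sets for the first 20 test tokens and pair them with the token text
    result = pd.DataFrame([toks[test][:20, 1].tolist(), clf.setpredict(toks[test][:20])], index=['token', 'tag']).T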

    print(result)

          token                                                tag
0    stadtrecht                                        {"FM": 1.0}
1  braunschweig  {"NE": 0.946357, "NA": 0.025582, "ADJD": 0.011...
2          1227                     {"OA": 0.5348, "XY": 0.458823}
3     blankline                                   {"$.": 0.995565}
4       SWelich  {"OA": 0.839456, "DIA": 0.087804, "ADJA": 0.03...
5      vo+eghet  {"NA": 0.636112, "VVFIN.*": 0.182379, "NE": 0....
6          enen             {"DIART": 0.934728, "CARDA": 0.062113}
7        richte                                        {"NA": 1.0}
...
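Each prediction printed above maps the candidate tags of a token to their probabilities. The following is a minimal sketch of how such predictions could be summarized. It assumes each prediction is either a JSON-formatted string (as the double quotes in the output suggest) or a plain dict; the helper name `summarize` is hypothetical and not part of the package.

```python
import json


def summarize(prediction):
    """Return (most probable tag, number of candidate tags) for one prediction."""
    # Accept a JSON string or an already parsed mapping of tag -> probability.
    dist = json.loads(prediction) if isinstance(prediction, str) else dict(prediction)
    best = max(dist, key=dist.get)
    return best, len(dist)


if __name__ == "__main__":
    examples = [
        '{"FM": 1.0}',
        '{"OA": 0.5348, "XY": 0.458823}',
        '{"DIART": 0.934728, "CARDA": 0.062113}',
    ]
    for prediction in examples:
        print(summarize(prediction))
```

The set size is a simple indicator of how uncertain the tagger is about a token: a size of one means the tagger committed to a single tag.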

Citation

@article{heid2020reliable,
    title={Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction},
    author={Stefan Heid and Marcel Wever and Eyke Hüllermeier},
    year={2020},
    eprint={2008.01377},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Acknowledgement

I want to thank my supervisors and co-authors Marcel Wever and Prof. Eyke Hüllermeier for their helpful feedback during the thesis.
