Herodotos Project NER Annotation and Tagger

This repository includes texts annotated for named entities as part of the Herodotos Project (Ohio State University / Ghent University), as well as a BiLSTM-CRF (Lample et al., 2016) NER tagger pre-trained on those annotations. Please see the Humanities Entity Recognizer for more details on how it was trained.

Annotation

All texts are in Latin and are taken from the Latin Library Collection (collected by CLTK) or the Perseus Latin Collection. Greek will be added soon.

The data files in the Annotation directory were annotated for named entities by a team of Classics experts at Ohio State University. Texts presently included are excerpts from Caesar's Wars, both Gallic (GW) and Civil (CW), the Plinies' writings, both Elder and Younger, and Ovid's Ars Amatoria.

Names of peoples are annotated as GRP; names of persons are annotated as PRS; and names of geographical places are annotated as GEO in the BIO scheme.
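As an illustration of the labeling scheme, here is a minimal sketch of how BIO tags can be grouped back into entities. It assumes the conventional `B-`/`I-`/`O` prefix form (for example `B-PRS`, `I-PRS`); the exact tag strings used in the data files should be checked against the Annotation directory. The function name `bio_to_spans` is hypothetical and is not part of this repository.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, label) pairs.

    Assumes tags look like "B-PRS", "I-PRS", or "O" (an assumption about
    the tag spelling, not something confirmed by the repository).
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # Start a new entity; an I- tag without a matching open entity
            # is tolerated and treated as a beginning.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": close any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


if __name__ == "__main__":
    toks = ["Caesar", "in", "Galliam", "ad", "Helvetios", "contendit"]
    labs = ["B-PRS", "O", "B-GEO", "O", "B-GRP", "O"]
    print(bio_to_spans(toks, labs))
```

Here a person (PRS), a place (GEO), and a people (GRP) each come back as a separate pair; multi-token names are joined with a single space.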

Further information on the corpus, including splits for training and testing, can be found in Erdmann et al. (2016), "Challenges and Solutions for Latin Named Entity Recognition." For citation purposes, however, please see the Acknowledgments section below for the more recent and relevant publication to cite.

Tagger

The Herodotos Project Latin NER Tagger is trained on the entire set of Latin data included in this repository using the BiLSTM-CRF architecture of Lample et al. (2016), "Neural Architectures for Named Entity Recognition".

Prerequisites

To run the tagger, make sure the below packages and any dependencies have been installed.

Usage

The tagger can be called with the following commands:

cd Herodotos_Project_Latin_NER_tagger
python tagger.py --input sample.in.tok > sample.out.tags

The input should already be tokenized, with clitics separated and one sentence per line, as in sample.in.tok. Near-optimal performance can be achieved with simple punctuation-and-whitespace tokenization, or even whitespace tokenization alone, because cliticization is relatively infrequent in Latin and the tagger handles character-level features robustly. The output lists all named entities identified on each line as triples. Each triple contains: (1) the character offset within the corresponding line where the named entity starts; (2) the full span of the named entity; and (3) the label of the named entity.
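The punctuation-and-whitespace tokenization mentioned above could be sketched as follows. This is not the preprocessing used by the authors; the function name `tokenize_line` is hypothetical, and the sketch assumes that tokens in the input file are separated by single spaces. It does not split clitics such as -que.

```python
import re

# A token is either a run of word characters or a single punctuation mark.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_line(sentence: str) -> str:
    """Return the sentence with tokens separated by single spaces.

    Punctuation is split off from adjacent words; clitics are left attached.
    """
    return " ".join(_TOKEN_RE.findall(sentence))


if __name__ == "__main__":
    raw = "Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae."
    print(tokenize_line(raw))
```

Writing one such line per sentence to a file gives input in the one-sentence-per-line layout described above.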

Alternative input formats can be specified with the --inputFormat option. The supported values are conll (CoNLL format) and crf (CRFsuite format). CoNLL format has one token per line, followed by a tab and then its label (since the tagger predicts the label, it does not matter what you put in the label column). Sentence breaks are indicated by a blank line. See sample.in.conll for an example.

python tagger.py --input sample.in.conll --inputFormat conll > sample.out.tags
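For readers preparing their own CoNLL-style input, the layout described above (token, tab, label, with a blank line between sentences) could be produced with a sketch like the one below. The function name `to_conll` and the placeholder label `O` are assumptions chosen for illustration; the description above only says that the label column is ignored at prediction time.

```python
from typing import Iterable, List


def to_conll(sentences: Iterable[List[str]], placeholder: str = "O") -> str:
    """Render tokenized sentences as token<TAB>label lines.

    Each sentence is followed by a blank line. The placeholder label is
    arbitrary, since the labels are what the tagger predicts.
    """
    lines: List[str] = []
    for tokens in sentences:
        for token in tokens:
            lines.append(f"{token}\t{placeholder}")
        lines.append("")  # blank line marks the sentence break
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = to_conll([["Caesar", "venit"], ["Roma", "manet"]])
    print(text, end="")
```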

Crfsuite formatting is the same as conll but the token-label order is reversed. See sample.in.crf for an example.

python tagger.py --input sample.in.crf --inputFormat crf > sample.out.tags
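Since the two formats differ only in column order, converting between them amounts to swapping the two tab-separated fields on each non-blank line. The sketch below assumes exactly that (tab-separated columns in both formats, as described above); `conll_to_crf` is a hypothetical helper, not a function provided by the tagger.

```python
def conll_to_crf(conll_text: str) -> str:
    """Swap token<TAB>label lines into label<TAB>token order.

    Blank lines (sentence breaks) are preserved. A non-blank line without
    a tab raises ValueError.
    """
    out = []
    for line in conll_text.splitlines():
        if not line.strip():
            out.append("")  # keep the sentence break
            continue
        token, label = line.split("\t", 1)
        out.append(f"{label}\t{token}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(conll_to_crf("Caesar\tB-PRS\nvenit\tO\n\n"), end="")
```

Comparing the result against sample.in.crf is a quick way to confirm that the assumed column layout matches what the tagger expects.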

You can also request different output formats via the --outputFormat option. The following example will output to crfsuite format:

python tagger.py --input sample.in.tok --outputFormat crf > sample.out.crf

And this will output to conll format:

python tagger.py --input sample.in.tok --outputFormat conll > sample.out.conll

Alternatively, you can print out a list of all unique entities identified by label with the list option:

python tagger.py --input sample.in.tok --outputFormat list > sample.out.list

And of course, any combination of input and output formats is supported.

Acknowledgments

If you find either the tagger or the data useful in any way, please cite our forthcoming publication:

Contact ae1541@nyu.edu or any of the co-authors with questions regarding this repository.
