alevatri/diorisis

diorisis

Annotation scripts to generate the Diorisis Ancient Greek Corpus (https://figshare.com/articles/The_Diorisis_Ancient_Greek_Corpus/6187256).

Configure folder paths in config.ini. Extract TreeTaggerData.zip and grkFrm.py.zip into the root folder. TreeTagger must be downloaded and installed separately from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.
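As a rough illustration of how a script might read folder paths from config.ini, here is a minimal sketch using Python's standard `configparser`. The section name `paths`, the sample value, and the helper `load_paths` are assumptions made for this example. Only the `final_corpus` key is named in this README, so check the config.ini shipped with the repository for the real layout.

```python
import configparser

# Hypothetical config.ini contents. The section name "paths" and the value
# are assumptions; only the key `final_corpus` is mentioned in this README.
SAMPLE_CONFIG = """
[paths]
final_corpus = ./final_corpus
"""


def load_paths(text, section="paths"):
    """Parse INI-formatted text and return one section as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser[section])


paths = load_paths(SAMPLE_CONFIG)
print(paths["final_corpus"])
```

To read the actual file instead of a string, `parser.read("config.ini")` can replace `read_string`.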

Pipeline:

  1. tokenizePerseus.py: tokenizer for Perseus GitHub corpus files.
  2. tokenizeConverted.py: tokenizer for open-source corpus files from collections other than Perseus (previously converted into XML files).
  3. corporaParser.py: this script annotates the tokenized corpus files (available from https://figshare.com/articles/Diorisis_Corpus_-_Preprocessed_files/7229162).
  4. TT_corpus_run_and_compare.py: this script runs TreeTagger on the annotated corpus data and checks how many tokens with multiple lemma annotations can be disambiguated. For each token, TreeTagger may select a POS represented by n lemmata (probability of disambiguation: 1/n). In the best-case scenario, TreeTagger identifies a POS represented by only one lemma; in the worst case, it identifies a POS not represented by any lemma in the annotated corpus (probability = 0 by default). The script generates statistics for each file in the corpus. Paths are configured in config.ini.
  5. convert_corpus.py: this script converts the annotated corpus (created through corporaParser.py) into a version in which lemmas are disambiguated with TreeTagger (stored in the folder specified under final_corpus in config.ini).
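
The disambiguation probability described in step 4 can be sketched as follows. This is an illustration of the 1/n rule only, not the code in TT_corpus_run_and_compare.py. The function name, the data layout (a list of lemma/POS pairs per token), and the placeholder lemma and POS labels are all assumptions.

```python
def disambiguation_probability(candidates, tagged_pos):
    """Return 1/n, where n is the number of distinct candidate lemmata
    whose POS matches the POS selected by the tagger, or 0.0 if none match.

    candidates: list of (lemma, pos) pairs annotated for one token
    tagged_pos: the POS the tagger selected for that token
    """
    matching = {lemma for lemma, pos in candidates if pos == tagged_pos}
    n = len(matching)
    return 1 / n if n else 0.0


# Placeholder annotations (hypothetical lemma and POS labels).
token = [("lemma_a", "noun"), ("lemma_b", "verb"), ("lemma_c", "verb")]

best = disambiguation_probability(token, "noun")        # one lemma matches
ambiguous = disambiguation_probability(token, "verb")   # two lemmata match
worst = disambiguation_probability(token, "adjective")  # no lemma matches
print(best, ambiguous, worst)
```

Here the best case yields a probability of 1, the two-lemma case yields 1/2, and a POS with no matching lemma yields 0, mirroring the cases listed in step 4.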

About

Preprocessed texts and annotation scripts to generate the Diorisis Ancient Greek Corpus
