diorisis

Annotation scripts to generate the Diorisis Ancient Greek Corpus (https://figshare.com/articles/The_Diorisis_Ancient_Greek_Corpus/6187256).

Folder paths are configured in config.ini. Extract TreeTaggerData.zip and grkFrm.py.zip into the root folder. TreeTagger is not bundled and must be downloaded and installed separately from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.
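As a rough illustration of the configuration step, the sketch below reads folder paths from an INI file with Python's standard configparser module. The section name `paths`, the key `annotated`, and all folder values are assumptions made for this example; only the key name `final_corpus` is taken from the pipeline description in this README, and the actual layout of config.ini may differ.

```python
import configparser

# Hypothetical config.ini content. Only the key name `final_corpus`
# appears in this README; the section name, the `annotated` key and
# the folder values are placeholders.
SAMPLE_CONFIG = """
[paths]
annotated = /data/diorisis/annotated
final_corpus = /data/diorisis/final
"""


def load_paths(text, section="paths"):
    """Parse INI text and return the options of one section as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser[section])


paths = load_paths(SAMPLE_CONFIG)
print(paths["final_corpus"])
```

To read the real file instead of an inline string, `parser.read("config.ini")` can replace `parser.read_string(text)`.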

Pipeline:

tokenizePerseus.py: tokenizer for Perseus GitHub corpus files.
tokenizeConverted.py: tokenizer for open-source corpus files from collections other than Perseus (converted into XML files beforehand).
corporaParser.py: this script annotates the tokenized corpus files (available from https://figshare.com/articles/Diorisis_Corpus_-_Preprocessed_files/7229162).
TT_corpus_run_and_compare.py: this script runs TreeTagger on the annotated corpus data and checks how many tokens with multiple lemma annotations can be disambiguated. For each such token, TreeTagger may select a POS represented by n lemmata (probability of disambiguation: 1/n). In the best-case scenario, TreeTagger identifies a POS represented by only one lemma (probability = 1); in the worst case, it identifies a POS not represented by any lemma in the annotated corpus (probability = 0 by default). The script generates statistics for each file in the corpus. Paths are configured in config.ini.
convert_corpus.py: this script converts the annotated corpus (created through corporaParser.py) into a version in which lemmas are disambiguated with TreeTagger (stored in the folder specified under final_corpus in config.ini).
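The 1/n rule described for TT_corpus_run_and_compare.py can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name, the (lemma, POS) pair representation, the POS labels and the lemma identifiers are all assumptions made for the example.

```python
from fractions import Fraction


def disambiguation_probability(candidates, tagger_pos):
    """Return the probability of disambiguation for one token.

    candidates: (lemma, pos) pairs attached to the token in the
                annotated corpus (hypothetical representation).
    tagger_pos: the POS selected by TreeTagger for that token.
    """
    # Distinct lemmata whose POS matches the tagger's choice.
    matching = {lemma for lemma, pos in candidates if pos == tagger_pos}
    n = len(matching)
    # 1/n when n lemmata share the POS; 0 when no lemma has that POS.
    return Fraction(1, n) if n else Fraction(0)


# Hypothetical token with three candidate lemmata.
candidates = [("lemma_a", "verb"), ("lemma_b", "verb"), ("lemma_c", "noun")]

best = disambiguation_probability(candidates, "noun")        # one lemma matches
shared = disambiguation_probability(candidates, "verb")      # two lemmata match
worst = disambiguation_probability(candidates, "adjective")  # no lemma matches
print(best, shared, worst)
```

Fractions are used here so that values such as 1/3 stay exact when they are summed or averaged into per-file statistics.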
