TreeTagger part-of-speech tagging models for Sahidic Coptic

Version: 1.10 (includes POS tagging and lemmatization, with DDGLC Greek lemma information)

Model source file: coptic_fine10.par / coptic_coarse10.par

The part-of-speech tagging models are for use with the freely available TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). The models are based on the guidelines of the Coptic SCRIPTORIUM project, which closely follow Layton's (2011) grammar. The lexicon used by the tagger is based on a lexicon kindly provided by Prof. Tito Orlandi and the CMCL project (http://cmcl.let.uniroma1.it/) and a lemma list provided by Prof. Tonio Sebastian Richter and the DDGLC project (http://research.uni-leipzig.de/ddglc/). Please cite the CMCL and DDGLC projects whenever publishing research using the tagging models.

There are two models: one for the coarse-grained tagset, with 22 tags, and one for the fine-grained tagset, which distinguishes 44 tags (including individual tags for each positive and negative conjugation base). For details on the tagset, see the documentation on the Coptic SCRIPTORIUM web page.

To use the models, download and unzip the TreeTagger. In the folder bin/ you will find the TreeTagger executable, which requires one of the two parameter files to run. TreeTagger also expects an input file in a one-token-per-line format. For example, the input file input.txt could include the following tokens (the file must be encoded in UTF-8; the ASCII characters below are for illustration purposes only):

p
noute
pe
.

These will be tagged as:

p	ART
noute	N
pe	COP
.	PUNCT
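The one-token-per-line input format described above can be produced with a few lines of Python. This is a minimal sketch, not part of the tagger distribution: the function names `format_tokens` and `write_token_file` are hypothetical, and the tokens are assumed to come from an upstream tokenizer.

```python
def format_tokens(tokens):
    """Return the tokens as one-token-per-line text (hypothetical helper).

    Empty tokens and tokens containing whitespace are rejected here, on the
    assumption that each input line should hold exactly one token.
    """
    for tok in tokens:
        if not tok or any(ch.isspace() for ch in tok):
            raise ValueError(f"not a single token: {tok!r}")
    return "\n".join(tokens) + "\n"


def write_token_file(tokens, path):
    """Write the tokens to `path`, UTF-8 encoded, one token per line."""
    # newline="\n" keeps the line endings identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(format_tokens(tokens))
```

Calling `write_token_file(["p", "noute", "pe", "."], "input.txt")` would write the four illustrative tokens shown above; real input would contain Coptic characters rather than ASCII.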

To run the tagger, run the TreeTagger executable as follows (Windows example):

tree-tagger.exe coptic_fine.par -token input.txt output.txt

Or to include lemmas in a third column in the output use:

tree-tagger.exe coptic_fine.par -token -lemma input.txt output.txt

The option -token tells the TreeTagger that the input is already tokenized. For a Coptic tokenizer, see the Coptic SCRIPTORIUM project web page. Further options, such as allowing for SGML tags in the input or outputting the word form as a lemma when the lemma is unknown, are described in the TreeTagger documentation. For the coarse-grained tagset, use coptic_coarse.par instead of coptic_fine.par.
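For scripted use, the two commands above can be wrapped in Python. The following is a sketch under stated assumptions: the helper names (`build_command`, `parse_output`, `tag_file`) are hypothetical, the executable and parameter file are assumed to be reachable at the paths you pass in, and the output is assumed to be tab-separated as in the tagged example earlier in this document.

```python
import subprocess


def build_command(executable, par_file, input_path, output_path, lemma=False):
    """Assemble the argument list in the same order as the commands above."""
    cmd = [executable, par_file, "-token"]
    if lemma:
        cmd.append("-lemma")  # adds the lemma as a third output column
    cmd += [input_path, output_path]
    return cmd


def parse_output(text):
    """Split tab-separated tagger output into one tuple per non-empty line."""
    return [tuple(line.split("\t")) for line in text.splitlines() if line.strip()]


def tag_file(executable, par_file, input_path, output_path, lemma=False):
    """Run the tagger on input_path, then read and parse output_path."""
    cmd = build_command(executable, par_file, input_path, output_path, lemma)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    with open(output_path, encoding="utf-8") as handle:
        return parse_output(handle.read())
```

For example, `tag_file("tree-tagger.exe", "coptic_fine.par", "input.txt", "output.txt", lemma=True)` would run the second command above and return one (token, tag, lemma) tuple per line of output.txt.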
