Coptic Scriptorium - Corpora

This is the public repository for Coptic SCRIPTORIUM corpora. The documents are available in multiple formats: CoNLL-U, relANNIS, PAULA XML, TEI XML, and TreeTagger SGML (*.tt). The *.tt files generally contain the most complete representations of document annotations, but corpus-level metadata is included only in the PAULA XML and relANNIS versions.

Corpora can be viewed and searched, including with complex queries, at http://data.copticscriptorium.org. The project homepage is http://copticscriptorium.org

Metadata and annotation quality

Metadata about each document is easiest to obtain by looking at the first line of the files in the respective *_TT directory of each corpus. Five types of annotation quality metadata are available:

segmentation (Coptic word-internal splitting into bound groups and morphemes)
tagging (parts of speech, using the Scriptorium fine-grained tagset)
parsing (Universal Dependencies parses)
entities (classification of all referring expressions into 10 categories)
identities (linking of all named entity spans to corresponding Wikipedia articles, a.k.a. Wikification)

Values for these metadata are:

automatic - machine annotations only
checked - checked for accuracy by an expert in Coptic
gold - closely reviewed for accuracy, usually as a result of treebanking
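
As a minimal sketch of reading these values, the Python snippet below pulls attribute pairs out of the first line of a *.tt file. It assumes that line is an SGML-style opening tag carrying key="value" attributes; the function names, the tag name `meta`, and the sample line are hypothetical and only illustrate the idea.

```python
import re

# Matches key="value" pairs inside an SGML-style tag (assumed format).
ATTR_RE = re.compile(r'([\w.:-]+)="([^"]*)"')

# The five annotation quality fields listed above.
QUALITY_KEYS = ("segmentation", "tagging", "parsing", "entities", "identities")


def parse_tt_metadata(first_line):
    """Return all key="value" attributes found on the line as a dict."""
    return dict(ATTR_RE.findall(first_line))


def annotation_quality(meta):
    """Pick out the five quality fields; fields that are absent map to None."""
    return {key: meta.get(key) for key in QUALITY_KEYS}


def read_tt_metadata(path):
    """Read only the first line of a *.tt file and parse its attributes."""
    with open(path, encoding="utf-8") as handle:
        return parse_tt_metadata(handle.readline())


# Hypothetical first line, for illustration only.
sample = '<meta title="Example" segmentation="gold" tagging="gold" parsing="automatic">'
quality = annotation_quality(parse_tt_metadata(sample))
```

Fields that do not appear on the line come back as None rather than raising an error, so documents with partial metadata can still be inspected.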

Notes on duplicates and redundancies

Some of the data in this repository contains duplicate information. In particular, the coptic-treebank corpus is a convenient collection of all gold-standard treebanked data (manual syntactic analyses), all of which is included in other source corpora (which are often not 100% gold parsed). The documents in the treebank are identical to the same documents in the source corpora (e.g. XH204-216 is included in both its source corpus folder shenoute-fox and the treebank).

Additionally, individual book corpora from the Old and New Testaments with some or all manual annotations (sahidica.mark, sahidica.1corinthians, sahidic.ruth) are also represented in the large and completely automatically annotated sahidica.nt and sahidic.ot. The analyses of the same document may differ slightly between these corpora, and the individual book corpora are generally more accurate.

Finally, some documents represent parallel witnesses of other documents (different manuscript versions of the same conceptual text). These are not necessarily text-identical to each other, but for quantitative work in which double-counting the same or very similar text is undesirable, you may wish to filter them out. They can be identified in *.tt files by the metadatum redundant="yes".
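
One way to do this filtering is sketched below in Python. It assumes the redundant="yes" flag appears on the first line of each *.tt file, alongside the other document metadata; the helper names and the root directory argument are hypothetical.

```python
import re
from pathlib import Path

# Assumed form of the flag, as quoted above: redundant="yes"
REDUNDANT_RE = re.compile(r'\bredundant="yes"')


def is_redundant(first_line):
    """True if the metadata line marks the document as a parallel witness."""
    return bool(REDUNDANT_RE.search(first_line))


def non_redundant_tt_files(root):
    """Yield *.tt paths under root whose first line lacks redundant="yes"."""
    for path in sorted(Path(root).rglob("*.tt")):
        with open(path, encoding="utf-8") as handle:
            if not is_redundant(handle.readline()):
                yield path
```

Calling `list(non_redundant_tt_files("."))` from a checkout of the repository would then give the documents to keep, under the assumptions stated above.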

Sources and licenses

All the documents are licensed CC-BY 3.0 (https://creativecommons.org/licenses/by/3.0/us/) or 4.0 (https://creativecommons.org/licenses/by/4.0/) unless otherwise indicated. Major exceptions include:

Sahidica New Testament specific license (http://www.copticscriptorium.org/download/corpora/Mark/coptic_nt_sahidic.html)
Canons of Apa Johannes CC-BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/)
Sahidic Old Testament CC-BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/)

Individual files also contain licensing information.
