aXa-trees

aXa stands for arbustum ΧΡΙΣΤΙΑΝΩΝ antiquum, "ancient forest of Christians". Its symbol is a nomen sacrum evoking ancient scribal practices and a nod to the centrality of Christian scripture in historical corpus linguistics.

The Problem

While lots of other treebanks of New Testament texts exist, most of them do not conform to established international standards. This makes parsing and comparison with most Classical and Postclassical Greek linguistic corpora difficult.

The Background

These treebanks simultaneously follow and adopt both of the leading international standards for syntactical treebanks: Universal Dependencies (UD) and Ancient Greek Dependency Treebanks (AGDT). The UD standard is itself a combination, specifically of the Stanford dependencies and Google universal tags. Universal Dependencies now extends to test/train data and models in over 200 languages. The AGDT standard was developed among leading scholars of Classical Greek, and it boasts the greatest amount of gold-standard, expert-tagged syntactical data of Classical Greek in the world (over 500k tokens), with millions more tokens produced in a largely identical format by the GlauX project.

The Purpose

By adopting both UD and AGDT formats, then aligning and combining them into a single representation, we maximize the potential for cross-validation, for the use of these treebanks as crosswalks between the two leading formats, and for broad accessibility and comparison within computational linguistic analyses and applications.

The Credits

These Postclassical Greek syntactical trees (aka dependency treebanks) are (to varying degrees) typically based on earlier openly licensed treebanks and projects as detailed below.

PROIEL - Universal Dependencies format - CoNLL-U

The PROIEL treebanks include most Greek New Testament texts based on the Editio Octava (Eighth Edition) of the Greek New Testament (GNT) by Constantin Tischendorf (1869/1872), a work now in the public domain. The latest version of the PROIEL xml and conllu files may be found in the Syntacticus repo. The conllu data used in this repo are largely derived from the tripartite-split files used in the Universal Dependencies Treebanks PROIEL model. When specific verses, chapter, or books are missing, these are patched with a combination routine: extracting UD-parallel elements from the PROIEL xml and pulling other UD elements by passing the patch text through the UDpipe PROIEL model, with expert edits and corrections as needed. The PROIEL Treebank is freely available under a CC BY-NC-SA 3.0 license.

Dag T. T. Haug and Marius L. Jøhndal. 2008. "Creating a Parallel Treebank of the Old Indo-European Bible Translations'. In Caroline Sporleder and Kiril Ribarov (eds.). Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), pp. 27-34. doi:10.5281/zenodo.10655061.

GlauX - Ancient Greek and Latin Dependency format - XML

The GlauX treebanks of New Testament texts were produced based on the manual tagging of the PROIEL project, again following Tischendorf's GNT, with automated, semi-automated, and manual transformations applied. The GlauX treebanks follow the AGDT standard in most respects, and were produced under the expert direction of Toon van Haal and Alek Keersmaekers at KU-Leuven. They are now available in Alek Keersmaekers' GlauX repo under a CC BY-SA 3.0 license. For a representative publication of their work, see:

Keersmaekers, A. 2021. "The GLAUx corpus: methodological issues in designing a long-term, diverse, multi-layered corpus of Ancient Greek." Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021, 39–50. Online: Association for Computational Linguistics. doi:10.18653/v1/2021.lchange-1.6.

The Invitation

Collaborators are welcome to join in our efforts to curate gold-standard UD- and AGDT-treebanks of early Christian texts. The most effective way to contribute is to make pull requests and propose edits to the edited source files (xml and conllu), rather than the original files or the combined output, since the original files are needed for reference and the combined output is generated with a scripting pipeline.

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
.github		.github
0031.001		0031.001
0031.002		0031.002
0031.003		0031.003
0031.004		0031.004
0031.005		0031.005
0031.006		0031.006
0031.007		0031.007
0031.008		0031.008
0031.009		0031.009
0031.010		0031.010
0031.011		0031.011
0031.012		0031.012
0031.013		0031.013
0031.014		0031.014
0031.015		0031.015
0031.016		0031.016
0031.017		0031.017
0031.018		0031.018
0031.019		0031.019
0031.020		0031.020
0031.027		0031.027
LICENSE.md		LICENSE.md
README.md		README.md
treebanks.md		treebanks.md

License

mgbilby/aXa-trees

Folders and files

Latest commit

History

Repository files navigation

aXa-trees

The Problem

The Background

The Purpose

The Credits

PROIEL - Universal Dependencies format - CoNLL-U

GlauX - Ancient Greek and Latin Dependency format - XML

The Invitation

About

Topics

Resources

License

Stars

Watchers

Forks

Sponsor this project