Gaelic Resources

A list of computational resources for Gaelic.

This list grew out of https://github.com/RichardLitt/endangered-languages, my list of open source resources for low-resource languages. Going forward, I'm particularly interested in Gaelic.

Tools

Hunspell-gd

Kevin Scannell has a repository with data files and scripts for building Scottish Gaelic spell checkers. It was started through the Crúbadán project and is GPL licensed. This hunspell-gd repo is likely derivative.

Corpora

Annotated Reference Corpus of Scottish Gaelic (ARCOSG)

A representative, tagged corpus of Scottish Gaelic, divided into 8 registers (4 spoken, 4 written) of approximately 10k words each. The corpus is presented as individual txt files.

The corpus was hand-tagged by Lamb, Arbuthnot and Naismith and separately verified by them. It uses the Brown format tag separators ('/': e.g. 'agus/Cc') and an annotation scheme derived from the Irish PAROLE tagset (see Uí Dhonnchadha, E. and van Genabith, J. 2006. A Part-of-Speech tagger for Irish using finite state morphology and constraint grammar disambiguation. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), 2241-2244.).
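As a rough illustration of the Brown-style 'word/TAG' format, the sketch below splits one line of tagged text into (word, tag) pairs. It assumes tokens are separated by whitespace, which is an assumption about the file layout rather than something stated in the corpus documentation. The function name `parse_tagged_line` is hypothetical, and the only real example taken from the corpus description is 'agus/Cc'.

```python
from typing import List, Tuple


def parse_tagged_line(line: str, sep: str = "/") -> List[Tuple[str, str]]:
    """Split a line of whitespace-delimited word/TAG tokens into (word, tag) pairs.

    Assumption: tokens are separated by whitespace and the tag follows
    the LAST separator in each token.
    """
    pairs = []
    for token in line.split():
        # rpartition splits on the last separator, so a word that itself
        # contains '/' keeps everything before the final slash.
        word, found, tag = token.rpartition(sep)
        if not found:
            # No separator in this token: keep it and leave the tag empty.
            pairs.append((token, ""))
        else:
            pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    print(parse_tagged_line("agus/Cc"))
```

Splitting on the last separator is a design choice for this sketch; check the annotation guidelines before relying on it for tokens that contain a slash.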

The annotation scheme is described in a PDF included with the data: Lamb, W. and Naismith, S. (2014). Scottish Gaelic Part-of-Speech Annotation Guidelines.

This work was funded by Bòrd na Gàidhlig and the Carnegie Trust for the Universities of Scotland.

DASG Corpas na Gàidhlig

Corpas na Gàidhlig is a constituent project of DASG. It was founded in 2008 with the following aims:

to create a comprehensive electronic corpus of Scottish Gaelic texts for students and researchers of Scottish Gaelic language, literature and culture;
to provide the textual basis for the interuniversity project Faclair na Gàidhlig (‘Dictionary of the Scottish Gaelic Language’) upon which the future historical dictionary will be based;
to provide a resource which will facilitate corpus planning and corpus development technology for Gaelic.

The first phase of Corpas na Gàidhlig aims to digitise 337 texts from all periods of Gaelic literature and to include a wide variety of genres, including poetry, prose, song, and folklore. These texts (listed below) have been prioritised in order to provide part of the textual basis for the interuniversity dictionary project, Faclair na Gàidhlig. It is envisaged as Corpas na Gàidhlig progresses that a broad range of other texts will be added, and in time, that speech will also be represented by text and sound files. In the long term, the Corpus will be used to update the dictionary.

To date over 19 million words, mostly Gaelic, have been captured.

The 337 texts to be digitised as part of Phase 1 are listed here (if the appropriate permissions are received).

Lancaster Scottish Gaelic corpus

Corpus contents:

conversation.txt - an informal conversation
lecture.txt - a university lecture on philosophy
sermon.txt - a sermon from a Church of Scotland communion service
service.txt - a second sermon
talk.txt - an informal educational/historical/religious talk

All files are encoded in UTF-8 format.
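Because the files are UTF-8, it is worth passing the encoding explicitly when reading them, so accented characters such as 'à' survive on platforms whose default encoding differs. The sketch below is one way to do that in Python. The function name `read_corpus_text` is hypothetical, and the NFC normalisation step is an optional extra assumed here for convenience, not something the corpus requires.

```python
import unicodedata
from pathlib import Path


def read_corpus_text(path: str) -> str:
    """Read one corpus file as UTF-8 and return NFC-normalised text."""
    raw = Path(path).read_text(encoding="utf-8")
    # NFC composes a base letter plus a combining accent into a single
    # code point, so the same word always compares equal. This step is
    # an assumption of the sketch; drop it to keep the text byte-exact.
    return unicodedata.normalize("NFC", raw)
```

For example, `read_corpus_text("talk.txt")` would return the contents of the talk file as a single string, assuming the file is in the current directory.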

Contribute

Please add stuff!

License

The Unlicense
