RichardLitt/gaelic-resources

Gaelic Resources

A list of computational resources for Gaelic.

This list has grown out of https://github.com/RichardLitt/endangered-languages, my list of open source resources for low-resource languages. Going forward, I'm particularly interested in Gaelic.

Tools

Kevin Scannell has a repository with data files and scripts for building Scottish Gaelic spell checkers. This work was started through the Crúbadán project. GPL licensed. This hunspell-gd repo is likely derivative.

Corpora

A representative, tagged corpus of Scottish Gaelic, divided into 8 registers (4 spoken, 4 written) of approximately 10k words each. The corpus is presented as individual txt files.

The corpus was hand-tagged by Lamb, Arbuthnot and Naismith and separately verified by them. It uses the Brown format tag separators ('/': e.g. 'agus/Cc') and an annotation scheme derived from the Irish PAROLE tagset (see Uí Dhonnchadha, E. and van Genabith, J. 2006. A Part-of-Speech tagger for Irish using finite state morphology and constraint grammar disambiguation. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), 2241-2244.).
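As a rough illustration of the Brown-style separator described above, here is a minimal Python sketch that splits `word/TAG` tokens into pairs. The function name `parse_tagged` is hypothetical, and the only token taken from the corpus description is `agus/Cc`; `facal/TAG` uses a placeholder tag, not one from the annotation scheme. It also assumes tokens are whitespace-separated, which should be checked against the actual files.

```python
def parse_tagged(text):
    """Split Brown-style 'word/TAG' tokens into (word, tag) pairs.

    Assumes tokens are separated by whitespace (an assumption about
    the file layout, not something the corpus description states).
    """
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST '/', so a word that itself
        # contains '/' keeps everything before the final separator.
        word, sep, tag = token.rpartition("/")
        if sep:
            pairs.append((word, tag))
        else:
            # No separator found: keep the token and mark it untagged.
            pairs.append((token, None))
    return pairs


if __name__ == "__main__":
    # 'agus/Cc' comes from the corpus description; 'facal/TAG' uses
    # a placeholder tag purely for illustration.
    print(parse_tagged("agus/Cc facal/TAG"))
```

Splitting on the last `/` rather than the first is a design choice that keeps any slash inside a word with the word; whether such tokens occur in the corpus is not something this sketch verifies.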

The annotation scheme is described in a PDF included with the data: Lamb, W. and Naismith, S. (2014) Scottish Gaelic Part-of-Speech Annotation Guidelines.

This work was funded by Bòrd na Gàidhlig and Carnegie Trust for the Universities of Scotland.

Corpas na Gàidhlig is a constituent project of DASG. It was founded in 2008 with the following aims:

- to create a comprehensive electronic corpus of Scottish Gaelic texts for students and researchers of Scottish Gaelic language, literature and culture
- to provide the textual basis for the interuniversity project Faclair na Gàidhlig (‘Dictionary of the Scottish Gaelic Language’) upon which the future historical dictionary will be based
- to provide a resource which will facilitate corpus planning and corpus development technology for Gaelic

The first phase of Corpas na Gàidhlig aims to digitise 337 texts from all periods of Gaelic literature and to include a wide variety of genres, including poetry, prose, song, and folklore. These texts (listed below) have been prioritised in order to provide part of the textual basis for the interuniversity dictionary project, Faclair na Gàidhlig. It is envisaged as Corpas na Gàidhlig progresses that a broad range of other texts will be added, and in time, that speech will also be represented by text and sound files. In the long term, the Corpus will be used to update the dictionary.

To date over 19 million words, mostly Gaelic, have been captured.

The 337 texts to be digitised as part of Phase 1 are listed here (if the appropriate permissions are received).

Corpus contents:

- conversation.txt - an informal conversation
- lecture.txt - a university lecture on philosophy
- sermon.txt - a sermon from a Church of Scotland communion service
- service.txt - a second sermon
- talk.txt - an informal educational/historical/religious talk

All files are encoded in UTF-8 format.
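Since the files are UTF-8 text, a short Python sketch for reading one and tallying its part-of-speech tags might look like the following. The function name `count_tags` and the `corpus` directory are hypothetical, and the sketch assumes whitespace-separated `word/TAG` tokens (as in `agus/Cc`); adjust it if the real layout differs.

```python
from collections import Counter
from pathlib import Path


def count_tags(path):
    """Count tags in a UTF-8 file of whitespace-separated 'word/TAG' tokens."""
    # Pass the encoding explicitly rather than relying on the platform
    # default, which is not UTF-8 everywhere.
    text = Path(path).read_text(encoding="utf-8")
    counts = Counter()
    for token in text.split():
        word, sep, tag = token.rpartition("/")
        if sep and tag:
            counts[tag] += 1
    return counts


if __name__ == "__main__":
    corpus_dir = Path("corpus")  # hypothetical local directory
    names = ["conversation.txt", "lecture.txt", "sermon.txt",
             "service.txt", "talk.txt"]
    for name in names:
        path = corpus_dir / name
        if path.exists():
            print(name, count_tags(path).most_common(5))
```

Tokens without a `/` are skipped here rather than counted, so the totals reflect tagged tokens only.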

Contribute

Please add stuff!

License

The Unlicense