Faroese implementation of ABLTagger

This repository contains the data, models, and scripts used to train and evaluate ABLTagger, a BiLSTM-based neural PoS tagger, on Faroese.

Using the ~100,000-token Sosialurin corpus with a revised tagging scheme and an experimental morphological database for Faroese, the trained model achieves an overall accuracy of 91.40% when evaluated using 10-fold cross-validation.
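The evaluation setup can be illustrated with a minimal sketch of 10-fold cross-validation for a tagger. Everything here is an assumption made for illustration: the function names are hypothetical, the folds are built by simple striding (the repository ships its own precomputed splits), accuracy is pooled over all test tokens (the reported figure may instead be a mean of per-fold accuracies), and a most-frequent-tag baseline stands in for ABLTagger.

```python
from collections import Counter, defaultdict
from typing import Callable, Iterator, List, Sequence, Tuple

Sentence = List[Tuple[str, str]]          # (token, gold tag) pairs
Tagger = Callable[[List[str]], List[str]]  # tokens -> predicted tags


def k_fold_splits(
    sentences: Sequence[Sentence], k: int = 10
) -> Iterator[Tuple[List[Sentence], List[Sentence]]]:
    """Yield (train, test) pairs; fold i holds every k-th sentence starting at i."""
    for i in range(k):
        test = [s for j, s in enumerate(sentences) if j % k == i]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train, test


def train_most_frequent_tag(train: Sequence[Sentence]) -> Tagger:
    """Baseline stand-in for a real tagger: pick each word's most frequent tag."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in train:
        for token, tag in sentence:
            per_word[token][tag] += 1
            overall[tag] += 1
    fallback = overall.most_common(1)[0][0]  # used for unseen words

    def tag(tokens: List[str]) -> List[str]:
        return [
            per_word[t].most_common(1)[0][0] if t in per_word else fallback
            for t in tokens
        ]

    return tag


def cross_validated_accuracy(
    sentences: Sequence[Sentence],
    train_fn: Callable[[Sequence[Sentence]], Tagger],
    k: int = 10,
) -> float:
    """Token-level accuracy pooled over all k test folds."""
    correct = total = 0
    for train, test in k_fold_splits(sentences, k):
        tagger = train_fn(train)
        for sentence in test:
            predicted = tagger([token for token, _ in sentence])
            for (_, gold), pred in zip(sentence, predicted):
                correct += gold == pred
                total += 1
    return correct / total
```

Calling `cross_validated_accuracy(sentences, train_most_frequent_tag)` on a list of tagged sentences returns a single accuracy between 0 and 1; swapping in a different `train_fn` changes the tagger without changing the evaluation loop.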

Sosialurin corpus

- Whole corpus files:
  - original.txt - The corpus unchanged from 2004/2011
  - fo.txt - Comments removed and whitespace issues fixed
  - fo.cleaned.txt - Tokenization issues fixed and newlines added
  - fo.revised-verbs-unchanged.txt - Same as above, with the revised tagset except for plural verbs
  - fo.revised.txt - Fully revised tagset and tokenization
- 10-fold splits of three versions of the corpus (for cross-validation)
- Tagset descriptions - Both original and revised
- Original license waiver from 2011
- Description of contents

Inflection data

As ABLTagger makes use of a morphological database in the DIM basic format, an Experimental Database of Faroese Morphology (EDFM) was compiled from various sources and formatted accordingly. The EDFM contains about 1,000,000 inflectional forms in 67,180 individual paradigms and is stored in the file edfm.csv.
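As a rough illustration of how such a file might be read, the sketch below assumes that edfm.csv uses six semicolon-separated fields per line (lemma, paradigm identifier, word class, domain, inflected form, grammatical tag), which is how the DIM basic format is commonly described. The field names, the `read_entries` and `tags_by_form` helpers, and the sample rows in the usage note are hypothetical and should be checked against the actual file.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set, TextIO, Tuple


class Entry(NamedTuple):
    lemma: str
    paradigm_id: str
    word_class: str
    domain: str
    form: str
    tag: str


def read_entries(handle: TextIO, delimiter: str = ";") -> List[Entry]:
    """Parse one inflectional form per line; skip rows without six fields."""
    reader = csv.reader(handle, delimiter=delimiter)
    return [Entry(*row) for row in reader if len(row) == 6]


def tags_by_form(entries: Iterable[Entry]) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each lowercased word form to the (word class, tag) pairs it can carry."""
    index: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for entry in entries:
        index[entry.form.lower()].add((entry.word_class, entry.tag))
    return index
```

With an in-memory sample such as `io.StringIO("hestur;1;kk;alm;hestur;NFET\nhestur;1;kk;alm;hest;ÞFET\n")` (made-up rows, not taken from the EDFM), `read_entries` returns two entries and `tags_by_form` maps `"hest"` to `{("kk", "ÞFET")}`. For the real file, the handle would come from `open("edfm.csv", encoding="utf-8", newline="")`.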

The contents of the EDFM are described below. The sources of inflectional paradigms were the Faroese dictionary foundation (OBG), the Faroese naming committee (Navnanevndin) and Wiktionary (via UniMorph). Additionally, paradigms were generated from OBG data using scripts (OBG-gen), and various paradigms, mostly for non-inflecting words, were created manually. The OBG and Navnanevndin paradigms were accessed via Sprotin.fo.

Word class	OBG	Navnanevndin	Wiktionary	OBG-gen	Manual	Total
Adjectives	11,907	-	16	-	-	11,923
Adverbs	1,289	-	-	-	-	1,289
Conjunctions	-	-	-	-	61	61
Interjections	-	-	-	-	115	115
Nouns	46,492	1,667	113	-	-	48,272
Numerals	-	-	-	47	57	104
Prepositions	-	-	-	-	62	62
Pronouns	-	-	-	-	20	20
Verbs	-	-	7	5,327	-	5,334
Total	59,688	1,667	136	5,374	315	67,180
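The totals in the table can be recomputed from the per-source paradigm counts. The short sketch below does this, treating every "-" cell as zero; the variable names are illustrative only.

```python
# Paradigm counts per word class, in the column order of the table.
SOURCES = ["OBG", "Navnanevndin", "Wiktionary", "OBG-gen", "Manual"]
COUNTS = {
    "Adjectives":    [11907, 0, 16, 0, 0],
    "Adverbs":       [1289, 0, 0, 0, 0],
    "Conjunctions":  [0, 0, 0, 0, 61],
    "Interjections": [0, 0, 0, 0, 115],
    "Nouns":         [46492, 1667, 113, 0, 0],
    "Numerals":      [0, 0, 0, 47, 57],
    "Prepositions":  [0, 0, 0, 0, 62],
    "Pronouns":      [0, 0, 0, 0, 20],
    "Verbs":         [0, 0, 7, 5327, 0],
}

# Sum across sources for each word class (the rightmost column).
row_totals = {word_class: sum(values) for word_class, values in COUNTS.items()}

# Sum across word classes for each source (the bottom row).
column_totals = dict(zip(SOURCES, (sum(col) for col in zip(*COUNTS.values()))))

# Total number of paradigms in the database.
grand_total = sum(row_totals.values())

print(row_totals)
print(column_totals)
print(grand_total)
```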

Scripts used

The scripts used in all stages of the project are in the scripts folder. They are sorted into three groups, inflection, tagset and corpus_stuff, according to what they were used for. Beyond that, they are not organized in any particular way.
