Some Faroese language statistics taken from fo.wikipedia.org content dump

macbre/faroese-corpus

faroese-corpus

Faroese corpus taken from Wikipedia dumps.

This repository will contain a corpus of the Faroese language taken from the content dump of the Faroese Wikipedia.

pipenv

This project uses pipenv to manage its dependencies. See the pipenv documentation for installation instructions.

Dependencies

In order to read 7zip archives (used by Wikia's XML dumps) you need to install libarchive:

pipenv install
sudo apt install libarchive-dev

Links

Scripts

Run pipenv shell before running them.

words_from_dump.py

Shows the longest words taken from the dump, each with its rank and its length in characters:

1 llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch - 58
2 samvinnufelagiðsamvinnufelagnum - 31
3 krabbameinsgranskingarstovnurin - 31
4 southernplayalisticadillacmuzik - 31
5 barnabókavirðislønavinnararnar - 30
6 norðurlandameistarakappingini - 29
7 sjónvarpsundirhaldssendingini - 29
8 bókmentakritikaraheiðurslønir - 29
9 einstaklingaítróttargreinunum - 29
10 vegsúkklukappingarmeistaranum - 29
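
A ranking like the one above could be produced along the following lines. This is only a minimal sketch, not the actual contents of words_from_dump.py: the names `WORD_RE` and `longest_words` are hypothetical, and it assumes the dump text has already been extracted into a plain string. Words are taken to be runs of Unicode letters, lowercased and de-duplicated, then sorted by length (ties broken alphabetically).

```python
import re

# Assumption: a "word" is a run of Unicode letters (no digits, no underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def longest_words(text, limit=10):
    """Return up to `limit` distinct lowercased words, longest first."""
    words = {match.group(0).lower() for match in WORD_RE.finditer(text)}
    # Sort by descending length, then alphabetically for a stable order.
    return sorted(words, key=lambda word: (-len(word), word))[:limit]


# Sample input built from two words in the list above plus a short one.
sample = "krabbameinsgranskingarstovnurin bókmentakritikaraheiðurslønir corpus"
for rank, word in enumerate(longest_words(sample, limit=3), start=1):
    print(f"{rank} {word} - {len(word)}")
```

The output uses the same `rank word - length` layout as the list above. A real run over the Wikipedia dump would also pick up foreign names and titles, which is why entries such as llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch appear in the results.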