dohliam/hawaiian-corpus

hawaiian-corpus - Data from a corpus of written Hawaiian

This repository contains data based on a corpus of texts written in the Hawaiian language (ʻŌlelo Hawaiʻi). The data includes frequency lists, stopwords, and lists of most common n-grams. The text in the corpus was obtained from Ulukau, the Hawaiian Electronic Library.

The corpus contains a total of 10.7 million words and was restricted to modern (post-20th century), non-scriptural text. An overview of statistics for the corpus (including the most common words and n-grams) can be seen here.
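
As a rough illustration of what frequency lists and n-gram lists contain, the sketch below counts words and bigrams in a short sample string. It is not the script used to build this repository's data: the function names (`tokenize`, `ngrams`, `frequency_list`), the tokenization rule, and the sample sentence are all assumptions made for the example, and the sample text is not taken from the corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a token is a run of word characters (letters, digits,
    underscore). The ʻokina (U+02BB) is listed explicitly so that it is
    clearly kept as part of a word; macron vowels are matched by \\w.
    """
    return re.findall(r"[\wʻ]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_list(items, top=None):
    """Return (item, count) pairs, most frequent first."""
    return Counter(items).most_common(top)


# Made-up sample text for illustration only (not from the corpus).
sample = "Aloha kākou. Aloha nō kākou i ka ʻōlelo Hawaiʻi."

tokens = tokenize(sample)
word_frequencies = frequency_list(tokens, top=2)
bigram_frequencies = frequency_list(ngrams(tokens, 2))

print(word_frequencies)
print(bigram_frequencies[:3])
```

A stopword list could be derived from a frequency list like this one by taking its most frequent entries, though the cutoff used for the lists in this repository is not stated here.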

Data

Files included in this repository:

License

CC0.