German Word Frequencies

Simple word-to-frequency mappings for the German language, built from text corpora and stemmed with the CISTEM stemmer. May be useful for various purposes.

Data

cow16 (~ 42 million unique stemmed words)

The source data already contains a frequency list, but it was still preprocessed using the routine in the decow/ folder.

Word Frequencies

License & Attribution

The original corpus is licensed under Creative Commons Attribution 4.0.

opensubtitles (~ 900k unique stemmed words)

Word Frequencies

License & Attribution

P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

Example Usage

Download and extract one of the archives, then use it as follows (warning: loading the full table this way may use a lot of memory):

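import pandas as pd
import nltk

word = 'Onlineumfrage'

# CISTEM stemmer as shipped with NLTK
stemmer = nltk.stem.Cistem()
# Load the whole frequency table into memory, indexed by stemmed word
df = pd.read_csv('~/decow_wordfreq_cistem.csv', index_col=['word'])
# Stem the query word, then look up its frequency
df.at[stemmer.stem(word), 'freq'] # => 8490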
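
If memory is a concern, a single lookup can also be done by streaming the file row by row instead of loading it into a DataFrame. The following is only a minimal sketch, not part of this repository: the helper name `lookup_freq` is hypothetical, and it assumes the CSV has a header row with `word` and `freq` columns (as the pandas example above implies) and that `freq` holds integer counts.

```python
import csv
from typing import Optional


def lookup_freq(csv_path: str, stem: str) -> Optional[int]:
    """Return the frequency stored for `stem`, or None if it is absent.

    Hypothetical helper. Assumes a header row with 'word' and 'freq'
    columns and integer frequency values.
    """
    with open(csv_path, newline='', encoding='utf-8') as f:
        # DictReader yields one row at a time, so only a single row
        # is held in memory regardless of the file size.
        for row in csv.DictReader(f):
            if row['word'] == stem:
                return int(row['freq'])
    return None
```

The function expects an already stemmed key, for example `stemmer.stem(word)` from the example above. Unlike pandas, `open()` does not expand `~`, so pass the path through `os.path.expanduser` first. The trade-off is speed: every call scans the file from the start, so this suits occasional lookups rather than bulk queries.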