Multilanguage support

Tools for various languages

This page collects tools that still need to be verified and could then be used to add support for multiple languages to corpkit.

Tokenisers

Both NLTK and CoreNLP should work well with any European language for tokenisation. That said, language-specific tools are preferable when available and easy to implement.

NLTK's tokenisers are already shipped with corpkit, so language-specific tokenising for many languages should be easy enough to implement.

Japanese tokenisation looks doable via tinysegmenter.
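
One way to prefer language-specific tokenisers while keeping a generic fallback is a small registry keyed by language name. The sketch below is only an illustration and is not corpkit's actual code: the names TOKENISERS, register, tokenise and generic_tokenise are hypothetical, and the French rule is a toy regular expression standing in for a real language-specific tool such as an NLTK tokeniser or tinysegmenter.

```python
import re

# Generic fallback: runs of word characters, or single non-space symbols.
_GENERIC = re.compile(r"\w+|[^\w\s]")


def generic_tokenise(text):
    """Language-agnostic tokeniser used when nothing better is registered."""
    return _GENERIC.findall(text)


# Hypothetical registry: lower-cased language name -> tokeniser callable.
TOKENISERS = {}


def register(language, func):
    """Register a language-specific tokeniser (any callable taking a string)."""
    TOKENISERS[language.lower()] = func


def tokenise(text, language="english"):
    """Use the language-specific tokeniser if one exists, else the fallback."""
    func = TOKENISERS.get(language.lower(), generic_tokenise)
    return func(text)


# Toy language-specific rule: keep French elisions such as "L'" as one token.
_FRENCH = re.compile(r"\w+'|\w+|[^\w\s]")
register("french", _FRENCH.findall)

print(tokenise("L'homme arrive.", "french"))
print(tokenise("The man arrives.", "german"))  # no German entry: fallback
```

With this layout, adding a language means registering one callable, and languages without a dedicated tool still get usable output.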

Stemmers

NLTK contains Snowball stemmers for many European languages:

Danish
Dutch
English
Finnish
French
German
Hungarian
Italian
Norwegian
Portuguese
Romanian
Russian
Spanish
Swedish
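
As a rough sketch of how these stemmers could be called, the snippet below wraps NLTK's SnowballStemmer, which takes a lower-case language name. The helper stem_words is a hypothetical name, not part of corpkit or NLTK, and returning the words unchanged when NLTK is missing or the language is unsupported is an assumption made for illustration.

```python
try:
    from nltk.stem.snowball import SnowballStemmer
except ImportError:  # NLTK is not installed
    SnowballStemmer = None


def stem_words(words, language="english"):
    """Stem each word with NLTK's Snowball stemmer for `language`.

    Assumption for this sketch: if NLTK is unavailable or the language
    has no Snowball stemmer, the words are returned unchanged.
    """
    if SnowballStemmer is None or language not in SnowballStemmer.languages:
        return list(words)
    stemmer = SnowballStemmer(language)
    return [stemmer.stem(word) for word in words]


print(stem_words(["running", "runs"], "english"))
print(stem_words(["katzen"], "german"))
```
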

Lemmatisation is generally preferable to stemming, but it requires POS tagging, which is not currently possible for every language.

POS taggers

CoreNLP has POS taggers for English, Arabic, Chinese, French, German, and Spanish.

estNLTK has POS tagging for Estonian.

PyMorphy has POS tagging for Russian.

Lemmatisers

For English, corpkit uses CoreNLP's lemmatiser when doing dependency searching, and WordNet the rest of the time.

Even if no lemmatiser is currently available, it isn't too hard to write a lemmatiser, so long as the language has a POS tagger and a WordNet-style database.
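
A minimal sketch of that idea: given tokens that a POS tagger has already labelled, look each (word, tag) pair up in a lexicon and fall back to the lower-cased word form when there is no entry. The LEXICON dictionary here is a toy stand-in for a WordNet-style database, and the tag names and the lemmatise function are hypothetical.

```python
# Toy stand-in for a WordNet-style database: (word form, coarse POS) -> lemma.
LEXICON = {
    ("geese", "NOUN"): "goose",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("better", "ADJ"): "good",
}


def lemmatise(tagged_tokens, lexicon=LEXICON):
    """Map (word, pos) pairs to lemmas using a lexicon lookup.

    Unknown pairs fall back to the lower-cased word form.
    """
    lemmas = []
    for word, pos in tagged_tokens:
        key = (word.lower(), pos)
        lemmas.append(lexicon.get(key, word.lower()))
    return lemmas


# The POS tag disambiguates the two uses of "saw".
tagged = [("Geese", "NOUN"), ("saw", "VERB"), ("the", "DET"), ("saw", "NOUN")]
print(lemmatise(tagged))  # ['goose', 'see', 'the', 'saw']
```

The example shows why the POS tagger matters: without the tag, "saw" could not be mapped reliably to either "see" or "saw".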

Morphology analysis

Morphology analysis tools contain stemmers, lemmatisers and POS taggers, among other things.

Language	Type	Tool	Status	Comments
Estonian	NLTK-compatible	estNLTK	Not started
Russian	Custom	pyMorphy	Not started

Syntactic parsers

Only a few languages have high-quality syntactic parsers available. Below is a work-in-progress list of currently available parsers.

Language	Grammar(s)	Tool	Status	Comments
English	Dependency, constituency	Stanford CoreNLP	Implemented	Full support
Chinese	Dependency, constituency	Stanford CoreNLP	Not started	Needs custom models, which are available
Spanish	Dependency, constituency	Stanford CoreNLP	Not started	Needs custom models, which are available
German	Dependency, constituency	Stanford CoreNLP	Not started	Needs custom models with unknown availability
Arabic	Dependency, constituency	Stanford CoreNLP	Not started	Needs custom models with unknown availability
