
Multilanguage support

Daniel edited this page Nov 3, 2015 · 2 revisions

Tools for various languages

This page collects tools to be verified and used to support multiple languages for corpkit.

Tokenisers

Both NLTK and CoreNLP should work well with any European language for tokenisation. That said, language-specific tools are preferable when available and easy to implement.

NLTK's tokenisers are already shipped with corpkit, so language-specific tokenising for many languages should be easy enough to implement.

Japanese tokenisation looks doable via tinysegmenter.
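As a rough illustration of how per-language tokenisation could be wired together, the sketch below dispatches on a language name. The `tokenise` function and its regex fallback are assumptions made for this example and are not part of corpkit; it relies only on NLTK's `word_tokenize(text, language=...)` and tinysegmenter's `TinySegmenter().tokenize(text)`.

```python
import re


def tokenise(text, language="english"):
    """Tokenise text with a language-specific tool where one is available.

    Hypothetical helper for illustration only; not corpkit's actual API.
    """
    if language == "japanese":
        # tinysegmenter is a third-party package and must be installed separately.
        import tinysegmenter
        return tinysegmenter.TinySegmenter().tokenize(text)
    try:
        # NLTK's word_tokenize uses a language-specific Punkt sentence model.
        from nltk.tokenize import word_tokenize
        return word_tokenize(text, language=language)
    except (ImportError, LookupError):
        # Assumed fallback: NLTK or its Punkt data for this language is missing,
        # so split into runs of word characters and single punctuation marks.
        return re.findall(r"\w+|[^\w\s]", text)


print(tokenise("Guten Tag, Welt!", language="german"))
```

The fallback keeps the function usable when a language has no Punkt model, at the cost of cruder tokenisation.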

Stemmers

NLTK contains Snowball stemmers for many European languages:

  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Italian
  • Norwegian
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
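A minimal sketch of selecting one of these stemmers by language name, assuming NLTK is installed. The `get_stemmer` helper and the `PAGE_LANGUAGES` set are hypothetical names for this example; `SnowballStemmer` is NLTK's own class.

```python
# Languages listed on this page, written the way NLTK's SnowballStemmer names them.
PAGE_LANGUAGES = {
    "danish", "dutch", "english", "finnish", "french", "german", "hungarian",
    "italian", "norwegian", "portuguese", "romanian", "russian", "spanish",
    "swedish",
}


def get_stemmer(language):
    """Return a stem(word) callable, or None if no stemmer can be provided.

    Hypothetical helper for illustration only; not corpkit's actual API.
    """
    language = language.lower()
    if language not in PAGE_LANGUAGES:
        return None
    try:
        from nltk.stem.snowball import SnowballStemmer
    except ImportError:
        # NLTK is not installed, so no stemmer is available.
        return None
    return SnowballStemmer(language).stem


stem = get_stemmer("english")
if stem is not None:
    print([stem(word) for word in ["running", "houses"]])
```

Returning `None` rather than raising lets a caller skip stemming for unsupported languages.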

Lemmatisation is generally preferable to stemming, but it requires POS tagging, which is not currently possible for every language.

POS taggers

CoreNLP has POS taggers for English, Arabic, Chinese, French, German, and Spanish.

estNLTK has POS tagging for Estonian.

pyMorphy has POS tagging for Russian.
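One way to record this coverage in code is a simple lookup table, sketched below. The names `POS_TAGGERS`, `pos_tagger_for` and `untagged` are assumptions for illustration; the table only restates the tools listed above and does not call any of them.

```python
# Language -> POS tagging tool, as listed on this page.
POS_TAGGERS = {
    "english": "CoreNLP",
    "arabic": "CoreNLP",
    "chinese": "CoreNLP",
    "french": "CoreNLP",
    "german": "CoreNLP",
    "spanish": "CoreNLP",
    "estonian": "estNLTK",
    "russian": "pyMorphy",
}


def pos_tagger_for(language):
    """Return the name of the tool listed for a language, or None."""
    return POS_TAGGERS.get(language.strip().lower())


def untagged(languages):
    """Return the languages that have no POS tagger listed."""
    return [lang for lang in languages if pos_tagger_for(lang) is None]


print(pos_tagger_for("Russian"))
print(untagged(["German", "Danish"]))
```

A table like this makes it easy to see which languages would have to fall back on stemming because no tagger is listed.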

Lemmatisers

For English, corpkit uses CoreNLP's lemmatiser when doing dependency searching, and WordNet the rest of the time.

Even if no lemmatiser is currently available, it isn't too hard to write a lemmatiser, so long as the language has a POS tagger and a WordNet-style database.
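The idea can be sketched as a lookup keyed on word and POS tag. Everything below is a toy under stated assumptions: `make_lemmatiser`, `toy_tagger` and `TOY_LEXICON` are hypothetical stand-ins for a real POS tagger and a real WordNet-style database.

```python
def make_lemmatiser(tagger, lexicon):
    """Build a lemmatiser from a POS tagger and a (word, pos) -> lemma lexicon.

    tagger:  callable taking a list of tokens, returning (token, pos) pairs.
    lexicon: dict mapping (lowercased word, pos) to a lemma.
    """
    def lemmatise(tokens):
        # Words missing from the lexicon are returned unchanged.
        return [lexicon.get((word.lower(), pos), word)
                for word, pos in tagger(tokens)]
    return lemmatise


# Toy stand-ins for a real tagger and a WordNet-style database.
TOY_LEXICON = {("ran", "VERB"): "run", ("mice", "NOUN"): "mouse"}


def toy_tagger(tokens):
    tags = {"ran": "VERB", "mice": "NOUN"}
    return [(token, tags.get(token.lower(), "X")) for token in tokens]


lemmatise = make_lemmatiser(toy_tagger, TOY_LEXICON)
print(lemmatise(["Mice", "ran", "quickly"]))  # ['mouse', 'run', 'quickly']
```

Keying on the POS tag is what separates this from stemming: the same surface form can map to different lemmas depending on its tag.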

Morphology analysis

Morphology analysis tools contain stemmers, lemmatisers and POS taggers, among other things.

| Language | Type | Tool | Status | Comments |
| --- | --- | --- | --- | --- |
| Estonian | NLTK-compatible | estNLTK | Not started | |
| Russian | Custom | pyMorphy | Not started | |

Syntactic parsers

Only a few languages have high-quality syntactic parsers available. Below is a work-in-progress list of currently available parsers and their status.

| Language | Grammar(s) | Tool | Status | Comments |
| --- | --- | --- | --- | --- |
| English | Dependency, constituency | Stanford CoreNLP | Implemented | Full support |
| Chinese | Dependency, constituency | Stanford CoreNLP | Not started | Needs available custom models |
| Spanish | Dependency, constituency | Stanford CoreNLP | Not started | Needs available custom models |
| German | Dependency, constituency | Stanford CoreNLP | Not started | Needs custom models with unknown availability |
| Arabic | Dependency, constituency | Stanford CoreNLP | Not started | Needs custom models with unknown availability |