Multilanguage support
This page collects tools to be verified and used to support multiple languages for corpkit.
Both NLTK and CoreNLP should work well with any European language for tokenisation. That said, language-specific tools are preferable when available and easy to implement.
NLTK's tokenisers are already shipped with corpkit, so language-specific tokenising for many languages should be easy enough to implement.
Japanese tokenisation looks doable via tinysegmenter.
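As a rough sketch, per-language tokenisation could be dispatched along the following lines. It assumes NLTK's `word_tokenize` (which needs the Punkt models for the requested language to be downloaded) and the third-party `tinysegmenter` package for Japanese. The names `tokenise` and `simple_tokenise` are hypothetical and not part of corpkit, and the regex fallback is only a stand-in for when NLTK or its models are unavailable.

```python
import re


def simple_tokenise(text):
    """Fallback: runs of word characters, plus single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenise(text, lang="english"):
    """Tokenise text with a language-specific tool where one is available.

    `lang` uses NLTK's lowercase language names (e.g. "german").
    """
    if lang == "japanese":
        # Assumption: the tinysegmenter package is installed.
        import tinysegmenter
        return tinysegmenter.TinySegmenter().tokenize(text)
    try:
        from nltk.tokenize import word_tokenize
        # Raises LookupError if the Punkt models for `lang` are missing.
        return word_tokenize(text, language=lang)
    except (ImportError, LookupError):
        return simple_tokenise(text)


print(tokenise("Hallo, Welt!", lang="german"))
```

Keeping the fallback separate makes it easy to see which languages are actually getting language-specific treatment and which are not.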
NLTK includes Snowball stemmers for many European languages:
- Danish
- Dutch
- English
- Finnish
- French
- German
- Hungarian
- Italian
- Norwegian
- Portuguese
- Romanian
- Russian
- Spanish
- Swedish
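A minimal sketch of stemming with the Snowball stemmers for the languages listed above, assuming NLTK is installed. `SnowballStemmer` is NLTK's own class; the `stem_words` helper and the `SNOWBALL_LANGUAGES` set are hypothetical names used for illustration only, with the set simply copied from the list on this page.

```python
# Languages from the list above, in the lowercase form NLTK expects.
SNOWBALL_LANGUAGES = {
    "danish", "dutch", "english", "finnish", "french", "german",
    "hungarian", "italian", "norwegian", "portuguese", "romanian",
    "russian", "spanish", "swedish",
}


def stem_words(words, lang="english"):
    """Return the Snowball stem of each word for the given language."""
    if lang not in SNOWBALL_LANGUAGES:
        raise ValueError("No Snowball stemmer listed for %r" % lang)
    # Imported lazily so the language check works without NLTK installed.
    from nltk.stem.snowball import SnowballStemmer
    stemmer = SnowballStemmer(lang)
    return [stemmer.stem(word) for word in words]
```

For instance, `stem_words(["running", "cats"], "english")` should give `["run", "cat"]`. No corpus download is needed for the stemmers themselves.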
Lemmatisation is generally preferable to stemming, but it requires POS tagging, which is not currently possible for every language.
CoreNLP has POS taggers for English, Arabic, Chinese, French, German, and Spanish.
estNLTK has POS tagging for Estonian.
PyMorphy has POS tagging for Russian.
For English, corpkit uses CoreNLP's lemmatiser when doing dependency searching, and WordNet the rest of the time.
Even if no lemmatiser is currently available, it isn't too hard to write a lemmatiser, so long as the language has a POS tagger and a WordNet-style database.
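To illustrate why a POS tagger plus a WordNet-style database is enough, here is a toy sketch of a lookup-based lemmatiser. Everything in it is an assumption for illustration: the `lemmatise` and `simplify_tag` names, the Penn-style tags, and the three-entry `TOY_LEXICON` standing in for a real lexical database. Other languages would need their own tagset mapping.

```python
def simplify_tag(tag):
    """Map a Penn-style tag to a WordNet-style POS letter (None if unknown)."""
    return {"N": "n", "V": "v", "J": "a", "R": "r"}.get(tag[:1].upper())


def lemmatise(tagged_tokens, lexicon):
    """Look up each (word, tag) pair; fall back to the lowercased word."""
    lemmas = []
    for word, tag in tagged_tokens:
        key = (word.lower(), simplify_tag(tag))
        lemmas.append(lexicon.get(key, word.lower()))
    return lemmas


# Hypothetical stand-in for a WordNet-style database: (form, POS) -> lemma.
TOY_LEXICON = {
    ("mice", "n"): "mouse",
    ("ran", "v"): "run",
    ("better", "a"): "good",
}

tagged = [("Mice", "NNS"), ("ran", "VBD"), ("better", "JJR"), ("today", "NN")]
print(lemmatise(tagged, TOY_LEXICON))  # ['mouse', 'run', 'good', 'today']
```

The POS tag matters because the same surface form can map to different lemmas depending on its word class, which is why stemming alone cannot do this job.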
Morphology analysis tools contain stemmers, lemmatisers and POS taggers, among other things.
Language | Type | Tool | Status | Comments |
---|---|---|---|---|
Estonian | NLTK-compatible | estNLTK | Not started | |
Russian | Custom | PyMorphy | Not started | |
Only a few languages have high-quality syntactic parsers available. Below is a work-in-progress list of currently available parsers:
Language | Grammar(s) | Tool | Status | Comments |
---|---|---|---|---|
English | Dependency, constituency | Stanford CoreNLP | Implemented | Full support |
Chinese | Dependency, constituency | Stanford CoreNLP | Not started | Needs custom models, which are available |
Spanish | Dependency, constituency | Stanford CoreNLP | Not started | Needs custom models, which are available |
German | Dependency, constituency | Stanford CoreNLP | Not started | Needs custom models with unknown availability |
Arabic | Dependency, constituency | Stanford CoreNLP | Not started | Needs custom models with unknown availability |