Qnlp-ToolKit

This toolkit aims to gather all the tools and libraries available for NLP pre-processing at Qwant.

Installation

Automatic installation with the script:

install.sh [-h] [-g] [-p PREFIX]
-h help
-p PREFIX specify a prefix (default /usr/local/)

Tools available

qtokenizer

The very first step of any NLP pipeline is tokenization: a tokenizer segments text into tokens. A token can be a single word or a group of words treated as one unit. For example, "c'est-à-dire" and "aujourd'hui" are each a single token. Tokenization rules are available for all European languages, with specific rules written for French and English.
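As an illustration of the idea only, here is a minimal Python sketch of a rule-based tokenizer that keeps expressions such as "c'est-à-dire" whole. It is not the qtokenizer implementation: the `PROTECTED` list, the `tokenize` function and the regular expression are assumptions made for this example.

```python
import re

# Hypothetical list of expressions to keep as single tokens.
# The actual qtokenizer rules are not reproduced here.
PROTECTED = ["c'est-à-dire", "aujourd'hui"]

def tokenize(text, protected=PROTECTED):
    # Try the longest protected expressions first so they win over shorter ones.
    parts = [re.escape(p) for p in sorted(protected, key=len, reverse=True)]
    # Fallback rules: a run of word characters, or one non-space symbol.
    parts += [r"\w+", r"[^\w\s]"]
    return re.findall("|".join(parts), text, flags=re.IGNORECASE)

tokens = tokenize("Aujourd'hui, c'est-à-dire maintenant, il pleut.")
```

Without the protected list, the fallback rules would split "c'est" into three tokens ("c", "'", "est"), which is why such expressions need dedicated rules.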

qfiltering

Filters a bitext* according to sentence length, sentence length ratio and other heuristics.

*bitext: two corpora in different languages, aligned sentence by sentence, where the second is the translation of the first. Bitexts are used as training corpora for Neural Machine Translation.
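The two length-based criteria can be sketched in Python as follows. This is a rough illustration, not the qfiltering code: the function names, the whitespace-based token count and the default thresholds (`min_len`, `max_len`, `max_ratio`) are all assumptions chosen for the example.

```python
def keep_pair(src, tgt, min_len=1, max_len=100, max_ratio=3.0):
    # Sentence length is approximated by the number of whitespace-separated tokens.
    n_src, n_tgt = len(src.split()), len(tgt.split())
    # Reject empty sentences (this also avoids dividing by zero below).
    if min(n_src, n_tgt) == 0:
        return False
    # Length criterion: both sides must fall within [min_len, max_len].
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    # Ratio criterion: the longer side may be at most max_ratio times the shorter.
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio

def filter_bitext(src_lines, tgt_lines, **thresholds):
    # Keep only the aligned pairs that pass both criteria.
    return [(s, t) for s, t in zip(src_lines, tgt_lines) if keep_pair(s, t, **thresholds)]

src = ["the cat sleeps", "yes", ""]
tgt = ["le chat dort", "oui bien sûr c'est tout à fait exact", "vide"]
kept = filter_bitext(src, tgt)
```

In this example only the first pair is kept: the second has a length ratio of 8, and the third has an empty source sentence.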

TODO

Support tokenization of Asian, Indic, African and Arabic languages.

References

[Zalmout, 2017] Nasser Zalmout and Nizar Habash: "Optimizing Tokenization Choice for Machine Translation across Multiple Target Languages", In The Prague Bulletin of Mathematical Linguistics, Vol. 108, June 2017.

qStemmer

Many users need a stemmer for Information Retrieval. We include the Snowball stemmer, a state-of-the-art stemmer, for all available languages.

References

http://snowballstem.org/

qStopWords

In Information Retrieval, it is common to use a stop-word list. Stop-word lists are now part of the qnlp-toolkit for French, English, Portuguese, Italian, Spanish and Dutch. The lists come from the website https://www.ranks.nl
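For readers unfamiliar with the technique, stop-word filtering can be sketched in a few lines of Python. The `STOP_WORDS` set below is a tiny made-up sample, not the list shipped with the toolkit, and `remove_stop_words` is a hypothetical helper rather than the qStopWords interface.

```python
# Tiny illustrative sample; the real lists are much longer.
STOP_WORDS = {"the", "a", "of", "and", "is"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lower case so that "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in stop_words]

filtered = remove_stop_words(["The", "history", "of", "search", "engines"])
```

Here `filtered` contains only "history", "search" and "engines".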

References

https://www.ranks.nl

TODO

Add other European languages.
