Dutch Word2Vec Model

This repository contains a Word2Vec model trained on a large Dutch corpus consisting of social media messages and posts from Dutch news sites, blogs and forums. Pre-trained Dutch models can be hard to find online, and most of those that exist are trained on neatly written text such as Wikipedia or newspaper archives. On noisy text sources these models usually underperform, because social platforms use many out-of-vocabulary words and a short-message writing style. By training on a combination of long and short texts from multiple online sources, we have tried to create a model that is better suited to these types of text.

We are sharing this model to help research using Dutch data sources, so feel free to use it for your projects! If you would like a more up-to-date model, or a model with specific preprocessing steps, we'd be happy to help! Please contact the current maintainer (@Alexander Nieuwenhuijse) if you'd like to use this model for commercial products.

Installation

The model can be downloaded using the provided utils script:

$ git clone https://github.com/coosto/dutch-word-embeddings.git
$ cd dutch-word-embeddings
$ python3 utils.py download

Alternatively, download it directly as an asset from the release page.

Usage

To run a demo (using gensim) with the downloaded model, run the following command. It first prints some example analogies and then presents an interactive prompt for querying the nearest neighbours of a term.

$ python3 utils.py demo --model model.bin
Loading model...
Model loaded
...
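
To query the model from your own code rather than through the demo, a minimal sketch with gensim could look like this. It assumes the downloaded file is named `model.bin` as in the command above and is stored in the binary word2vec format (pass `binary=False` if it turns out to be the text format). `load_vectors` and `format_neighbours` are hypothetical helper names and are not part of `utils.py`.

```python
def load_vectors(path="model.bin", binary=True):
    """Load word vectors stored in word2vec format (binary format is an assumption)."""
    # gensim is a third-party dependency: pip install gensim
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=binary)


def format_neighbours(neighbours):
    """Render (term, score) pairs as a simple two-column table."""
    width = max(len(term) for term, _ in neighbours)
    return "\n".join(
        f"{term.ljust(width)} | {score:.4f}" for term, score in neighbours
    )


# Usage (needs gensim and the downloaded model file):
#   vectors = load_vectors("model.bin")
#   print(format_neighbours(vectors.most_similar("tomaat", topn=10)))
```

gensim's `most_similar` returns `(term, cosine similarity)` pairs sorted from most to least similar, so higher scores mean closer terms.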

Examples

If we query for "tomaat" (Dutch for tomato), the nearest neighbours are mostly Dutch words for vegetables:

Enter word or sentence: tomaat
     Term     |      Distance
-----------------------------------
paprika       |0.8452869653701782   (Bell pepper)
komkommer     |0.7932491898536682   (Cucumber)
courgette     |0.771128237247467    (Zucchini)
spinazie      |0.7697550058364868   (Spinach)
aubergine     |0.7646535634994507   (Eggplant)
rucola        |0.7631270885467529   (Arugula)
avocado       |0.7610437273979187   (Avocado)
radijs        |0.7554484605789185   (Radish)
tomaatjes     |0.7549760341644287   (Tomatoes)
tomaten       |0.7525067925453186   (Tomatoes)

Another interesting case is "lidl" (a supermarket chain), which returns a list of other supermarkets:

Enter word or sentence: lidl
      Term     |      Distance
------------------------------------
aldi           |0.9053274989128113
jumbo          |0.7790680527687073
albert_heijn   |0.7582876086235046
supermarkt     |0.7522222995758057
#lidl          |0.7320866584777832
albert_hein    |0.7161234617233276
albert_heyn    |0.7076783180236816
ekoplaza       |0.6897568702697754
vomar          |0.6830324530601501
nettorama      |0.6739382147789001

This example shows some common misspellings of the supermarket chain Albert Heijn. Errors of this type would not be in the model if it had been trained only on neatly written text, such as Dutch Wikipedia or newspaper articles. They are included in this model because they are made frequently on social media.

Data selection & Modeling

The model was trained on roughly 660 million individual messages: 624 million Dutch social media messages and 36 million posts from Dutch news sites, blogs and forums. All messages were published between 1 January 2017 and 31 December 2017. To improve the quality of the model, some basic preprocessing was applied to all messages, as described below.

Splitting sentences

Every message was split into separate sentences by searching for punctuation marks. A punctuation mark is treated as the end of a sentence unless it belongs to one of the following exceptions:

(Roman) Numerals
Single letters
Abbreviations
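
As an illustration of this rule, the sketch below splits a message at punctuation marks and skips a period that follows a numeral, a single letter or an abbreviation. It is not the code used to build the model: the abbreviation list, the regular expressions and the function names are assumptions made for the example.

```python
import re

# Hypothetical abbreviation list; the list used for the model is not given here.
ABBREVIATIONS = {"bijv", "o.a", "d.w.z", "dhr", "mevr", "nr", "etc"}
# Arabic numerals or upper-case Roman numerals.
NUMERAL = re.compile(r"^(?:\d+|[IVXLCDM]+)$")
# One or more sentence-final marks followed by whitespace or end of text.
BOUNDARY = re.compile(r"[.!?]+(?=\s|$)")


def is_exception(token):
    """Return True if a period directly after `token` should not end the sentence."""
    return (
        (len(token) == 1 and token.isalpha())  # single letter, e.g. an initial
        or bool(NUMERAL.match(token))          # (Roman) numeral
        or token.lower() in ABBREVIATIONS      # known abbreviation
    )


def split_sentences(message):
    """Split a message into sentences, honouring the exceptions above."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(message):
        words = message[start:match.start()].split()
        last_word = words[-1] if words else ""
        if match.group() == "." and is_exception(last_word):
            continue  # the period belongs to the token, keep reading
        sentences.append(message[start:match.end()].strip())
        start = match.end()
    tail = message[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

A side effect of this simple heuristic is that a sentence ending in a number or a single letter is merged with the next one.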

Clean up

Given that social media messages are usually not neatly formatted, some cleaning is applied to the text:

Converting to lowercase
Removing HTML/XML tags
Replacing URLs with the <url> token
Replacing @-mentions with the <mention> token
Removing punctuation marks, emojis and unwanted unicode characters
Removing sentences with fewer than 5 tokens

The URLs and @-mentions are replaced with placeholder tokens for both privacy and performance reasons. The last step removes very short sentences, since they usually do not provide enough relevant context to learn from.
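
The clean-up steps listed above could be sketched as follows. This is an approximation under stated assumptions, not the pipeline that produced the model: the regular expressions are simplified, `clean_sentence` is a hypothetical name, and keeping `#` and `_` is a guess based on vocabulary entries such as `#lidl` and `albert_heijn`.

```python
import re

TAG = re.compile(r"<[^>]+>")                  # HTML/XML tags
URL = re.compile(r"(?:https?://|www\.)\S+")   # simplified URL pattern
MENTION = re.compile(r"@\w+")                 # @-mentions
UNWANTED = re.compile(r"[^\w#]")              # punctuation, emojis, other symbols
PLACEHOLDERS = {"<url>", "<mention>"}
MIN_TOKENS = 5


def clean_sentence(sentence):
    """Return the cleaned sentence, or None if fewer than MIN_TOKENS tokens remain."""
    text = sentence.lower()
    text = TAG.sub(" ", text)                 # must run before placeholders are added
    text = URL.sub(" <url> ", text)
    text = MENTION.sub(" <mention> ", text)
    tokens = []
    for token in text.split():
        if token in PLACEHOLDERS:
            tokens.append(token)              # keep placeholder tokens intact
        else:
            # Replace unwanted characters by spaces, which may split the token.
            tokens.extend(UNWANTED.sub(" ", token).split())
    if len(tokens) < MIN_TOKENS:
        return None
    return " ".join(tokens)
```

Tags are stripped before the placeholders are inserted so that `<url>` and `<mention>` are not removed as if they were markup.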

Deduplication

Lastly, duplicate training examples are removed from the entire set of cleaned-up sentences, and the remaining sentences are shuffled into a random order. This results in a training set of 490 million unique preprocessed sentences.
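
On a small scale, deduplication and shuffling can be sketched in a few lines. The function name and the `seed` parameter are assumptions for illustration. For hundreds of millions of sentences an in-memory approach like this is unlikely to be practical, and a disk-based method would be needed instead.

```python
import random


def deduplicate_and_shuffle(sentences, seed=None):
    """Drop exact duplicate sentences, then shuffle the remainder."""
    unique = list(dict.fromkeys(sentences))  # keeps the first occurrence of each
    random.Random(seed).shuffle(unique)      # pass a seed for a reproducible order
    return unique
```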

Training

The model was trained using Google's Word2Vec implementation. We selected the Continuous Bag-of-Words (CBOW) model and generated vectors of size 300. The min-count parameter was chosen based on manual inspection of the vocabulary and to limit the size of the model.

word2vec -train input.txt -output model.bin -size 300 -window 10 -negative 10 -hs 0 -cbow 1 -sample 1e-5 -iter 5 -min-count 300

The resulting model contains 250479 vectors and was not pruned or altered in any way.
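
For readers who would rather train a comparable model with gensim, the flags in the command above map roughly onto gensim (4.0 or later) keyword arguments as sketched below. This is not how the released model was built: it was trained with Google's implementation, and other defaults, such as the initial learning rate, may differ between the two tools. The `train` function, its `workers` argument and the binary output format are assumptions.

```python
# Command-line flag -> gensim keyword argument, using the values from the command.
GENSIM_PARAMS = {
    "vector_size": 300,  # -size 300
    "window": 10,        # -window 10
    "negative": 10,      # -negative 10
    "hs": 0,             # -hs 0
    "sg": 0,             # -cbow 1 (sg=0 selects CBOW in gensim)
    "sample": 1e-5,      # -sample 1e-5
    "epochs": 5,         # -iter 5
    "min_count": 300,    # -min-count 300
}


def train(corpus_path="input.txt", output_path="model.bin", workers=4):
    """Train on a file with one whitespace-tokenised sentence per line."""
    # gensim is a third-party dependency: pip install gensim
    from gensim.models import Word2Vec

    model = Word2Vec(corpus_file=corpus_path, workers=workers, **GENSIM_PARAMS)
    # Saving in binary word2vec format is an assumption about the desired output.
    model.wv.save_word2vec_format(output_path, binary=True)
    return model
```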

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Please contact the current maintainer (@Alexander Nieuwenhuijse) if you wish to use this model with a different license.
