dutchembeddings

Repository for the word embeddings described in Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource, presented at LREC 2016.

All embeddings are released under the CC-BY-SA-4.0 license.

The software is released under the GNU GPL 2.0.

These embeddings have been created with the support of Textgain®.

Embeddings

To download the embeddings, please click any of the links in the following table. In almost all cases, the 320-dimensional embeddings outperform the 160-dimensional embeddings.

Corpus	160 dimensions	320 dimensions
Roularta	link (mirror)	link (mirror)
Wikipedia	link (mirror)	link (mirror)
Sonar500	link (mirror)	link (mirror)
Combined	link (mirror)	link (mirror)
COW	-	small (mirror), big (mirror)

See below for a usage explanation.

Citing

If you use any of the resources from this paper, please cite our paper, as follows:

@InProceedings{tulkens2016evaluating,
  author = {Stephan Tulkens and Chris Emmery and Walter Daelemans},
  title = {Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portorož, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  language = {english}
 }

Please also consider citing the corpora of the embeddings you use. Without the people who made the corpora, the embeddings could never have been created.

Usage

The embeddings are currently provided as .txt files containing vectors in word2vec format, which is structured as follows:

The first line contains the vocabulary size and the size of the vectors, separated by a space.

Ex: 50000 320

Each subsequent line contains the vector data for a single word, as a space-delimited string. The first item on each line is the word itself; the n following items are numbers that together form the vector of length n. Because the items are read as strings, the numbers should be converted to floating point numbers.

Ex: hond 0.2 -0.542 0.253 etc.
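The format can also be read without any library. The following is a minimal sketch, not the loader used by gensim or reach; the function name `read_word2vec_text` and the three-word sample are made up for illustration. It takes the vector length from the rows themselves and only checks that this length matches one of the two numbers in the header.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec-style text into a dict mapping each word to a list of floats."""
    # The header holds two integers: the vocabulary size and the vector length.
    declared = {int(item) for item in handle.readline().split()}
    vectors = {}
    for line in handle:
        items = line.split()
        if not items:
            continue  # skip blank lines
        # The first item is the word, the remaining items are the vector.
        word, numbers = items[0], items[1:]
        vectors[word] = [float(number) for number in numbers]
    # Every vector must have the same length, and the header must announce it.
    lengths = {len(vector) for vector in vectors.values()}
    if len(lengths) != 1 or not lengths <= declared:
        raise ValueError("vector lengths do not match the header")
    return vectors


# Hypothetical sample: three words with three dimensions each.
sample = io.StringIO(
    "3 3\n"
    "hond 0.2 -0.542 0.253\n"
    "kat 0.1 0.3 -0.7\n"
    "vis -0.4 0.05 0.6\n"
)
vectors = read_word2vec_text(sample)
print(vectors["hond"])
```

For the released files, pass an open file handle instead of the `io.StringIO` sample. Keeping all vectors in plain Python lists is slow and memory-hungry for large vocabularies, so this is mainly useful for inspecting a file.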

If you use Python, these files can be loaded with gensim or reach, as follows:

# Gensim
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('path/to/embedding-file')
katvec = model['kat']
model.most_similar('kat')

# Reach
from reach import Reach

r = Reach.load('path/to/embedding-file')
katvec = r['kat']
r.most_similar('kat')
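In gensim, `most_similar` ranks words by the cosine similarity between their vectors. The sketch below illustrates that ranking in plain Python on made-up two-dimensional vectors; the helper names `cosine` and `rank_neighbours` and the toy values are assumptions for illustration, not part of either library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_neighbours(word, vectors, topn=3):
    """Return up to topn (word, similarity) pairs, most similar first."""
    query = vectors[word]
    scores = [
        (other, cosine(query, vector))
        for other, vector in vectors.items()
        if other != word  # the query word itself is excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for this example.
toy = {
    "kat": [1.0, 0.0],
    "poes": [0.9, 0.1],
    "auto": [0.0, 1.0],
}
ranking = rank_neighbours("kat", toy)
print(ranking)
```

Here "poes" ranks above "auto" because its vector points in nearly the same direction as the vector for "kat".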

Relationship dataset

If you want to test the quality of your embeddings, you can use the relation.py script. This script takes a .txt file of predicates and creates a dataset that is used for evaluation.

This currently only works with the gensim word2vec models or the SPPMI model, as defined above.

Example:

from gensim.models import KeyedVectors

from relation import Relation

# Load the predicates.
rel = Relation('data/question-words.txt')

# Load a word2vec model.
model = KeyedVectors.load_word2vec_format('path/to/embedding-file')

# Test the model.
rel.test_model(model)
