Ready to use Spanish Word2Vec embeddings created from >18B chars and >3B words

aitoralmeida/spanish_word2vec


Ready to use gensim Word2Vec embedding models for the Spanish language. The models were created with a context window of +/- 5 words, discarding words with fewer than 5 occurrences and producing a 400-dimensional vector for each word. The text used to create the embeddings was gathered from news, Wikipedia, the Spanish BOE (official state gazette), web crawling and open literary sources. The training text contains a total of 3,257,329,900 words and 18,852,481,207 characters.
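For reference, these hyperparameters map onto a gensim training call roughly as follows. This is only a sketch: the repository distributes finished models rather than training code, and the corpus file name and gensim version shown here are assumptions.

from gensim.models.word2vec import LineSentence, Word2Vec

# 'corpus.txt' is a hypothetical tokenised corpus (one sentence per line);
# the actual training text (news, Wikipedia, BOE, web crawls, literature)
# is not distributed with this repository.
sentences = LineSentence('corpus.txt')

model = Word2Vec(
    sentences,
    vector_size=400,  # 400 dimensions per word ('size' in gensim < 4.0)
    window=5,         # +/- 5 word context window
    min_count=5,      # discard words with fewer than 5 occurrences
)

model.save('complete.model')  # full model
model.wv.save('complete.kv')  # lightweight KeyedVectors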

The models are shared at Zenodo: https://zenodo.org/record/1410403

We support two types of models: gensim full models (complete_model.zip) and KeyedVectors (keyed_vectors.zip). You can check the differences between them at the following URL: https://radimrehurek.com/gensim/models/keyedvectors.html

To load the full model use:

from gensim.models import Word2Vec

model = Word2Vec.load("complete.model")

To load the KeyedVectors use:

from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load('complete.kv', mmap='r')
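Both objects answer the usual gensim similarity queries (the full model through its .wv attribute). The example words below are only illustrative and assume they appear in the vocabulary:

# Nearest neighbours of a word
print(word_vectors.most_similar('madrid', topn=5))

# Cosine similarity between two words
print(word_vectors.similarity('rey', 'reina'))

# The full model exposes the same interface through model.wv
print(model.wv.most_similar('madrid', topn=5))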

If you use our models in your programs or research, please use the following citations:

Aitor Almeida, & Aritz Bilbao. (2018). Spanish 3B words Word2Vec Embeddings (Version 1.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.1410403

Bilbao-Jayo, A., & Almeida, A. (2018). Automatic political discourse analysis with multi-scale convolutional neural networks and contextual data. International Journal of Distributed Sensor Networks, 14(11), 1550147718811827.

Other datasets

Take a look at our other datasets:
