Spanish Sentence Embeddings computed from large corpora using sent2vec.


BotCenter/spanish-sent2vec

Spanish Sentence Embeddings

Spanish Sentence Embeddings trained using sent2vec on the Spanish Unannotated Corpora.

Pre-Processing

The data had already been preprocessed in the Spanish Unannotated Corpora repository: text was lowercased, repeated spaces were collapsed, and URLs were removed, among other steps. We also applied the punctuation-splitting script included in that repository.

With that tokenization, the 2.6B-word corpus yields 3.4B tokens.
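
The following is a minimal sketch of preprocessing in this style, useful for preparing new sentences before embedding them. It is not the script from the Spanish Unannotated Corpora repository: the regular expressions and the `preprocess` name are assumptions made for illustration, and the original script may treat some characters differently. It also shows why the token count exceeds the word count, since each punctuation mark becomes a token of its own.

```python
import re

# Hypothetical patterns; the original preprocessing script may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_RE = re.compile(r"([^\w\s])")
SPACE_RE = re.compile(r"\s+")


def preprocess(text: str) -> str:
    """Lowercase, drop URLs, split punctuation into tokens, collapse spaces."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = PUNCT_RE.sub(r" \1 ", text)  # surround each punctuation mark with spaces
    return SPACE_RE.sub(" ", text).strip()


raw = "Hola,   ¿cómo estás? Visita https://example.com hoy."
clean = preprocess(raw)
words = raw.split()     # whitespace-separated words in the raw text
tokens = clean.split()  # tokens after punctuation splitting
print(clean)
print(len(words), "words ->", len(tokens), "tokens")
```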

sent2vec Parameters

We used the default sent2vec parameters to train a unigram + bigram model.
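
To illustrate what a unigram + bigram model composes, here is a simplified sketch: a sentence is represented by the average of the vectors of its unigrams and bigrams, as described in the referenced paper. The functions `ngram_features` and `average_embedding` and the toy lookup table are hypothetical. The real sent2vec implementation hashes bigrams into buckets and learns the vectors during training, none of which is reproduced here.

```python
from typing import Dict, List


def ngram_features(tokens: List[str], max_n: int = 2) -> List[str]:
    """Return all n-grams up to max_n (unigrams first, then bigrams)."""
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


def average_embedding(feats: List[str], table: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of the features found in the table."""
    known = [table[f] for f in feats if f in table]
    if not known:
        return [0.0] * dim
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# Toy 2-dimensional table, for illustration only.
toy_table = {"el": [1.0, 0.0], "gato": [0.0, 1.0], "el gato": [1.0, 1.0]}
feats = ngram_features(["el", "gato"])
sentence_vec = average_embedding(feats, toy_table, dim=2)
print(feats, sentence_vec)
```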

Download

Spanish sent2vec (700 dim sentence embeddings, unigram+bigram model, 14.4 GB)
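
Once downloaded, the model can presumably be loaded with the Python bindings from the sent2vec repository. The sketch below assumes those bindings are installed and that `embed_sentence` returns a 2-D array with one row per sentence. The file name `spanish_sent2vec.bin` is a placeholder, not the actual name of the download, and the helper functions are hypothetical.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def load_model(path: str):
    """Load a sent2vec model (assumes the sent2vec Python bindings are installed)."""
    import sent2vec  # imported lazily so the helpers above work without it
    model = sent2vec.Sent2vecModel()
    model.load_model(path)
    return model


def sentence_similarity(model, a: str, b: str) -> float:
    """Embed two preprocessed sentences and compare them."""
    # Assumption: embed_sentence returns shape (1, dim), so take row 0.
    return cosine_similarity(model.embed_sentence(a)[0], model.embed_sentence(b)[0])


# Example usage (placeholder path; requires the 14.4 GB model file):
# model = load_model("spanish_sent2vec.bin")
# print(sentence_similarity(model, "el gato duerme .", "el perro duerme ."))
```

Input sentences should be lowercased and have punctuation split into separate tokens, matching the preprocessing applied to the training corpus.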

References

Matteo Pagliardini, Prakhar Gupta, Martin Jaggi. Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. NAACL 2018.
