Spanish Sentence Embeddings

Spanish Sentence Embeddings trained using sent2vec on the Spanish Unannotated Corpora.

Pre-Processing

The data had already been preprocessed in Spanish Unannotated Corpora: it was lowercased, and repeated spaces, URLs, and other noise were removed. We also applied the punctuation-splitting script included in that repository.

Under that tokenization, the 2.6B-word corpus yields 3.4B tokens.
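
As a rough illustration of the steps described above, here is a minimal Python sketch. It is not the script from the Spanish Unannotated Corpora repository. The function name `preprocess` and the regular expressions are assumptions made for this example, and the original pipeline may handle URLs and punctuation differently.

```python
import re

# Assumed patterns; the original preprocessing script may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_RE = re.compile(r"([^\w\s])")  # any single non-word, non-space character
SPACE_RE = re.compile(r"\s+")


def preprocess(line: str) -> str:
    """Lowercase, drop URLs, split punctuation into tokens, collapse spaces."""
    line = line.lower()
    line = URL_RE.sub(" ", line)
    # Surround each punctuation mark with spaces so it becomes its own token.
    line = PUNCT_RE.sub(r" \1 ", line)
    line = SPACE_RE.sub(" ", line)
    return line.strip()


if __name__ == "__main__":
    text = "Hola,  mundo! Visita https://example.com hoy."
    print(preprocess(text))
    print(len(preprocess("Hola, mundo!").split()))
```

Splitting punctuation into separate tokens is why the token count exceeds the word count: in this sketch, the two-word string "Hola, mundo!" becomes four tokens.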

sent2vec Parameters

We used sent2vec's default parameters to train a unigram + bigram model.

Download

Spanish sent2vec (700 dim sentence embeddings, unigram+bigram model, 14.4 GB)
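
Once downloaded, the model can presumably be loaded with the Python module that ships with sent2vec. The sketch below assumes that module is installed. The file name `spanish_sent2vec.bin` and the helper `embed` are hypothetical placeholders, so substitute the actual path of the downloaded file. Input sentences should probably be lowercased and punctuation-split in the same way as the training corpus.

```python
from typing import List

# Hypothetical file name; replace with the path of the downloaded model.
MODEL_PATH = "spanish_sent2vec.bin"


def embed(sentences: List[str], model_path: str = MODEL_PATH):
    """Return one embedding row per sentence using a sent2vec model."""
    # Imported lazily so this file can be imported without sent2vec installed.
    import sent2vec

    model = sent2vec.Sent2vecModel()
    model.load_model(model_path)  # loads the full model into memory
    return model.embed_sentences(sentences)


if __name__ == "__main__":
    vectors = embed(["hola , mundo !", "¿ cómo estás ?"])
    print(vectors.shape)
```

Given the 14.4 GB model size, loading requires a machine with enough RAM to hold the whole file.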

References

Matteo Pagliardini, Prakhar Gupta, Martin Jaggi. Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. NAACL 2018.