Spanish Sentence Embeddings

Spanish sentence embeddings trained with sent2vec on the Spanish Unannotated Corpora.

Pre-Processing

The data was already preprocessed in the Spanish Unannotated Corpora repository: text was lowercased, multiple spaces were collapsed, and URLs were removed, among other cleanup steps. We also applied the punctuation-splitting script included in that repository.
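The cleanup steps described above can be approximated in a few lines of Python. This is only an illustrative sketch, not the script from the Spanish Unannotated Corpora repository: the function name `preprocess` and the exact regular expressions are assumptions, and the "other" cleanup steps mentioned above are not covered.

```python
import re

# Assumed patterns; the original repository's rules may differ.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # http(s) links and bare www. links
PUNCT_RE = re.compile(r"([^\w\s])")              # any single non-word, non-space character
SPACES_RE = re.compile(r"\s+")                   # runs of whitespace


def preprocess(text: str) -> str:
    """Lowercase, drop URLs, split punctuation into separate tokens, collapse spaces."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    # Surround each punctuation mark with spaces so it becomes its own token.
    text = PUNCT_RE.sub(r" \1 ", text)
    text = SPACES_RE.sub(" ", text)
    return text.strip()


if __name__ == "__main__":
    print(preprocess("Hola,  Mundo! Visita https://example.com ahora."))
```

Because `\w` is Unicode-aware in Python 3, accented letters and `ñ` stay inside words, while marks such as `¿` and `¡` are split off as separate tokens.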

With that tokenization, the 2.6B-word corpus yields 3.4B tokens.

sent2vec Parameters

We trained a unigram + bigram model using the default sent2vec parameters.

Download

Spanish sent2vec (700 dim sentence embeddings, unigram+bigram model, 14.4 GB)
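Once downloaded, the model can be loaded through the Python module that ships with sent2vec. The sketch below assumes that module is installed and that the downloaded file is a sent2vec binary model; the helper names `embed_sentences` and `cosine_similarity` and the path `model.bin` are hypothetical and only serve the illustration. Input sentences should be lowercased and tokenized the same way as the training data.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def embed_sentences(model_path, sentences):
    """Load a sent2vec model and return one embedding row per sentence."""
    # Imported here so the similarity helper stays usable without sent2vec installed.
    import sent2vec

    model = sent2vec.Sent2vecModel()
    model.load_model(model_path)
    return model.embed_sentences(sentences)


# Example usage (requires the downloaded model; "model.bin" is a placeholder path):
# vectors = embed_sentences("model.bin", ["el gato duerme .", "un gato está durmiendo ."])
# print(cosine_similarity(vectors[0], vectors[1]))
```

Loading a model of this size needs enough free memory to hold it, so it is worth loading it once and reusing the object across calls rather than reloading per sentence.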

References

Matteo Pagliardini, Prakhar Gupta, Martin Jaggi. Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. NAACL 2018.
