Romanian Word Embeddings. Here you can find pre-trained corpora of word embeddings. Current methods: CBOW, Skip-Gram, Fast-Text (from Gensim library). The .vec and .model files are available for download (all in one archive).

BlackKakapo/Romanian-Word-Embeddings

Romanian Word Embeddings

These vectors were trained with three different methods (CBOW, Skip-Gram, FastText) from the Gensim library. The dataset is a collection of text taken from the internet (news, comments, blogs, etc.).

Please note that I do not claim these vectors are the best for the Romanian language!

About Dataset

The text is preprocessed and cleaned.

  • 784,150,193 - Sentences (one of the filtering rules is that a sentence must be longer than 35 characters, including spaces)
  • 11,628,712,127 - Words
  • 1,311,442 - Unique words
  • 14.82 - Average number of words per sentence
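
The length filter and the statistics above can be illustrated with a short sketch. This is not the author's preprocessing code: the function names, the whitespace tokenization, and the lowercasing are assumptions made for illustration. Only the 35-character threshold comes from the description.

```python
from collections import Counter

MIN_CHARS = 35  # threshold from the dataset description (spaces included)


def keep_sentence(sentence: str) -> bool:
    """Keep only sentences longer than MIN_CHARS characters."""
    return len(sentence) > MIN_CHARS


def corpus_stats(sentences):
    """Count sentences, words, unique words and average sentence length.

    Tokenization is a plain whitespace split on lowercased text (an assumption).
    """
    kept = [s for s in sentences if keep_sentence(s)]
    counts = Counter(word for s in kept for word in s.lower().split())
    n_words = sum(counts.values())
    return {
        "sentences": len(kept),
        "words": n_words,
        "unique_words": len(counts),
        "avg_words_per_sentence": n_words / len(kept) if kept else 0.0,
    }


sample = [
    "Viena este capitala Austriei, iar Amsterdam este capitala Olandei.",
    "Prea scurt.",  # dropped: 35 characters or fewer
    "Madrid este capitala Spaniei, iar Lisabona este capitala Portugaliei.",
]
stats = corpus_stats(sample)
print(stats)
```

On a corpus of this size the sentences would be streamed from disk rather than held in a list; the list here only keeps the sketch short.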

| Method | Vector size | Min_count | Window | SET1 precision | SET2 precision | Download | Archive size |
|---|---|---|---|---|---|---|---|
| CBOW | 300 | 25 | 5 | 14% | 23% | Download | 4.2 GB |
| CBOW | 300 | 25 | 15 | 66% | 91% | Download | 4.2 GB |
| CBOW | 300 | 25 | 20 | 67% | 93% | Download | 4.2 GB |
| Skip-Gram | 300 | 25 | 5 | 71% | 92% | Download | 4.2 GB |
| Skip-Gram | 300 | 25 | 15 | 79% | 98% | Download | 4.2 GB |
| Skip-Gram | 300 | 25 | 20 | 79% | 98% | Download | 4.2 GB |
| FastText | 300 | 25 | 5 | 66% | 95% | Download | 6.29 GB |
| FastText | 300 | 25 | 15 | 72% | 97% | Download | 6.29 GB |
| FastText | 300 | 25 | 20 | 74% | 98% | Download | 6.29 GB |
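
The table's hyperparameters map onto Gensim roughly as follows. This is a sketch, not the author's training script: it assumes Gensim 4.x parameter names (`vector_size`; older releases call it `size`), and the helper names `training_params` and `train` are hypothetical. Settings the table does not list (epochs, negative sampling, the FastText training mode) are left at Gensim's defaults.

```python
def training_params(method: str, window: int) -> dict:
    """Build Gensim keyword arguments for one row of the table."""
    if method not in {"CBOW", "Skip-Gram", "FastText"}:
        raise ValueError(f"unknown method: {method}")
    params = {"vector_size": 300, "min_count": 25, "window": window}
    if method != "FastText":
        # Word2Vec: sg=0 selects CBOW, sg=1 selects Skip-Gram.
        params["sg"] = 1 if method == "Skip-Gram" else 0
    return params


def train(method: str, sentences, window: int):
    """Train a model on an iterable of tokenized sentences (lists of str)."""
    # Imported here so the parameter helper works without Gensim installed.
    from gensim.models import FastText, Word2Vec

    model_cls = FastText if method == "FastText" else Word2Vec
    return model_cls(sentences=sentences, **training_params(method, window))


print(training_params("Skip-Gram", 15))
```

With `min_count=25`, any word that occurs fewer than 25 times is dropped from the vocabulary, so trying `train` on a toy corpus requires a much lower value.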

SET1 and SET2 are sets of analogy questions on countries and their capitals, created by the Romanian Academy (which also publishes its own vectors; see CoRoLa).

Example:
  • austria - vienna + amsterdam = netherlands (eng).
  • austria - viena + amsterdam = olanda (rom).

SET1 - 1892 analogies for European countries and their capitals

SET2 - 462 analogies for European countries and their capitals (subset of SET1)
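
The analogy test can be sketched with vector arithmetic and cosine similarity. The three-dimensional vectors below are invented toy values, not taken from the released models, and the scoring rule (top-1 match, excluding the three query words) is an assumption about how precision was computed.

```python
import math

# Toy vectors, invented for illustration only.
VECTORS = {
    "austria": [1.0, 0.0, 1.0],
    "viena": [1.0, 0.0, 0.0],
    "olanda": [0.0, 1.0, 1.0],
    "amsterdam": [0.0, 1.0, 0.0],
    "spania": [1.0, 1.0, 1.0],
    "madrid": [1.0, 1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors=VECTORS):
    """Return the word closest to a - b + c, excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def precision(questions, vectors=VECTORS):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(
        solve_analogy(a, b, c, vectors) == expected
        for a, b, c, expected in questions
    )
    return correct / len(questions)


questions = [
    ("austria", "viena", "amsterdam", "olanda"),
    ("spania", "madrid", "viena", "austria"),
    ("olanda", "amsterdam", "madrid", "spania"),
]
print(solve_analogy("austria", "viena", "amsterdam"))
print(precision(questions))
```

On a loaded model, `model.wv.most_similar(positive=['austria', 'amsterdam'], negative=['viena'])` performs a comparable query, working on length-normalized vectors.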

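from gensim.models import Word2Vec

# Load one of the downloaded .model files
model = Word2Vec.load('SG_300_20_15.model')

# Query word; it must be in the model's vocabulary (replace with your own)
word = 'spania'

# Nearest neighbours as (word, cosine similarity) pairs
result_query = model.wv.most_similar(word)

for result in result_query:
    print(result)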
    
In: spania
Out:
('italia', 0.8326004147529602)
('portugalia', 0.8248708248138428)
('castilla-leon', 0.7556794285774231)
('belgia', 0.7364105582237244)
('argentina', 0.7281147241592407)
('spania-', 0.727818489074707)
('brazilia', 0.7213218212127686)
('olanda', 0.6885160207748413)
('germania', 0.6858677864074707)
('anglia', 0.6833646297454834)

In: ilie
Out:
('adrian', 0.7264171242713928)
('andrei', 0.7138616442680359)
('valentin', 0.6969763040542603)
('dumitru', 0.673446536064148)
('llie', 0.6705739498138428)
('nicolae', 0.6643682718276978)
('vasile', 0.6577962636947632)
('marian', 0.6359540224075317)
('constantin', 0.6084895133972168)
('nicu', 0.6063842177391052)

In: ruble
Out: 
('grivne', 0.795340359210968)
('hrivne', 0.7273794412612915)
('copeici', 0.7101361155509949)
('dolari', 0.6791703104972839)
('yuani', 0.6516059041023254)
('rublă', 0.6284367442131042)
('kopeici', 0.6272767186164856)
('zloţi', 0.6005615592002869)
('usd', 0.5963905453681946)
('piaştri', 0.5942535996437073)

In: fizician
Out:
('matematician', 0.6948787569999695)
('savant', 0.6890636086463928)
('fizicianul', 0.6560385823249817)
('inventator', 0.653334379196167)
('astrofizician', 0.644870936870575)
('chimist', 0.6142269372940063)
('astronom', 0.6096892356872559)
('filozof', 0.604558527469635)
('teoretician', 0.6006152629852295)
('cercetător', 0.5948916673660278)
