This repository has been archived by the owner on Dec 19, 2018. It is now read-only.

Pre-trained Word Vector Models of 30+ Languages

Astro36/word-vector-models

Word Vector Models

Pre-trained Word Vector Models of 30+ Languages

This project has two purposes. First, I'd like to share some of my experience with NLP tasks such as segmentation and word vectors. Second, and more importantly, some people are probably searching for pre-trained word vector models for non-English languages. Alas! English has gained much more attention than any other language. Check this to see how easily you can get a variety of pre-trained English word vectors. I think it's time to turn our eyes to a multilingual version of this.

Requirements

Python 3 is required.

See requirements

Language Supports

If you work with one of the languages below, you also need the corresponding additional library.

  • Chinese: jieba
  • Japanese: mecab
  • Korean: konlpy
  • Thai: pythai
  • Vietnamese: pyvi
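As a rough illustration of how these libraries might be wired in, the sketch below dispatches on the language code and falls back to whitespace splitting for languages that need no segmenter. The names `SEGMENTERS` and `tokenize` are hypothetical and are not taken from this repository. The library calls shown (`jieba.lcut`, `MeCab.Tagger("-Owakati")`, `konlpy.tag.Okt`, `pyvi.ViTokenizer.tokenize`) are the usual entry points of those packages, but which tokenizer classes and options the scripts here actually use is an assumption. Thai is left unimplemented because the exact pythai call is not shown in this README.

```python
from typing import Callable, Dict, List


def _segment_zh(text: str) -> List[str]:
    import jieba  # third-party, imported lazily so other languages work without it
    return jieba.lcut(text)


def _segment_ja(text: str) -> List[str]:
    import MeCab  # third-party MeCab binding
    # "-Owakati" asks MeCab for space-separated surface forms.
    return MeCab.Tagger("-Owakati").parse(text).split()


def _segment_ko(text: str) -> List[str]:
    from konlpy.tag import Okt  # assumption: any konlpy tagger with .morphs() would do
    return Okt().morphs(text)


def _segment_vi(text: str) -> List[str]:
    from pyvi import ViTokenizer
    # pyvi joins the syllables of one word with "_" and separates words by spaces.
    return ViTokenizer.tokenize(text).split()


# Hypothetical registry: language code -> segmenter function.
SEGMENTERS: Dict[str, Callable[[str], List[str]]] = {
    "zh": _segment_zh,
    "ja": _segment_ja,
    "ko": _segment_ko,
    "vi": _segment_vi,
}


def tokenize(text: str, lcode: str) -> List[str]:
    """Split text into tokens, using a language-specific segmenter if one is registered."""
    if lcode == "th":
        # Thai needs a segmenter too (pythai above); not sketched here.
        raise NotImplementedError("plug in a Thai segmenter such as pythai")
    segmenter = SEGMENTERS.get(lcode)
    if segmenter is None:
        return text.split()  # whitespace-delimited languages
    return segmenter(text)
```

The lazy imports mean that only the library for the language you actually process has to be installed.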

Word Embeddings

If you train your own model, you have to install the library below.

Background / References

  • Check this to know what word embedding is.
  • Check this to quickly get a picture of Word2vec.
  • Check this to install fastText.
  • Watch this to really understand what's happening under the hood of Word2vec.
  • Go get various English word vectors here if needed.

Work Flow

  • STEP 1-1. Download the Wikipedia database backup dump of the language you want. (For example, for the English wiki, go to https://dumps.wikimedia.org/enwiki/, click the latest timestamp, and download the enwiki-YYYYMMDD-pages-articles-multistream.xml.bz2 file.)
  • STEP 1-2. Install the required packages.
  • STEP 2. Extract the running text to the ./data directory.
  • STEP 3. Run ./src/build_corpus.py.
python build_corpus.py --lcode=ko
  • STEP 4-1. Run ./src/train_word2vec.py to get Word2Vec word vectors:
python train_word2vec.py --lcode=en --vector_size=300 --window_size=5 --vocab_size=50000 --num_negative=5
  • STEP 4-2. Run ./script/fasttext.sh to get fastText word vectors.
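For STEP 1-1, the dump location can be assembled from the language code and the timestamp. The sketch below follows the file name pattern quoted above and assumes that the timestamp page lives at `<lcode>wiki/<YYYYMMDD>/` under the dumps root. The helper name `dump_url` is hypothetical, and the date in the usage line is only a placeholder, not a dump that is guaranteed to exist.

```python
import re

DUMPS_ROOT = "https://dumps.wikimedia.org"


def dump_url(lcode: str, date: str) -> str:
    """Build the URL of a pages-articles-multistream dump.

    lcode: language code such as "en" or "ko".
    date:  dump timestamp in YYYYMMDD form, as listed on the dumps site.
    """
    if not re.fullmatch(r"\d{8}", date):
        raise ValueError("date must look like YYYYMMDD")
    wiki = f"{lcode}wiki"
    filename = f"{wiki}-{date}-pages-articles-multistream.xml.bz2"
    # Assumed layout: <root>/<wiki>/<date>/<filename>
    return f"{DUMPS_ROOT}/{wiki}/{date}/{filename}"


if __name__ == "__main__":
    # Placeholder date; pick a real timestamp from the dumps listing.
    print(dump_url("ko", "20180101"))
```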

Pre-trained models

Two types of pre-trained models are provided. w and f represent word2vec and fastText respectively.
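Both word2vec and fastText tooling can write vectors in a plain-text layout: a header line with the vocabulary size and the dimension, then one line per word holding the word and its components. This README does not say which file formats the downloads contain, so the sketch below only applies if you have a file in that text layout. The function names `read_text_vectors` and `cosine` are hypothetical, and the three 2-dimensional vectors are toy data, not values from any model listed here.

```python
import io
import math
from typing import Dict, List, TextIO


def read_text_vectors(handle: TextIO) -> Dict[str, List[float]]:
    """Parse vectors from the word2vec-style text layout."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Toy data standing in for a real vector file opened with open(path, encoding="utf-8").
    sample = io.StringIO("3 2\nking 1.0 0.0\nqueen 0.8 0.6\napple 0.0 1.0\n")
    vecs = read_text_vectors(sample)
    print(cosine(vecs["king"], vecs["queen"]), cosine(vecs["king"], vecs["apple"]))
```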

Check language code here.

| Language | ISO 639-1 | Vector Size | Corpus Size | Vocabulary Size |
|---|---|---|---|---|
| Bengali(w) & Bengali(f) | bn | 300 | 147M | 10059 |
| Catalan(w) & Catalan(f) | ca | 300 | 967M | 50013 |
| Chinese(w) & Chinese(f) | zh | 300 | 1G | 50101 |
| Danish(w) & Danish(f) | da | 300 | 295M | 30134 |
| Dutch(w) & Dutch(f) | nl | 300 | 1G | 50160 |
| Esperanto(w) & Esperanto(f) | eo | 300 | 1G | 50597 |
| Finnish(w) & Finnish(f) | fi | 300 | 467M | 30029 |
| French(w) & French(f) | fr | 300 | 1G | 50130 |
| German(w) & German(f) | de | 300 | 1G | 50006 |
| Hindi(w) & Hindi(f) | hi | 300 | 323M | 30393 |
| Hungarian(w) & Hungarian(f) | hu | 300 | 692M | 40122 |
| Indonesian(w) & Indonesian(f) | id | 300 | 402M | 30048 |
| Italian(w) & Italian(f) | it | 300 | 1G | 50031 |
| Japanese(w) & Japanese(f) | ja | 300 | 1G | 50108 |
| Javanese(w) & Javanese(f) | jv | 100 | 31M | 10019 |
| Korean(w) & Korean(f) | ko | 200 | 339M | 30185 |
| Malay(w) & Malay(f) | ms | 100 | 173M | 10010 |
| Norwegian(w) & Norwegian(f) | no | 300 | 1G | 50209 |
| Norwegian Nynorsk(w) & Norwegian Nynorsk(f) | nn | 100 | 114M | 10036 |
| Polish(w) & Polish(f) | pl | 300 | 1G | 50035 |
| Portuguese(w) & Portuguese(f) | pt | 300 | 1G | 50246 |
| Russian(w) & Russian(f) | ru | 300 | 1G | 50102 |
| Spanish(w) & Spanish(f) | es | 300 | 1G | 50003 |
| Swahili(w) & Swahili(f) | sw | 100 | 24M | 10222 |
| Swedish(w) & Swedish(f) | sv | 300 | 1G | 50052 |
| Tagalog(w) & Tagalog(f) | tl | 100 | 38M | 10068 |
| Thai(w) & Thai(f) | th | 300 | 696M | 30225 |
| Turkish(w) & Turkish(f) | tr | 200 | 370M | 30036 |
| Vietnamese(w) & Vietnamese(f) | vi | 100 | 74M | 10087 |
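The vocabulary sizes in the table sit slightly above round numbers such as 10000, 30000 and 50000. One way such figures can arise, and this is an assumption rather than something this README confirms, is to turn a target size like `--vocab_size=50000` into a minimum frequency threshold: every word that ties with the last word inside the target is kept, so the final vocabulary can overshoot. The function names below are hypothetical and the corpus is toy data.

```python
from collections import Counter
from itertools import chain
from typing import Iterable, List, Set


def min_count_for_vocab(sentences: List[List[str]], vocab_size: int) -> int:
    """Frequency of the vocab_size-th most common word (1 for an empty corpus)."""
    counts = Counter(chain.from_iterable(sentences))
    top = counts.most_common(vocab_size)
    return top[-1][1] if top else 1


def vocabulary(sentences: Iterable[List[str]], min_count: int) -> Set[str]:
    """All words that occur at least min_count times."""
    counts = Counter(chain.from_iterable(sentences))
    return {word for word, count in counts.items() if count >= min_count}


if __name__ == "__main__":
    # Toy corpus: "b" and "c" tie, so a target of 2 words yields 3.
    corpus = [["a", "a", "a", "b", "b"], ["c", "c", "d"]]
    threshold = min_count_for_vocab(corpus, vocab_size=2)
    print(threshold, sorted(vocabulary(corpus, threshold)))
```

A threshold of this kind could then be handed to a trainer as its minimum word count, which is where the tie-induced overshoot would come from.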

License

WordVectorModels
Copyright (c) 2018 Kyubyong Park, Astro

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

WordVectorModels is forked from wordvectors.