Pre-trained Word Vector Models of 30+ Languages
This project has two purposes. First, I'd like to share some of my experience with NLP tasks such as segmentation and word vectors. Second, and more importantly, some people are probably searching for pre-trained word vector models for non-English languages. Alas! English has gained much more attention than any other language. Check this to see how easily you can get a variety of pre-trained English word vectors without effort. I think it's time to turn our eyes to a multi-language version of this.
Requires Python 3.
See requirements
If you work with any of the languages below, you also need an additional library:
- Chinese: jieba
- Japanese: mecab
- Korean: konlpy
- Thai: pythai
- Vietnamese: pyvi
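
As a quick check before building a corpus, the language-to-library list above can be written down as a lookup table. The sketch below is only illustrative and not part of the project: the helper names are hypothetical, the keys are the ISO 639-1 codes used in the model table, and it assumes each library's import name matches the package name listed above. That assumption may not hold for every package (MeCab bindings, for instance, are commonly imported as `MeCab`).

```python
from importlib.util import find_spec
from typing import Optional

# Extra tokenizer/segmenter package per language code, as listed in this README.
EXTRA_PACKAGES = {
    "zh": "jieba",
    "ja": "mecab",
    "ko": "konlpy",
    "th": "pythai",
    "vi": "pyvi",
}


def extra_package(lcode: str) -> Optional[str]:
    """Return the extra package listed for a language code, or None if none is listed."""
    return EXTRA_PACKAGES.get(lcode)


def is_installed(module_name: str) -> bool:
    """Report whether a top-level module can be found.

    Assumption: the import name equals the package name, which is not
    guaranteed for every library.
    """
    return find_spec(module_name) is not None


if __name__ == "__main__":
    for lcode in ("ko", "en"):
        package = extra_package(lcode)
        if package is None:
            print(f"{lcode}: no extra library listed")
        else:
            status = "found" if is_installed(package) else "not found"
            print(f"{lcode}: needs {package} ({status})")
```
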
If you train your own model, you also have to install the corresponding library below:
- fastText: fasttext
- Word2Vec: gensim
- Check this to know what word embedding is.
- Check this to quickly get a picture of Word2vec.
- Check this to install fastText.
- Watch this to really understand what's happening under the hood of Word2vec.
- Go get various English word vectors here if needed.
- STEP 1-1. Download the Wikipedia database backup dump of the language you want. For example, for the English wiki, go to https://dumps.wikimedia.org/enwiki/, click the latest timestamp, and download the enwiki-YYYYMMDD-pages-articles-multistream.xml.bz2 file.
- STEP 1-2. Install the required packages.
- STEP 2. Extract running text to the ./data directory.
- STEP 3. Run ./src/build_corpus.py:
  python build_corpus.py --lcode=ko
- STEP 4-1. Run ./src/train_word2vec.py to get Word2Vec word vectors:
  python train_word2vec.py --lcode=en --vector_size=300 --window_size=5 --vocab_size=50000 --num_negative=5
- STEP 4-2. Run ./script/fasttext.sh to get fastText word vectors.
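
The steps above can be sketched as a small helper that assembles the dump location and the two Python commands for a given language code. This is a minimal sketch, not part of the project: the function names are hypothetical, the URL and file name patterns generalize the enwiki example from STEP 1-1 on the assumption that other languages follow the same layout, and the flags are copied from the commands shown in the steps. It only builds strings and argument lists; nothing is downloaded or executed.

```python
from typing import List


def dump_index_url(lcode: str) -> str:
    """Index page listing the dump timestamps for one language's wiki."""
    return f"https://dumps.wikimedia.org/{lcode}wiki/"


def dump_filename(lcode: str, date: str) -> str:
    """File name of the articles dump; `date` is the YYYYMMDD timestamp."""
    return f"{lcode}wiki-{date}-pages-articles-multistream.xml.bz2"


def build_corpus_command(lcode: str) -> List[str]:
    """Argument list for STEP 3."""
    return ["python", "build_corpus.py", f"--lcode={lcode}"]


def train_word2vec_command(
    lcode: str,
    vector_size: int = 300,
    window_size: int = 5,
    vocab_size: int = 50000,
    num_negative: int = 5,
) -> List[str]:
    """Argument list for STEP 4-1, using the flag values from the example command."""
    return [
        "python",
        "train_word2vec.py",
        f"--lcode={lcode}",
        f"--vector_size={vector_size}",
        f"--window_size={window_size}",
        f"--vocab_size={vocab_size}",
        f"--num_negative={num_negative}",
    ]


if __name__ == "__main__":
    lcode = "ko"
    print(dump_index_url(lcode))
    # "20180101" is a placeholder date, not a real dump timestamp.
    print(dump_filename(lcode, "20180101"))
    print(" ".join(build_corpus_command(lcode)))
    print(" ".join(train_word2vec_command(lcode, vector_size=200, vocab_size=30000)))
```

The argument lists can be passed to `subprocess.run` from the `./src` directory if you prefer to script the whole pipeline.
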
Two types of pre-trained models are provided. w and f represent word2vec and fastText, respectively.
Check the language codes here.
Language | ISO 639-1 | Vector Size | Corpus Size | Vocabulary Size |
---|---|---|---|---|
Bengali(w) & Bengali(f) | bn | 300 | 147M | 10059 |
Catalan(w) & Catalan(f) | ca | 300 | 967M | 50013 |
Chinese(w) & Chinese(f) | zh | 300 | 1G | 50101 |
Danish(w) & Danish(f) | da | 300 | 295M | 30134 |
Dutch(w) & Dutch(f) | nl | 300 | 1G | 50160 |
Esperanto(w) & Esperanto(f) | eo | 300 | 1G | 50597 |
Finnish(w) & Finnish(f) | fi | 300 | 467M | 30029 |
French(w) & French(f) | fr | 300 | 1G | 50130 |
German(w) & German(f) | de | 300 | 1G | 50006 |
Hindi(w) & Hindi(f) | hi | 300 | 323M | 30393 |
Hungarian(w) & Hungarian(f) | hu | 300 | 692M | 40122 |
Indonesian(w) & Indonesian(f) | id | 300 | 402M | 30048 |
Italian(w) & Italian(f) | it | 300 | 1G | 50031 |
Japanese(w) & Japanese(f) | ja | 300 | 1G | 50108 |
Javanese(w) & Javanese(f) | jv | 100 | 31M | 10019 |
Korean(w) & Korean(f) | ko | 200 | 339M | 30185 |
Malay(w) & Malay(f) | ms | 100 | 173M | 10010 |
Norwegian(w) & Norwegian(f) | no | 300 | 1G | 50209 |
Norwegian Nynorsk(w) & Norwegian Nynorsk(f) | nn | 100 | 114M | 10036 |
Polish(w) & Polish(f) | pl | 300 | 1G | 50035 |
Portuguese(w) & Portuguese(f) | pt | 300 | 1G | 50246 |
Russian(w) & Russian(f) | ru | 300 | 1G | 50102 |
Spanish(w) & Spanish(f) | es | 300 | 1G | 50003 |
Swahili(w) & Swahili(f) | sw | 100 | 24M | 10222 |
Swedish(w) & Swedish(f) | sv | 300 | 1G | 50052 |
Tagalog(w) & Tagalog(f) | tl | 100 | 38M | 10068 |
Thai(w) & Thai(f) | th | 300 | 696M | 30225 |
Turkish(w) & Turkish(f) | tr | 200 | 370M | 30036 |
Vietnamese(w) & Vietnamese(f) | vi | 100 | 74M | 10087 |
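
Once a model is downloaded, a common first step is to look up a vector and compare words by cosine similarity. The sketch below assumes the vectors are available in the plain-text word2vec format (a header line with vocabulary size and dimension, then one word and its values per line). Whether a given download ships in that format is an assumption, and the tiny three-word sample stands in for a real file. The function names are hypothetical.

```python
import io
import math
from typing import Dict, Iterable, List, Tuple


def read_text_vectors(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse plain-text word2vec data: a 'vocab_size dim' header, then 'word v1 ... vN' rows."""
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])  # header[0] is the declared vocabulary size
    vectors: Dict[str, List[float]] = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(
    vectors: Dict[str, List[float]], word: str, topn: int = 1
) -> List[Tuple[str, float]]:
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 2-dimensional sample, for illustration only.
SAMPLE = "3 2\nking 1.0 0.0\nqueen 0.9 0.1\napple 0.0 1.0\n"
vectors = read_text_vectors(io.StringIO(SAMPLE))

if __name__ == "__main__":
    print(most_similar(vectors, "king"))
```

For a real file, pass an open file object such as `open(path, encoding="utf-8")` instead of the in-memory sample. A brute-force scan like this is fine for the vocabulary sizes in the table but is slower than a vectorized library.
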
WordVectorModels
Copyright (c) 2018 Kyubyong Park, Astro
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
WordVectorModels is forked from wordvectors.