Word Vector Models

Pre-trained Word Vector Models of 30+ Languages

This project has two purposes. First, I'd like to share some of my experience with NLP tasks such as segmentation and word vectors. Second, and more important, some people are probably searching for pre-trained word vector models for non-English languages. Alas! English has received far more attention than any other language. Check this to see how easily you can get a variety of pre-trained English word vectors without effort. I think it's time to turn our eyes to a multi-language version of this.

Requirements

Requires Python 3.

See requirements

Language Support

If you use one of the languages below, you also need the corresponding additional library.

Chinese: jieba
Japanese: mecab
Korean: konlpy
Thai: pythai
Vietnamese: pyvi
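
Before building a corpus, it can help to check whether the extra library for a language is already available. The sketch below is only an illustration: the name `missing_tokenizer` is hypothetical, and the import names in the table (for example `MeCab` for mecab) are assumptions, since a package's import name can differ from its install name.

```python
from importlib.util import find_spec

# Language code -> importable module name for the library listed above.
# These import names are assumptions; adjust them to your installation.
TOKENIZER_MODULES = {
    "zh": "jieba",
    "ja": "MeCab",
    "ko": "konlpy",
    "th": "pythai",
    "vi": "pyvi",
}


def missing_tokenizer(lcode):
    """Return the module that still needs installing for lcode, or None."""
    module = TOKENIZER_MODULES.get(lcode)
    if module is None:
        return None  # no additional library is listed for this language
    # find_spec looks the module up without importing it.
    return None if find_spec(module) is not None else module


if __name__ == "__main__":
    for code in ("en", "ko", "zh"):
        print(code, missing_tokenizer(code))
```

A return value of `None` means either that the language needs no extra library or that the library was found.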

Word Embeddings

If you train your own model, you have to install the corresponding library below.

fastText: fasttext
Word2Vec: gensim

Background / References

Check this to learn what a word embedding is.
Check this to quickly get a picture of Word2Vec.
Check this to install fastText.
Watch this to really understand what's happening under the hood of Word2Vec.
Get various English word vectors here if needed.

Work Flow

STEP 1-1. Download the Wikipedia database backup dump for the language you want. (For example, for the English wiki, go to https://dumps.wikimedia.org/enwiki/, click the latest timestamp, and download the enwiki-YYYYMMDD-pages-articles-multistream.xml.bz2 file.)
STEP 1-2. Install the required packages.
STEP 2. Extract the running text into the ./data directory.
STEP 3. Run ./src/build_corpus.py.

python build_corpus.py --lcode=ko
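
The file naming pattern from STEP 1-1 can be sketched as a small helper. This is a minimal illustration, not part of the project: `dump_url` is a hypothetical name, the `<wiki>/<timestamp>/` directory layout is an assumption based on the "click the latest timestamp" instruction, and the date in the example is a placeholder rather than a real dump.

```python
DUMP_HOST = "https://dumps.wikimedia.org"


def dump_url(lcode, date):
    """Build a dump URL from a language code and a YYYYMMDD timestamp."""
    if len(date) != 8 or not date.isdigit():
        raise ValueError("date must look like YYYYMMDD")
    wiki = f"{lcode}wiki"
    filename = f"{wiki}-{date}-pages-articles-multistream.xml.bz2"
    # Assumed layout: host / <lcode>wiki / <timestamp> / <file name>
    return f"{DUMP_HOST}/{wiki}/{date}/{filename}"


if __name__ == "__main__":
    # "20180101" is a placeholder; use a timestamp listed on the dump page.
    print(dump_url("ko", "20180101"))
```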

STEP 4-1. Run ./src/train_word2vec.py to get Word2Vec word vectors:

python train_word2vec.py --lcode=en --vector_size=300 --window_size=5 --vocab_size=50000 --num_negative=5
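
For orientation, here is one way the flags above might map onto gensim's `Word2Vec` keyword arguments. It is a sketch under assumptions, not the actual contents of train_word2vec.py: the helper names are hypothetical, the mapping of `--vocab_size` to `max_final_vocab` is a guess, and the keyword names follow gensim 4.x (older releases use `size` instead of `vector_size`).

```python
def word2vec_kwargs(vector_size=300, window_size=5, vocab_size=50000,
                    num_negative=5):
    """Translate the command-line flags into gensim 4.x keyword arguments."""
    return {
        "vector_size": vector_size,     # embedding dimensionality
        "window": window_size,          # context window on each side
        "max_final_vocab": vocab_size,  # assumed meaning of --vocab_size
        "negative": num_negative,       # negative samples per positive pair
    }


def train(sentences, **flags):
    """Train on an iterable of token lists; needs gensim 4.x installed."""
    from gensim.models import Word2Vec  # imported lazily on purpose
    return Word2Vec(sentences=sentences, **word2vec_kwargs(**flags))


if __name__ == "__main__":
    print(word2vec_kwargs(vector_size=200, vocab_size=30000))
```

Keeping the flag translation in its own function makes it possible to inspect the settings without gensim installed.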

STEP 4-2. Run ./scripts/fasttext.sh to get fastText word vectors.
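
As a rough idea of what a fastText training call looks like, the sketch below assembles a command line for the `fasttext` binary. It does not reproduce fasttext.sh: the function name and the `data/<lcode>.txt` corpus path are hypothetical, and the default values simply mirror the Word2Vec example in STEP 4-1. The `-input`, `-output`, `-dim`, `-ws`, and `-neg` options belong to the fastText command-line tool.

```python
import shlex


def fasttext_command(lcode, dim=300, window=5, negative=5, model="skipgram"):
    """Return the argument list for an unsupervised fastText training run."""
    corpus = f"data/{lcode}.txt"  # hypothetical corpus location
    output = f"data/{lcode}"      # fastText appends .bin and .vec itself
    return [
        "fasttext", model,
        "-input", corpus,
        "-output", output,
        "-dim", str(dim),
        "-ws", str(window),
        "-neg", str(negative),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is trained here.
    print(shlex.join(fasttext_command("ko", dim=200)))
```

The list form can be passed directly to `subprocess.run` once the paths match your setup.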

Pre-trained models

Two types of pre-trained models are provided. In the table below, (w) and (f) denote Word2Vec and fastText, respectively.

Check the language codes here.

Language	ISO 639-1	Vector Size	Corpus Size	Vocabulary Size
Bengali(w) & Bengali(f)	bn	300	147M	10059
Catalan(w) & Catalan(f)	ca	300	967M	50013
Chinese(w) & Chinese(f)	zh	300	1G	50101
Danish(w) & Danish(f)	da	300	295M	30134
Dutch(w) & Dutch(f)	nl	300	1G	50160
Esperanto(w) & Esperanto(f)	eo	300	1G	50597
Finnish(w) & Finnish(f)	fi	300	467M	30029
French(w) & French(f)	fr	300	1G	50130
German(w) & German(f)	de	300	1G	50006
Hindi(w) & Hindi(f)	hi	300	323M	30393
Hungarian(w) & Hungarian(f)	hu	300	692M	40122
Indonesian(w) & Indonesian(f)	id	300	402M	30048
Italian(w) & Italian(f)	it	300	1G	50031
Japanese(w) & Japanese(f)	ja	300	1G	50108
Javanese(w) & Javanese(f)	jv	100	31M	10019
Korean(w) & Korean(f)	ko	200	339M	30185
Malay(w) & Malay(f)	ms	100	173M	10010
Norwegian(w) & Norwegian(f)	no	300	1G	50209
Norwegian Nynorsk(w) & Norwegian Nynorsk(f)	nn	100	114M	10036
Polish(w) & Polish(f)	pl	300	1G	50035
Portuguese(w) & Portuguese(f)	pt	300	1G	50246
Russian(w) & Russian(f)	ru	300	1G	50102
Spanish(w) & Spanish(f)	es	300	1G	50003
Swahili(w) & Swahili(f)	sw	100	24M	10222
Swedish(w) & Swedish(f)	sv	300	1G	50052
Tagalog(w) & Tagalog(f)	tl	100	38M	10068
Thai(w) & Thai(f)	th	300	696M	30225
Turkish(w) & Turkish(f)	tr	200	370M	30036
Vietnamese(w) & Vietnamese(f)	vi	100	74M	10087
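
Once a model is downloaded, a quick sanity check is to compare a few word pairs. The sketch below assumes the vectors are available in the plain-text `.vec` layout that fastText writes (a header line with the word count and dimension, then one word and its values per line); the actual format of the files linked above may differ. The function names and the toy vectors are made up for illustration and are not taken from any released model.

```python
import io
import math

# Toy data in .vec layout; these numbers are invented for the example.
SAMPLE = "3 2\nking 1.0 0.0\nqueen 0.9 0.1\napple 0.0 1.0\n"


def load_vec(handle):
    """Read text-format vectors from an open file into a dict."""
    _count, dim = (int(x) for x in handle.readline().split()[:2])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    vecs = load_vec(io.StringIO(SAMPLE))
    print(cosine(vecs["king"], vecs["queen"]) > cosine(vecs["king"], vecs["apple"]))
```

For a real file, replace the `io.StringIO` object with `open(path, encoding="utf-8")`.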

License

WordVectorModels
Copyright (c) 2018 Kyubyong Park, Astro

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

WordVectorModels is forked from wordvectors.
