Pretrained word embeddings (aka word vectors) are large and expensive to process. Moreover, many pretrained embeddings come as .bin files with differing data layouts (GloVe vs. word2vec). These scripts convert .bin embeddings to .t7 format for easy loading and use in Torch. You can also shrink the resulting .t7 file by restricting it to your training corpus vocabulary.
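For context, the word2vec binary layout is a text header "vocab_size dim\n" followed by, for each word, the token bytes, a space, and dim little-endian float32 values. The following is a minimal Python sketch of that layout (not this repo's Lua code), shown here only to make the format concrete:

```python
import io
import struct

def read_word2vec_bin(f):
    """Parse a word2vec-style .bin stream: header 'vocab dim\n', then
    per word: token bytes, a space, and dim float32 values."""
    vocab_size, dim = map(int, f.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        # Collect token bytes up to the separating space; strip() also
        # drops the newline some writers emit after each vector.
        token = bytearray()
        while (ch := f.read(1)) != b' ':
            token.extend(ch)
        vec = struct.unpack('<%df' % dim, f.read(4 * dim))
        vectors[token.decode('utf-8').strip()] = vec
    return vectors

# Build a tiny in-memory .bin with two 3-d vectors and parse it back.
buf = io.BytesIO()
buf.write(b'2 3\n')
buf.write(b'hello ' + struct.pack('<3f', 1.0, 2.0, 3.0))
buf.write(b'world ' + struct.pack('<3f', 4.0, 5.0, 6.0))
buf.seek(0)
vecs = read_word2vec_bin(buf)
```

GloVe files use a different layout, which is why the conversion script must know which format it is reading.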
The script requires ~4.5 GB of free RAM unless you use the [-r|--reduce] parameter.
The resulting .t7 file contains a table of the form:

{
  i2w    -- {idx: token}
  tensor -- FloatTensor of size vocabsize x 300
  w2i    -- {token: idx}
}
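The three fields are two index maps plus the embedding matrix: `w2i` takes a token to its row index, `i2w` is the inverse map, and `tensor` holds one embedding per row. A hedged Python mimic of that contract (the real file is loaded in Torch with `torch.load`; note Torch indexing is 1-based, while this sketch uses 0-based lists):

```python
# Stand-ins for the .t7 table's fields; values are illustrative only.
rows = [[0.1, 0.2], [0.3, 0.4]]               # plays the role of `tensor`
i2w = {0: 'king', 1: 'queen'}                 # idx -> token
w2i = {tok: idx for idx, tok in i2w.items()}  # token -> idx

def vector_for(token):
    """Row lookup: token -> idx -> embedding row."""
    return rows[w2i[token]]
```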
Convert all word2vec embeddings to .t7
th word2vec.lua GoogleNews-vectors-negative300.bin
Extract and convert to .t7 only for tokens in your training corpus
th word2vec.lua filename.bin -r /path/to/corpus
Extract and print tokens only
th word2vec.lua filename.bin -t
Extract and print tokens and their corresponding vector representations to stdout
th word2vec.lua filename.bin -tv
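The idea behind [-r|--reduce] is to keep only the embedding rows whose tokens occur in your corpus, reindexing them compactly. A Python sketch of that filtering step (function and variable names are hypothetical, not from word2vec.lua):

```python
def reduce_embeddings(w2i, rows, corpus_vocab):
    """Keep only rows whose token appears in corpus_vocab, assigning
    new compact indices in the original row order."""
    new_w2i, new_rows = {}, []
    for token, idx in sorted(w2i.items(), key=lambda kv: kv[1]):
        if token in corpus_vocab:
            new_w2i[token] = len(new_rows)
            new_rows.append(rows[idx])
    new_i2w = {i: t for t, i in new_w2i.items()}
    return new_w2i, new_i2w, new_rows

# Reduce a 3-token embedding table to a 2-token corpus vocabulary.
w2i = {'king': 0, 'queen': 1, 'pawn': 2}
rows = [[1.0], [2.0], [3.0]]
small_w2i, small_i2w, small_rows = reduce_embeddings(w2i, rows, {'queen', 'pawn'})
```

Since the GoogleNews vocabulary is 3M tokens and a typical training corpus uses far fewer, this is where the large memory and disk savings come from.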
If your /path/to/corpus contains several .txt files (train.txt, valid.txt, test.txt), the script will read each one and build a cumulative vocabulary.
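That cumulative vocabulary is simply the union of tokens over every .txt file in the directory. A Python sketch of the same idea (word2vec.lua's actual tokenization may differ; whitespace splitting is assumed here):

```python
import glob
import os
import tempfile

def corpus_vocab(corpus_dir):
    """Union of whitespace-split tokens over every .txt file in corpus_dir."""
    vocab = set()
    for path in glob.glob(os.path.join(corpus_dir, '*.txt')):
        with open(path, encoding='utf-8') as f:
            for line in f:
                vocab.update(line.split())
    return vocab

# Demo with a throwaway corpus of train/valid splits.
d = tempfile.mkdtemp()
for name, text in [('train.txt', 'the king'), ('valid.txt', 'the queen')]:
    with open(os.path.join(d, name), 'w', encoding='utf-8') as f:
        f.write(text)
vocab = corpus_vocab(d)
```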
- Download Google News (word2vec) (3.4 GB)
- Download GloVe - Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB)
- Download GloVe - Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB)