Pretrained word embeddings (aka word vectors) are large and expensive to process. Moreover, many pretrained embeddings come as .bin files with differing data layouts (GloVe vs. word2vec). These scripts convert .bin embeddings to .t7 format for easy loading and use in Torch. You can also shrink the resulting .t7 file by restricting it to your training corpus vocabulary.
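For context, the word2vec binary layout is a text header "vocab_size dim\n" followed by, for each word, the token bytes, a space, and dim little-endian float32 values. The following is a minimal Python sketch of that layout (not this repo's Lua code), shown here only to make the format concrete:

```python
import io
import struct

def read_word2vec_bin(f):
    """Parse a word2vec-style .bin stream: header 'vocab dim\n', then
    per word: token bytes, a space, and dim float32 values."""
    vocab_size, dim = map(int, f.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        # Collect token bytes up to the separating space; strip() also
        # drops the newline some writers emit after each vector.
        token = bytearray()
        while (ch := f.read(1)) != b' ':
            token.extend(ch)
        vec = struct.unpack('<%df' % dim, f.read(4 * dim))
        vectors[token.decode('utf-8').strip()] = vec
    return vectors

# Build a tiny in-memory .bin with two 3-d vectors and parse it back.
buf = io.BytesIO()
buf.write(b'2 3\n')
buf.write(b'hello ' + struct.pack('<3f', 1.0, 2.0, 3.0))
buf.write(b'world ' + struct.pack('<3f', 4.0, 5.0, 6.0))
buf.seek(0)
vecs = read_word2vec_bin(buf)
```

GloVe files use a different layout, which is why the conversion script must know which format it is reading.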
The script requires ~4.5 GB of free RAM unless you use the [-r|--reduce] parameter.
The resulting .t7 file contains a table of the form:

{
  i2w    -- {idx: token}
  tensor -- FloatTensor of size vocabsize x 300
  w2i    -- {token: idx}
}
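The three fields are two index maps plus the embedding matrix: `w2i` takes a token to its row index, `i2w` is the inverse map, and `tensor` holds one embedding per row. A hedged Python mimic of that contract (the real file is loaded in Torch with `torch.load`; note Torch indexing is 1-based, while this sketch uses 0-based lists):

```python
# Stand-ins for the .t7 table's fields; values are illustrative only.
rows = [[0.1, 0.2], [0.3, 0.4]]               # plays the role of `tensor`
i2w = {0: 'king', 1: 'queen'}                 # idx -> token
w2i = {tok: idx for idx, tok in i2w.items()}  # token -> idx

def vector_for(token):
    """Row lookup: token -> idx -> embedding row."""
    return rows[w2i[token]]
```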
Convert all word2vec embeddings to .t7
th word2vec.lua GoogleNews-vectors-negative300.bin
Extract and convert to .t7 only for tokens in your training corpus
th word2vec.lua filename.bin -r /path/to/corpus
Extract and print tokens only
th word2vec.lua filename.bin -t
Extract and print tokens and their corresponding vector representations to stdout
th word2vec.lua filename.bin -tv
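The idea behind [-r|--reduce] is to keep only the embedding rows whose tokens occur in your corpus, reindexing them compactly. A Python sketch of that filtering step (function and variable names are hypothetical, not from word2vec.lua):

```python
def reduce_embeddings(w2i, rows, corpus_vocab):
    """Keep only rows whose token appears in corpus_vocab, assigning
    new compact indices in the original row order."""
    new_w2i, new_rows = {}, []
    for token, idx in sorted(w2i.items(), key=lambda kv: kv[1]):
        if token in corpus_vocab:
            new_w2i[token] = len(new_rows)
            new_rows.append(rows[idx])
    new_i2w = {i: t for t, i in new_w2i.items()}
    return new_w2i, new_i2w, new_rows

# Reduce a 3-token embedding table to a 2-token corpus vocabulary.
w2i = {'king': 0, 'queen': 1, 'pawn': 2}
rows = [[1.0], [2.0], [3.0]]
small_w2i, small_i2w, small_rows = reduce_embeddings(w2i, rows, {'queen', 'pawn'})
```

Since the GoogleNews vocabulary is 3M tokens and a typical training corpus uses far fewer, this is where the large memory and disk savings come from.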
If your /path/to/corpus contains several .txt files (train.txt, valid.txt, test.txt), the script will read each one and build a cumulative vocabulary.
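That cumulative vocabulary is simply the union of tokens over every .txt file in the directory. A Python sketch of the same idea (word2vec.lua's actual tokenization may differ; whitespace splitting is assumed here):

```python
import glob
import os
import tempfile

def corpus_vocab(corpus_dir):
    """Union of whitespace-split tokens over every .txt file in corpus_dir."""
    vocab = set()
    for path in glob.glob(os.path.join(corpus_dir, '*.txt')):
        with open(path, encoding='utf-8') as f:
            for line in f:
                vocab.update(line.split())
    return vocab

# Demo with a throwaway corpus of train/valid splits.
d = tempfile.mkdtemp()
for name, text in [('train.txt', 'the king'), ('valid.txt', 'the queen')]:
    with open(os.path.join(d, name), 'w', encoding='utf-8') as f:
        f.write(text)
vocab = corpus_vocab(d)
```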
- Download Google News (word2vec) (3.4 GB)
- Download GloVe - Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB)
- Download GloVe - Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB)