
word-embeddings-for-Torch

Pretrained word embeddings (a.k.a. word vectors) take a lot of disk space and memory to process. Moreover, many of these pretrained embeddings come in .bin format with different data layouts (GloVe vs. word2vec). These scripts convert .bin embeddings to Torch's .t7 format for easy loading and use. In addition, you can reduce the size of the .t7 file by fitting it to your training corpus vocabulary.

The script requires ~4.5GB of free RAM unless you use the [-r|--reduce] parameter.

Torch (.t7) file output format

{
  i2w -- {idx: token}
  tensor -- FloatTensor - size: vocabsize x 300
  w2i -- {token: idx}
}
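Once converted, the .t7 file can be loaded directly in Torch and queried through the three fields above. A minimal sketch (the output filename and the token 'king' are assumptions for illustration; any token present in the embedding vocabulary works):

```lua
-- Load the converted embeddings table described above.
local emb = torch.load('GoogleNews-vectors-negative300.t7')

-- Look up a token's row index via the w2i map, then slice its
-- 300-dimensional vector out of the FloatTensor.
local idx = emb.w2i['king']   -- assumes 'king' is in the vocabulary
local vec = emb.tensor[idx]   -- FloatTensor of size 300

-- Map an index back to its token with i2w.
print(emb.i2w[idx])
```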

Usage

Convert all word2vec embeddings to .t7

th word2vec.lua GoogleNews-vectors-negative300.bin  

Extract and convert to .t7 only the tokens present in your training corpus

th word2vec.lua filename.bin -r /path/to/corpus

Extract and print tokens only

th word2vec.lua filename.bin -t

Extract and print tokens + their corresponding vector representations to stdout

th word2vec.lua filename.bin -tv

If your /path/to/corpus contains several .txt files (train.txt, valid.txt, test.txt) then the script will read each and create a cumulative vocabulary.

Available Converters

Word Embeddings
