
Word Embedding

The DMTK Word Embedding tool is a parallelization of the Word2Vec algorithm on top of Multiverso. It provides an efficient "scaling to industry size" solution for word embedding.

How to install

Linux Installation

cd multiverso/Applications/WordEmbedding		
		
cmake CMakeLists.txt		
		
make		

Windows Installation

  1. Get and build the DMTK Framework Multiverso.

  2. Open Multiverso.sln, change the configuration and platform to Release and x64, and set the include and lib paths of multiverso in the WordEmbedding project properties.

  3. Enable OpenMP 2.0 support.

  4. Build the solution.

How to run

For single machine training, run

WordEmbedding -param_name param_value
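For example, using the parameters documented in the sections below, a single-machine run could look like the following sketch (the file names are placeholders and the values are only illustrative):

WordEmbedding -train_file enwiki2014 -read_vocab enwiki2014_vocab.txt -output enwiki2014_300.bin -size 300 -cbow 1 -hs 0 -negative 5 -epoch 20 -window 5 -sample 0.001 -alpha 0.05 -min_count 5 -binary 1 -threads 8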

To run in a distributed environment with MPI, you first need to decide which machines you will use and create a machine file listing them, like this:

10.153.151.126
10.153.151.127
10.153.151.128
10.153.151.129

You also need to start a listening process (smpd) on every machine:

smpd -d -p port_number

Then you can create a run.bat like the following and run it:

set param_name=param_value

mpirun -m machine_file_path -port port_number WordEmbedding -param_name %param_name%

Parameter Setting

There are several parameters that need to be set for Word2Vec experiments. Here is an example showing the format of run.bat. Suppose we are going to train a CBOW model with 5 negative samples and 300-dimensional word embeddings.

set size=300
set text=enwiki2014
set read_vocab="D:\Users\xxxx\Run\enwiki2014_vocab_m5.txt"
set train_file="D:\Users\xxxx\Run\enwiki2014"
set binary=1
set cbow=1
set alpha=0.05
set epoch=20
set window=5
set sample=0.001
set hs=0
set negative=5
set threads=15
set mincount=5
set sw_file="D:\Users\xxxx\Run\stopwords_simple.txt"
set stopwords=5
set data_block_size=100000000
set max_preload_data_size=300000000
set use_adagrad=0
set output=%text%_%size%.bin
set log_file=%text%_%size%.log
set is_pipeline=1

mpiexec.exe -machinefile machine_file.txt -port 9141 WordEmbedding.exe -is_pipeline %is_pipeline% -max_preload_data_size %max_preload_data_size% -alpha %alpha% -data_block_size %data_block_size% -train_file %train_file% -output %output% -threads %threads% -size %size% -binary %binary% -cbow %cbow% -epoch %epoch% -negative %negative% -hs %hs% -sample %sample% -min_count %mincount% -window %window% -stopwords %stopwords% -sw_file %sw_file% -read_vocab %read_vocab% -use_adagrad %use_adagrad%  2>&1 1>%log_file%
  • Note that the training file train_file (e.g. enwiki2014) should be divided into n smaller datasets, with one part placed on each machine, while every machine keeps the whole vocabulary dictionary read_vocab and the stop-word file sw_file. A sketch of how the corpus could be split is shown below.
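How the corpus is split is up to you; below is a minimal Python sketch that divides a corpus into n roughly equal parts by distributing lines round-robin (the script name, the output naming scheme, and line-based splitting are assumptions for illustration, not part of the tool):

# split_corpus.py - illustrative sketch: split a training corpus into n parts,
# one part per machine, by assigning whole lines round-robin so each part
# receives a roughly equal share of the data.
import sys

def split_corpus(corpus_path, n_parts):
    # Open one output file per machine, e.g. enwiki2014.part0 ... enwiki2014.part3
    outputs = [open("%s.part%d" % (corpus_path, i), "w", encoding="utf-8")
               for i in range(n_parts)]
    with open(corpus_path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f):
            outputs[line_no % n_parts].write(line)
    for out in outputs:
        out.close()

if __name__ == "__main__":
    # Usage: python split_corpus.py enwiki2014 4
    split_corpus(sys.argv[1], int(sys.argv[2]))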

Basic Model Setting

  • size, the word embedding size (dimension).
  • cbow, 0 or 1, default 1, whether to use CBOW; otherwise skip-gram is used.
  • alpha, the initial learning rate, usually set to 0.025.
  • window, the window size.
  • sample, the sub-sampling threshold, default 1e-3.
  • hs, 0 or 1, default 1, whether to use hierarchical softmax; otherwise negative sampling is used. When hs = 1, negative must be 0 (see the example after this list).
  • negative, the number of negative samples used in negative sampling; set it to 0 when hs = 1.
  • min_count, words with a frequency lower than min_count are removed from the dictionary.
  • use_adagrad, 0 or 1, whether to use AdaGrad to adjust the learning rate.
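To make the hs / negative rule concrete, these are the two valid combinations, written in the same batch style as the run.bat above (the values shown are the ones used on this page, not requirements):

rem negative sampling (as in the run.bat example above)
set hs=0
set negative=5

rem hierarchical softmax (negative must be 0)
set hs=1
set negative=0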

File Setting

  • train_file, the training corpus file, e.g. enwiki2014.
  • read_vocab, the file from which all the vocabulary counts are read (see the sketch after this list for one way to build it).
  • binary, 0 or 1, indicates whether to write the embedding vectors in binary format.
  • output, the output file that stores all the embedding vectors.
  • stopwords, 0 or 1, whether to skip training on stop words.
  • sw_file, the file containing all the stop words; only used when stopwords = 1.
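This page does not spell out the on-disk format of read_vocab. Assuming it follows the usual word2vec convention of one "word count" pair per line, most frequent word first, a vocabulary file could be built with a sketch like the one below (the script name and the assumed format should be checked against your build before relying on this):

# build_vocab.py - illustrative sketch: count word frequencies over the full
# corpus and write them as "word count" lines, most frequent first.
# ASSUMPTION: read_vocab uses a word2vec-style "word count" per-line format.
import sys
from collections import Counter

def build_vocab(corpus_path, vocab_path, min_count=5):
    counts = Counter()
    with open(corpus_path, "r", encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    with open(vocab_path, "w", encoding="utf-8") as out:
        for word, count in counts.most_common():
            if count >= min_count:  # mirror the min_count filter
                out.write("%s %d\n" % (word, count))

if __name__ == "__main__":
    # Usage: python build_vocab.py enwiki2014 enwiki2014_vocab_m5.txt 5
    build_vocab(sys.argv[1], sys.argv[2],
                int(sys.argv[3]) if len(sys.argv) > 3 else 5)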

Other Configuration

  • is_pipeline, 0 or 1, whether to use the pipeline.
  • threads, the number of threads to run on one machine.
  • epoch, the number of training epochs.
  • data_block_size, default 1 MB, the maximum number of bytes a data block will store.
  • max_preload_data_size, default 8 GB, the maximum data size (in bytes) the program will preload. It helps you control memory usage.
  • server_endpoint_file, the server ZMQ socket endpoint file used in the MPI-free version.

Output File Format

The final word embedding is saved on the rank 0 machine. Below is an example of the output file; you can easily use the word embeddings in other tasks. All values are separated by whitespace.

word_number_m word_embedding_size_n
word_name_1 dimension_1_of_word_1 dimension_2_of_word_1 ... dimension_n_of_word_1
word_name_2 dimension_1_of_word_2 dimension_2_of_word_2 ... dimension_n_of_word_2
word_name_3 dimension_1_of_word_3 dimension_2_of_word_3 ... dimension_n_of_word_3
...
word_name_m dimension_1_of_word_m dimension_2_of_word_m ... dimension_n_of_word_m
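If the embeddings are written as text (binary = 0), the format above can be read back with a short Python sketch like this (the file name is a placeholder; the binary = 1 output is not covered here):

# load_embeddings.py - illustrative sketch: read the whitespace-separated text
# output (binary = 0) into a {word: vector} dictionary.
def load_embeddings(path):
    embeddings = {}
    with open(path, "r", encoding="utf-8") as f:
        # First line: vocabulary size m and embedding size n.
        header = f.readline().split()
        vocab_size, dim = int(header[0]), int(header[1])
        # Each following line: a word followed by its n embedding values.
        for line in f:
            parts = line.rstrip().split()
            if len(parts) != dim + 1:
                continue  # skip malformed or empty lines
            embeddings[parts[0]] = [float(v) for v in parts[1:]]
    assert len(embeddings) <= vocab_size  # sanity check against the header
    return embeddings

if __name__ == "__main__":
    vectors = load_embeddings("enwiki2014_300.txt")  # placeholder file name
    print(len(vectors), "words loaded")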

Performance

We report the performance of the DMTK Word Embedding tool on the English version of Wiki2014, which contains 3,402,883,423 tokens. The experiments were run on 20 cores of an Intel Xeon E5-2670 CPU on each machine, and the results are given below.

Program Name      Dimension   Machines   Analogical Reasoning   WS353   Time
Google Word2Vec   300         1          64.6%                  64.4%   52340s
DMTK Word2Vec     300         4          65.3%                  75.1%   23484s

* The dataset statistics are computed after data preprocessing.

* Analogical reasoning is evaluated by accuracy.

* WS353 is evaluated by Spearman's rank correlation.

* All the above experiments were run with the following configuration: -cbow 1 -size 300 -alpha 0.05 -epoch 20 -window 5 -sample 0.0001 -hs 0 -negative 5 -min_count 5 -use_adagrad 0. For DMTK Word2Vec, the data block size is set with -data_block_size 100000000 (100 MB).

* The best result over the 20 epochs is taken as the final result.

Convergence is as follows:

[Figure: Analogical Reasoning convergence, Google Word2Vec vs. DMTK Word2Vec]

[Figure: WS353 convergence, Google Word2Vec vs. DMTK Word2Vec]

Training Suggestions

For High-Quality Word Embeddings

  • Adjust the learning rate alpha. You can try different learning rates according to the convergence behavior of each epoch.

  • For a small dataset, you can try skip-gram by setting cbow = 0. For a large dataset, you can try cbow = 1.

  • A small sample value may improve the performance of word embedding.

  • The initial values of the word embeddings are randomly sampled from Uniform[-0.5/embedding_size, 0.5/embedding_size]. Think carefully before changing this.

  • Finally, the dataset itself is really important.

For Speed

  • Hierarchical softmax can reach high-quality word embeddings in fewer epochs, but negative sampling is faster per epoch and the two perform almost the same once converged. You can set hs = 0 and tune negative.

  • is_pipeline = 1 means the model trains and requests parameters in parallel, which helps reduce training time.

  • You can try a larger number of threads, at the risk of slightly lower accuracy.

References

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.