WarpLDA: Cache Efficient Implementation of Latent Dirichlet Allocation

Introduction

WarpLDA is a cache-efficient implementation of Latent Dirichlet Allocation that samples each token in O(1) time.

Installation

Prerequisites:

  • GCC (>=4.8.5)
  • CMake (>=2.8.12)
  • git
  • libnuma
    • CentOS: yum install libnuma-devel
    • Ubuntu: apt-get install libnuma-dev

Clone this project

git clone https://github.com/thu-ml/warplda

Install the third-party dependency (gflags)

./get_gflags.sh

Download some data and split it into training and testing sets

cd data
wget https://raw.githubusercontent.com/sudar/Yahoo_LDA/master/test/ydir_1k.txt
head -n 900 ydir_1k.txt > ydir_train.txt
tail -n 100 ydir_1k.txt > ydir_test.txt
cd ..

Compile the project

./build.sh
cd release/src
make -j

Quick-start

Format the data

./format -input ../../data/ydir_train.txt -prefix train
./format -input ../../data/ydir_test.txt -vocab_in train.vocab -test -prefix test

Train the model

./warplda --prefix train --k 100 --niter 300

Check the result. Each line is a topic: its id, the number of tokens assigned to it, and the ten most frequent words with their probabilities.

vim train.info.full.txt

Infer the latent topics of the testing data.

./warplda --prefix test --model train.model --inference --niter 40 --perplexity 10

Data format

The data format is identical to that of Yahoo! LDA. The input is a text file in which each line is a document. The format of each line is

id1 id2 word1 word2 word3 ...

id1 and id2 are two string document identifiers, and each word is a string; all fields are separated by whitespace.
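
For example, a line for a short document might look like the following (the identifiers and words are purely illustrative):

doc_042 doc_042 apple banana apple orange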

Output format

WarpLDA generates a number of files:

.vocab (generated by format)

Each line of it is a word in the vocabulary.

.info.full.txt (generated by warplda -estimate)

The most frequent words for each topic. Each line is a topic, with its topic id, the number of tokens assigned to it, and a number of the most frequent words in the format (probability, word). The number of most frequent words is controlled by -ntop. .info.words.txt is a simpler version which contains only the words.
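
A line of .info.full.txt might therefore look like the following (all numbers and words are illustrative):

17 23456 (0.0812, network) (0.0654, neural) (0.0571, learning) ...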

.model (generated by warplda -estimate)

The word-topic count matrix. The first line contains four values

<size of vocabulary> <number of topics> <alpha> <beta>

Each of the remaining lines is a row of the word-topic count matrix, represented in the libsvm sparse vector format,

<number of elements> index:count index:count ...

For example, 0:2 on the first line means that the first word in the vocabulary is assigned to topic 0 two times.
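
As a rough sketch of how this file can be consumed, the short Python snippet below parses the header and the sparse rows and sums the token counts per topic. It assumes only the format described above; the file name train.model and all variable names are illustrative, and the snippet is not part of WarpLDA.

def read_model(path):
    # Read a WarpLDA .model file: one header line, then one sparse row per word.
    with open(path) as f:
        header = f.readline().split()
        vocab_size, num_topics = int(header[0]), int(header[1])
        alpha, beta = float(header[2]), float(header[3])
        rows = []  # rows[w] maps topic id -> count for word w
        for line in f:
            fields = line.split()
            n = int(fields[0])  # number of non-zero entries in this row
            row = {}
            for item in fields[1:1 + n]:
                topic, count = item.split(':')
                row[int(topic)] = int(count)
            rows.append(row)
    return vocab_size, num_topics, alpha, beta, rows

vocab_size, num_topics, alpha, beta, rows = read_model('train.model')
topic_totals = [0] * num_topics
for row in rows:
    for topic, count in row.items():
        topic_totals[topic] += count
print('tokens assigned to the first 10 topics:', topic_totals[:10])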

.z.estimate (generated by warplda -estimate)

The topic assignment of each token, in the libsvm sparse format. Each line is a document,

<number of tokens> <word id>:<topic id> <word id>:<topic id> ...
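
For instance, a document with three tokens might be stored as (all ids are illustrative):

3 12:5 7:5 104:63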

.z.inference (generated by warplda -inference)

The format is the same as .z.estimate.

Other features

  • Use a custom prefix for output: -prefix myprefix

  • Output perplexity every 10 iterations: -perplexity 10

  • Tune the Dirichlet hyperparameters: -alpha 10 -beta 0.1

  • Use data from the UCI Machine Learning Repository

      wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/vocab.nips.txt
      wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nips.txt.gz
      gunzip docword.nips.txt.gz
      ./uci-to-yahoo docword.nips.txt vocab.nips.txt -o nips.txt
      head -n 1400 nips.txt > nips_train.txt
      tail -n 100 nips.txt > nips_test.txt
    

License

MIT

Reference

Please cite WarpLDA if you find it useful!

@inproceedings{chen2016warplda,
  title={WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation},
  author={Chen, Jianfei and Li, Kaiwei and Zhu, Jun and Chen, Wenguang},
  booktitle={VLDB},
  year={2016}
}
