lda-bump-cpp

Latent Dirichlet allocation (LDA) with bumping variational inference.

Implements three versions of LDA:

1. Coordinate ascent (mean-field)
2. Stochastic variational inference (https://github.com/Blei-Lab/onlineldavb)
3. Bumping variational inference [1]

[1] Alp Kucukelbir and David M Blei. Population Empirical Bayes. Uncertainty in Artificial Intelligence (UAI) 2015.
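The idea behind bumping can be sketched generically: fit the model to several bootstrap resamples of the training documents, then keep the fit that scores best on the original data. The sketch below is not taken from this repository. `bootstrap_indices`, `bump`, and the `fit`/`score` callables are hypothetical names, and the way candidates are scored here (higher is better, evaluated by a user-supplied function) is an assumption; see [1] for the actual procedure.

```cpp
#include <cstddef>
#include <random>
#include <stdexcept>
#include <vector>

// Draw n document indices uniformly with replacement from {0, ..., n-1}.
std::vector<std::size_t> bootstrap_indices(std::size_t n, std::mt19937& rng) {
    if (n == 0) throw std::invalid_argument("n must be positive");
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    std::vector<std::size_t> idx(n);
    for (auto& i : idx) i = pick(rng);
    return idx;
}

// Generic bumping loop (hypothetical helper, not the repository's API).
// fit(indices) returns a fitted model; score(model) evaluates it on the
// original data. The fit to the original data is kept as the first candidate.
template <typename Fit, typename Score>
auto bump(std::size_t n, int num_bootstrap, std::mt19937& rng, Fit fit, Score score)
    -> decltype(fit(std::vector<std::size_t>{})) {
    if (n == 0) throw std::invalid_argument("n must be positive");
    std::vector<std::size_t> all(n);
    for (std::size_t i = 0; i < n; ++i) all[i] = i;

    auto best = fit(all);
    double best_score = score(best);
    for (int b = 0; b < num_bootstrap; ++b) {
        auto candidate = fit(bootstrap_indices(n, rng));
        const double s = score(candidate);
        if (s > best_score) {  // keep the candidate that scores best
            best = candidate;
            best_score = s;
        }
    }
    return best;
}
```

In this sketch `num_bootstrap` plays the role of the `--bootstrap` option of the driver; the fitting and scoring steps are left to the caller.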

Requirements

lda-bump-cpp is written in C++11. It requires a modern compiler. It also depends on Eigen 3, Boost, and CMake. It uses docopt (provided).

Refer to the platform-specific installation instructions for each dependency. (I recommend Homebrew on Mac OS X.)

Instructions to Build and Run

The driver (main) program runs all three algorithms.

cmake .
make driver

A toy dataset of arXiv abstracts is provided.

Example

./driver --topics=5 \
         --vocabulary=data/arxiv-vocab.dat \
         --datatr=data/arxiv-train-5k.dat \
         --datatest=data/arxiv-test-1k.dat

For more help, run

./driver -h

lda-bump-cpp LDA with bumping variational inference.

Usage:
  driver --topics=NUM_TOPICS --vocabulary=VOCAB
         --datatr=TRAIN --datatest=TEST
         [--bootstrap=NUM_BOOTSTRAP] [--minibatch=MINIBATCH]
         [--alpha=ALPHA] [--eta=ETA]
         [--tau0=TAU0] [--kappa=KAPPA]
         [--fixed_step_size=STEPSIZE]
         [--max_itr=MAX_ITR]
         [--compute_elbo]
  driver (-h | --help)
  driver --version

Options:
  --topics=NUM_TOPICS        Number of topics for LDA
  --vocabulary=VOCAB         Vocabulary, one word per line
  --datatr=TRAIN             Training data in LDA-C format
  --datatest=TEST            Testing  data in LDA-C format
  --bootstrap=NUM_BOOTSTRAP  Number of bootstraps for bumping [default: 10]
  --minibatch=MINIBATCH      Number of docs in minibatch [default: 500]
  --alpha=ALPHA              Hyperparameter on topic proportions [default: 1/K]
  --eta=ETA                  Hyperparameter on topics [default: 100/V]
  --tau0=TAU0                Learning rate delay [default: 10.0]
  --kappa=KAPPA              Learning rate forgetting rate [default: 0.75]
  --fixed_step_size=STEPSIZE Fixed stepsize instead RobMonro [default: 0.0]
  --max_itr=MAX_ITR          Max number of iterations for LDA [default: 100]
  --compute_elbo             Boolean flag for computing ELBO
  -h --help                  Show this screen
  --version                  Show version
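The `--tau0` and `--kappa` options match the names used for the Robbins-Monro step-size schedule in online LDA (as in onlineldavb), where the step size at iteration t is rho_t = (tau0 + t)^(-kappa). The sketch below shows that schedule with the defaults listed above. It is an illustration, not the driver's code: the function name `step_size` is hypothetical, and treating `--fixed_step_size=0.0` as "use the schedule" is an assumption based on the option description.

```cpp
#include <cmath>

// Step size for iteration t (t = 0, 1, 2, ...).
// Assumption: a positive fixed_step_size overrides the schedule,
// and 0.0 (the default) means "use Robbins-Monro".
double step_size(int t, double tau0 = 10.0, double kappa = 0.75,
                 double fixed_step_size = 0.0) {
    if (fixed_step_size > 0.0) {
        return fixed_step_size;
    }
    // rho_t = (tau0 + t)^(-kappa): tau0 delays the decay, kappa sets its rate.
    return std::pow(tau0 + t, -kappa);
}
```

A larger `tau0` down-weights early iterations, and a larger `kappa` makes the step size shrink faster.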

Vocabulary Data Format

A text file with each word ([term_1] through [term_N]) on a separate line.

Corpus Data Format

A text file where each line is of the form (the LDA-C format):

[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]

where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document.
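As an illustration of the format, here is a minimal sketch of a parser for one corpus line. It assumes each term is an integer index into the vocabulary file, as is conventional for LDA-C data; the `Document` type and the function names are hypothetical and not part of this repository.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// One document as (term index, count) pairs. Hypothetical type.
using Document = std::vector<std::pair<int, int>>;

// Parse one line of the form "[M] [term_1]:[count] ... [term_N]:[count]".
Document parse_ldac_line(const std::string& line) {
    std::istringstream in(line);
    int num_unique = 0;
    if (!(in >> num_unique) || num_unique < 0) {
        throw std::runtime_error("missing or invalid unique-term count");
    }
    Document doc;
    doc.reserve(num_unique);
    std::string token;
    while (in >> token) {
        const std::size_t colon = token.find(':');
        if (colon == std::string::npos) {
            throw std::runtime_error("expected term:count, got '" + token + "'");
        }
        const int term = std::stoi(token.substr(0, colon));
        const int count = std::stoi(token.substr(colon + 1));
        doc.emplace_back(term, count);
    }
    // The leading [M] must agree with the number of term:count entries.
    if (static_cast<int>(doc.size()) != num_unique) {
        throw std::runtime_error("declared term count does not match entries");
    }
    return doc;
}

// Total number of word tokens in the document (sum of all counts).
int total_tokens(const Document& doc) {
    int total = 0;
    for (const auto& entry : doc) total += entry.second;
    return total;
}
```

For example, the line `3 0:2 5:1 12:4` describes a document with three unique terms and seven tokens in total.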

Visualizing the Output

A Python script, printtopics.py, visualizes the topics learned by each algorithm. (Modified from https://github.com/Blei-Lab/onlineldavb)

./printtopics.py data/arxiv-vocab.dat \
                 results/Thu_Nov_27_10-45-09_2014/lambda_coord_ascent.dat

./printtopics.py data/arxiv-vocab.dat \
                 results/Thu_Nov_27_10-45-09_2014/lambda_svi.dat

./printtopics.py data/arxiv-vocab.dat \
                 results/Thu_Nov_27_10-45-09_2014/lambda_bumping.dat
