Skip to content

blei-lab/lda-bump-cpp

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

lda-bump-cpp

Latent Dirichlet allocation (LDA) with bumping variational inference.

Implements three versions of LDA

[1] Alp Kucukelbir and David M Blei. Population Empirical Bayes. Uncertainty in Artificial Intelligence (UAI) 2015.

Requirements

lda-bump-cpp is written in C++11. It requires a modern compiler. It also depends on Eigen 3, Boost, and CMake. It uses docopt (provided).

Refer to platform-specific instructions for installation. (I recommend homebrew on Mac OS X.)

Instructions to Build and Run

The driver (main) program runs all three algorithms.

cmake .
make driver

A toy dataset of arXiv abstracts is provided.

Example

./driver --topics=5
         --vocabulary=data/arxiv-vocab.dat
         --datatr=data/arxiv-train-5k.dat
         --datatest=data/arxiv-test-1k.dat

For more help, run

./driver -h

lda-bump-cpp LDA with bumping variational inference.

Usage:
  driver --topics=NUM_TOPICS --vocabulary=VOCAB
         --datatr=TRAIN --datatest=TEST
         [--bootstrap=NUM_BOOTSTRAP] [--minibatch=MINIBATCH]
         [--alpha=ALPHA] [--eta=ETA]
         [--tau0=TAU0] [--kappa=KAPPA]
         [--fixed_step_size=STEPSIZE]
         [--max_itr=MAX_ITR]
         [--compute_elbo]
  driver (-h | --help)
  driver --version

Options:
  --topics=NUM_TOPICS        Number of topics for LDA
  --vocabulary=VOCAB         Vocabulary, one word per line
  --datatr=TRAIN             Training data in LDA-C format
  --datatest=TEST            Testing  data in LDA-C format
  --bootstrap=NUM_BOOTSTRAP  Number of bootstraps for bumping [default: 10]
  --minibatch=MINIBATCH      Number of docs in minibatch [default: 500]
  --alpha=ALPHA              Hyperparameter on topic proportions [default: 1/K]
  --eta=ETA                  Hyperparameter on topics [default: 100/V]
  --tau0=TAU0                Learning rate delay [default: 10.0]
  --kappa=KAPPA              Learning rate forgetting rate [default: 0.75]
  --fixed_step_size=STEPSIZE Fixed stepsize instead RobMonro [default: 0.0]
  --max_itr=MAX_ITR          Max number of iterations for LDA [default: 100]
  --compute_elbo             Boolean flag for computing ELBO
  -h --help                  Show this screen
  --version                  Show version

Vocabulary Data Format

A text file with each word ([term_1] through [term_N]) on a separate line.

Corpus Data Format

A text file where each line is of the form (the LDA-C format):

[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]

where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document.

Visualizing the Output

A python script visualizes the topics. (Modified from https://github.com/Blei-Lab/onlineldavb)

./printtopics.py data/arxiv-vocab.dat
                 results/Thu_Nov_27_10-45-09_2014/lambda_coord_ascent.dat

./printtopics.py data/arxiv-vocab.dat
                 results/Thu_Nov_27_10-45-09_2014/lambda_svi.dat

./printtopics.py data/arxiv-vocab.dat
                 results/Thu_Nov_27_10-45-09_2014/lambda_bumping.dat

About

Latent Dirichlet allocation (LDA) with bumping variational inference.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • C++ 96.7%
  • Python 2.8%
  • CMake 0.5%