cyLDA

Overview

This is a Cython implementation of David Blei's variational inference code for LDA, available at https://github.com/blei-lab/lda-c

Note: My primary purpose here was to learn about Cython. Many more parts could be cythonized, but the slowest parts have been dealt with, and it is already competitive in terms of speed with the original C implementation. However, there is little reason to use this implementation compared to say, Mallet. Nevertheless, feel free to borrow or extend this code as you wish.

Requirements

python3
Cython
numpy
scipy
pandas
spaCy

Data format

This implementation expects three files for input:

train.items.json: a json file containing a list of D document IDs
train.vocab.json: a json file containing a list of V words
train.npz: a sparse DxV matrix of word counts saved as a npz file in scipy sparse COO format

preprocess_data.py will produce these files if given two json files (train and test), each of which should be a dictionary of documents, with each document's text in a field called 'text'.

As an example, first run download_20ng.py to download the benchmark 20 newsgroups dataset as sets of train and test files, and then run: preprocess_data.py data/20ng/20ng_all/train.json data/20ng/20ng_all/test.json data/20ng/20ng_all

Usage

Before fitting a model, it is necessary to compile the Cython code.

First, run: python setup.py build_ext --inplace which should compile cy_lda.pyx and util.pyx.

To fit an LDA model to data, run python fit.py input_dir model_dir

For example: python fit.py data/20ng/20ng_all data/20ng/20ng_all/model

To display the top words for each topic after each iteration add the keyword --display. Use -h for more options.

To evalute the perplexity on held out data (approximated using the variational lower bound), run python evalute.py data_dir model_dir

For example, python fit.py data/20ng/20ng_all data/20ng/20ng_all/model

References

Most of the variable names in this code match those used in the original LDA paper:

David M. Blei , Andrew Y. Ng , Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research. (2002)

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
common.py		common.py
cy_lda.pyx		cy_lda.pyx
download_20ng.py		download_20ng.py
evaluate.py		evaluate.py
file_handling.py		file_handling.py
fit.py		fit.py
mallet_stopwords.txt		mallet_stopwords.txt
preprocess_data.py		preprocess_data.py
setup.py		setup.py
util.pxd		util.pxd
util.pyx		util.pyx

License

dallascard/cyLDA

Folders and files

Latest commit

History

Repository files navigation

cyLDA

Overview

Requirements

Data format

Usage

References

About

Resources

License

Stars

Watchers

Forks

Languages