Ideas Scrapyard

This page is kept only as a "history artifact"; it contains many ideas that are no longer relevant.

Supervised Latent Dirichlet Allocation

Note: Consider integration with existing Python sLDA

Background: Supervised Latent Dirichlet Allocation (sLDA) [1] is a Natural Language Processing method based on Latent Dirichlet Allocation (LDA) [2]. It predicts a response attached to each document, such as the number of "Likes" for a post or the number of stars in a movie review.

In vanilla LDA, we treat the topic proportions for a text document as a draw from a Dirichlet distribution. We obtain the words in the document by repeatedly choosing a topic assignment from those proportions, then drawing a word from the corresponding topic. In sLDA, we add a target variable to the LDA model, for example the number of stars assigned in a movie review or the number of "Likes" of a post.
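
The generative story above can be sketched in a few lines. This is a minimal illustration, not gensim code: the function names, the toy topics and the parameter names (alpha, eta, sigma) are assumptions, and the Gaussian response drawn from the empirical topic frequencies follows the formulation in [1] rather than anything stated on this page. It uses only the standard library to stay dependency-free; a real implementation would use NumPy.

```python
import random


def sample_dirichlet(rng, alpha):
    """Draw from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, topics, alpha, eta, sigma, n_words):
    """Sample one document and its response (hypothetical helper)."""
    n_topics = len(topics)
    vocab = range(len(topics[0]))
    # Topic proportions for the document: a draw from a Dirichlet.
    theta = sample_dirichlet(rng, alpha)
    # For each word position, choose a topic assignment from those proportions...
    z = rng.choices(range(n_topics), weights=theta, k=n_words)
    # ...then draw a word from the corresponding topic.
    words = [rng.choices(vocab, weights=topics[k])[0] for k in z]
    # sLDA addition (assumed Gaussian response, as in [1]): the target
    # depends on the empirical topic frequencies of the document.
    z_bar = [z.count(k) / n_words for k in range(n_topics)]
    mean = sum(e * f for e, f in zip(eta, z_bar))
    y = rng.gauss(mean, sigma)
    return words, z, y


rng = random.Random(0)
# Three toy topics over a six-word vocabulary (made-up numbers).
topics = [
    [0.40, 0.40, 0.10, 0.05, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.10, 0.40, 0.40],
    [0.10, 0.10, 0.30, 0.30, 0.10, 0.10],
]
words, z, y = generate_document(
    rng, topics, alpha=[1.0, 1.0, 1.0], eta=[1.0, 5.0, 3.0], sigma=0.5, n_words=20
)
```

Here eta plays the role of per-topic regression weights: a document dominated by the second topic would tend to receive a response near 5.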

While academic implementations of sLDA exist in C++ and R [3, 4], there is no Python implementation available. You will contribute a scalable implementation of sLDA to the Python data science world. A quality implementation will be widely used in the industry.

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at student-projects@rare-technologies.com.

Goals

Demonstrate understanding of topic modeling theory and practice by describing, implementing and evaluating sLDA.
Implement a streamed sLDA that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. Optionally implement a version that can use multiple cores on the same machine.
Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
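
The mini-batch requirement in the goals above can be illustrated as follows. This is only a sketch of the iteration pattern: iter_minibatches, RunningMean and labelled_stream are hypothetical names, and the running mean is a toy stand-in for a model with an incremental update step, not an sLDA update.

```python
from itertools import islice


def iter_minibatches(stream, batch_size):
    """Yield lists of at most batch_size items from any iterable.

    Only the current batch is held in memory, so memory use does not
    depend on the total length of the stream.
    """
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


class RunningMean:
    """Toy stand-in for a model that supports online updates."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, batch):
        # Incremental update: no earlier samples need to be revisited.
        for value in batch:
            self.count += 1
            self.mean += (value - self.mean) / self.count


def labelled_stream(n):
    """Generate n toy responses lazily, one at a time."""
    for i in range(n):
        yield float(i % 5)


model = RunningMean()
batch_sizes = []
for batch in iter_minibatches(labelled_stream(1003), batch_size=100):
    batch_sizes.append(len(batch))
    model.update(batch)
```

The last batch is simply shorter than the others, so the stream length does not need to be known in advance.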

Deliverables

Code: a pull request against gensim [5, 6] on GitHub [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, and summarize insights into documentation tips and examples.
Report: timings, memory use and accuracy of your sLDA implementation on the Cornell Movie Review Corpus [8] following the same methodology as in [1]. A summary of insights into parameter selection and tuning of sLDA.

Resources:

[1] McAuliffe, Jon D., and David M. Blei. "Supervised topic models." Advances in Neural Information Processing Systems. 2008.

[2] Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (January 2003). Lafferty, John, ed. "Latent Dirichlet allocation". Journal of Machine Learning Research 3 (4–5): pp. 993–1022.

[3] sLDA implementation in C++

[4] Implementation of sLDA in R

[5] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[6] Gensim github issue #121.

[7] Gensim on github

[8] Movie Review Dataset from Cornell NLP group

[9] Ramage, Daniel, et al. "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora." Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. Association for Computational Linguistics, 2009.

[10] Labelled LDA in Python

[11] Jagarlamudi, Jagadeesh, Hal Daumé III, and Raghavendra Udupa. "Incorporating lexical priors into topic models." Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2012

Distributed similarity queries

Background: gensim implements fast routines for similarity retrieval ("give me documents similar to this one, using Latent Semantic Analysis"). The routines can make use of multiple cores (using BLAS), but not multiple machines. For large datasets, it is desirable to store shards in a distributed manner, across a cluster of computers. During querying, collect and merge results from all shards.

To do: Extend the sharding already present in gensim, so that different shards can reside on different computers. Design an API to make the concept of "shards" flexible, so that similarity classes that use different implementations (see k-NN above) can plug into it easily.

The network communication must use a fast protocol (Pyro? ØMQ?), so as to not increase query latency too much.
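
The collect-and-merge step described above can be sketched as follows. Everything here is an assumption made for illustration: shards are plain in-process dictionaries, threads stand in for remote machines, query_shard and query_all are hypothetical names, and similarity is a dot product over vectors assumed to be already normalised. No network protocol is shown.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def query_shard(shard, query, topn):
    """Return the best topn (score, doc_id) pairs of a single shard."""
    scored = (
        (sum(q * d for q, d in zip(query, vector)), doc_id)
        for doc_id, vector in shard.items()
    )
    return heapq.nlargest(topn, scored)


def query_all(shards, query, topn):
    """Query every shard concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = list(pool.map(lambda s: query_shard(s, query, topn), shards))
    # Each shard returns at most topn hits, so the merge handles at most
    # len(shards) * topn candidates regardless of the index size.
    merged = heapq.nlargest(topn, (hit for hits in partial for hit in hits))
    return [(doc_id, score) for score, doc_id in merged]


shards = [
    {"a": (1.0, 0.0), "b": (0.0, 1.0)},
    {"c": (0.6, 0.8), "d": (0.8, 0.6)},
]
results = query_all(shards, query=(1.0, 0.0), topn=3)
```

Because each shard only ships its own top hits, the amount of data crossing the network per query stays small, which matters for the latency concern raised above.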

Resources: gensim mailing list.

Dynamic Topic Models improvements

Background: Dynamic topic models are generative models that can be used to analyze the evolution of (unobserved) topics of a collection of documents over time. This family of models was proposed by David Blei and John Lafferty and is an extension to Latent Dirichlet Allocation (LDA) that can handle sequential documents. DTM has already been implemented in Gensim by Google Summer of Code student @bhargavvader.

To do: Implement a distributed cluster version of DTM, and a version that can use multiple cores on the same machine. Implement DIM (Document Influence Model) in gensim and evaluate it.

Implementation must accept data in stream format (sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching into low-level (ideally, BLAS) routines as possible.

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Gensim doesn't include any support for "timed streams", or time tags, at the moment. So part of this project will be engineering a clean API for this new functionality.
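
One possible shape for such a "timed stream" is sketched below. This is purely hypothetical and not an existing gensim API: it assumes the stream yields (timestamp, document vector) pairs already sorted by time, and iter_time_slices and slice_key are made-up names.

```python
from itertools import groupby


def iter_time_slices(timed_stream, slice_key):
    """Group a time-ordered stream of (timestamp, doc) pairs into slices.

    slice_key maps a timestamp to the slice it belongs to, for example
    the year. Only one slice is materialised at a time.
    """
    keyed = groupby(timed_stream, key=lambda pair: slice_key(pair[0]))
    for key, group in keyed:
        yield key, [doc for _, doc in group]


# Toy data: timestamps are (year, month); documents are sparse vectors.
timed_docs = [
    ((2001, 3), [(0, 1), (2, 2)]),
    ((2001, 9), [(1, 1)]),
    ((2002, 1), [(0, 3)]),
    ((2003, 5), [(2, 1), (3, 1)]),
    ((2003, 7), [(1, 2)]),
]
slices = list(iter_time_slices(iter(timed_docs), slice_key=lambda ts: ts[0]))
slice_sizes = [len(docs) for _, docs in slices]
```

Passing the slicing rule as a function keeps the granularity (year, month, week) out of the model code, which is one way to keep the API small.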

Resources: Dynamic Topic Models.

Original Blei & Lafferty article (PDF).

Wang, Blei & Heckerman article on Continuous Time Dynamic Topic Model (PDF).

Wang & McCallum: "Topics over time" (PDF).

Academic implementation of DTM on David Blei's page.

Gensim implementation of DTM.

Nested Hierarchical Dirichlet Processes

Background: Paisley, Wang, Blei and Jordan recently developed a stochastic variational version of nested HDP. It reportedly performs better than HDP etc. (of course!).

To do: Implement this model (probably extending / replacing the existing online HDP implementation in gensim) and evaluate it. Optionally also implement a distributed cluster version, or a version that can use multiple cores on the same machine.

Implementation must accept data in stream format (sequence of document vectors), to allow large inputs.
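
A minimal sketch of what "stream format" means in practice is given below, assuming each document vector is a sparse list of (token id, count) pairs. StreamedCorpus, open_source and the fixed token2id mapping are illustrative names, and an in-memory io.StringIO stands in for a large file on disk.

```python
import io
from collections import Counter


class StreamedCorpus:
    """Iterable yielding one sparse bag-of-words vector per line of text.

    Documents are produced one at a time, so the full corpus never has
    to fit in memory, and the corpus can be iterated more than once.
    """

    def __init__(self, open_source, token2id):
        self.open_source = open_source  # callable returning a fresh line iterator
        self.token2id = token2id

    def __iter__(self):
        for line in self.open_source():
            counts = Counter(
                self.token2id[token]
                for token in line.lower().split()
                if token in self.token2id  # out-of-vocabulary tokens are skipped
            )
            yield sorted(counts.items())


text = "human computer interface\ncomputer system computer\ntrees graph\n"
token2id = {"human": 0, "computer": 1, "interface": 2, "system": 3, "graph": 4}
corpus = StreamedCorpus(lambda: io.StringIO(text), token2id)
vectors = list(corpus)
```

Taking a callable instead of an open file handle is what allows multiple passes over the data, which iterative training usually needs.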

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Resources: "Nested Hierarchical Dirichlet Processes" by John Paisley, Chong Wang, David M. Blei and Michael I. Jordan PDF.

Pachinko Allocation Model

Background: Li and McCallum developed a hierarchical LDA-like model for document classification. They report 2-5% accuracy improvements over an LDA model on a test corpus. (http://people.cs.umass.edu/~mccallum/papers/pam-icml06.pdf)

An implementation of this model may provide additional alternatives in choice of model, which in some cases may be helpful.

An implementation must be heavily unit tested and production-ready. It would use many of the same classes and methods as the LDA, which is a bonus in terms of a first pass at implementation.

Resources: Blei, D., Griffiths, T., Jordan, M., & Tenenbaum, J. (2004). Hierarchical topic models and the nested Chinese restaurant process. NIPS.

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Diggle, P. J., & Gratton, R. J. (1984). Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society B, 46, 193–227.

Li, W., Blei, D., & McCallum, A. (2007). Nonparametric Bayes pachinko allocation.

Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. ICML.

Minka, T. (2000). Estimating a Dirichlet distribution.

Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. UAI.

Wallach, H. M. (2006). Topic modeling: beyond bag-of-words. ICML.

GloVe word-embedding integration

Integrate, or rewrite in an optimized way, the GloVe word-embedding code by Maciej Kula (https://github.com/maciejkula/glove-python). The next step would be adding support for the Swivel algorithm.