
GSOC 2017 project ideas


A list of ideas for new functionality and projects in Gensim for Google Summer of Code 2017. Gensim, "topic modelling for humans", is a scientific Python package for efficient, large-scale topic modelling.

The ideas are listed in the order of descending importance to the project.

Potential mentors: Radim Rehurek, Stephen Kwan-Yuet Ho and Lev Konstantinovskiy. More mentors coming soon.

If you'd like to work on any of the topics below, or have your own ideas, get in touch at student-projects@rare-technologies.com.

Visualizations

Training and Topic Visualization

Difficulty: Intermediate.

Background: Gensim's motto is "topic modelling for humans". Humans like visualizations, and topic modelling and word embeddings lend themselves very naturally to multi-dimensional diagrams, pie charts and document graphs.

To do:

Part 1. Feed stats to TensorBoard during training.

Google's TensorFlow has a very nice web UI called TensorBoard. Show Gensim's ongoing training statistics in it: perplexity and coherence for LDA, loss for word2vec/doc2vec.
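To make Part 1 concrete, here is a minimal sketch (assuming TensorFlow 1.x's summary writer and a toy corpus; this is not an existing Gensim feature) of pushing LDA perplexity into TensorBoard after each training pass:

```python
# Minimal sketch: log LDA perplexity per pass so TensorBoard can plot it.
# Assumes TensorFlow 1.x (tf.summary.FileWriter) and a toy in-memory corpus.
import tensorflow as tf
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["human", "computer", "interaction"],
         ["graph", "trees", "minors"],
         ["user", "interface", "system"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

writer = tf.summary.FileWriter("./logs/lda")
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=1)
for epoch in range(10):
    lda.update(corpus)                                  # one more pass over the data
    perplexity = 2 ** (-lda.log_perplexity(corpus))     # gensim reports a per-word bound
    summary = tf.Summary(value=[tf.Summary.Value(tag="lda/perplexity",
                                                 simple_value=perplexity)])
    writer.add_summary(summary, global_step=epoch)
writer.close()
```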

Part 2. Drill-down view of topics/document clusters.

LDA produces topics; doc2vec and word2vec give document clusters. Visualization is a very efficient way to explore them.

Example plots: display how topics are composed of words, how documents are composed of topics, and how the corpus is composed of topics. Make the visualization interactive -- go from words to topics and explore the model.

Technologies to use: d3.js or Continuum's Bokeh.

Resources: Survey of LDA visualisations (in Russian) by R. M. Aysina, Jason Chuang's Termite package, Allison Chaney's TMVE (Topic Model Visualization Engine), and pyLDAvis, which has been ported from the R package of the same name.

PCA visualization showing linear substructure in GloVe. Here is a link to how it is made.

Text visualization collection.
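As a starting point for the interactive drill-down view, pyLDAvis (listed above) already renders a browser-based topic/word explorer; a minimal usage sketch on a toy model:

```python
# Minimal pyLDAvis sketch: train a toy LdaModel and render an interactive
# topic/word exploration page that can be opened in a browser.
import pyLDAvis
import pyLDAvis.gensim
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["human", "computer", "interaction"],
         ["graph", "trees", "minors"],
         ["user", "interface", "system"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")   # hover words, select topics interactively
```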

Performance

Performance improvements in Gensim and FastText

Difficulty: Intermediate.

Background: Facebook AI Research has released a new word embedding model, FastText. It performs better than the older, popular word2vec on syntactic tasks and supports out-of-vocabulary words. The initial implementation was in C++; later @giacbrd produced a Python implementation building on top of Gensim code.

Python performance can be greatly improved by re-writing frequently used and time-consuming parts in Cython, which compiles to C. For example, that is how the Gensim word2vec implementation in Python became faster than the original C tool.
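A quick way to check whether Gensim's compiled Cython routines are actually in use (the pure-Python fallback is much slower):

```python
# Check whether gensim's Cython-compiled word2vec routines loaded;
# FAST_VERSION is -1 when only the slow pure-Python fallback is available.
from gensim.models.word2vec import FAST_VERSION

print("word2vec FAST_VERSION:", FAST_VERSION)
```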

To do:

Refactoring, performance improvement and evaluation of FastText in Python (aka Labeled word2vec). Distributed, multicore and GPU versions. Some parts will need to be re-written in Cython and TensorFlow.

Cythonize Gensim's phrases module and other easy bottlenecks. Also Cythonize the FastText-in-Python code to match FastText's C++ performance.
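For the phrases work, profiling is a natural first step; a small sketch using only the standard-library profiler to locate hotspots worth Cythonizing:

```python
# Profile gensim's Phrases (collocation detection) on a synthetic corpus to find
# the hotspots worth Cythonizing; uses only the standard-library profiler.
import cProfile
from gensim.models import Phrases

sentences = [["new", "york", "city", "is", "big"],
             ["new", "york", "times", "is", "a", "newspaper"]] * 10000
cProfile.run("Phrases(sentences, min_count=5, threshold=10.0)", sort="cumtime")
```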

Resources:

Labeled Word2vec PR by @giacbrd in gensim

Official FastText discussion group on Facebook

Integrations

Gensim integration with scikit-learn and Keras. Joint project with ShortText package.

Difficulty: Intermediate.

Background: Gensim is a package for unsupervised learning. That means that in order to apply it to a business problem, its output must go to a supervised classifier. The most popular supervised learning packages are scikit-learn and, for neural networks, Keras. The ShortText package already provides some integration, but it is a three-month-old project with a lot of room for contributions. This will be a joint project with the ShortText author, Stephen Kwan-Yuet Ho.

To do: Create a scikit-learn wrapper around all Gensim models to allow their use in a scikit-learn pipeline, similar to the existing wrapper for LdaModel.
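A minimal sketch of what such a wrapper could look like (illustrative only, not the existing Gensim wrapper's API): a scikit-learn transformer that fits an LdaModel and emits dense document-topic vectors for downstream classifiers:

```python
# Illustrative sketch only -- not gensim's existing wrapper. A scikit-learn style
# transformer around gensim's LdaModel, usable as a step in a sklearn Pipeline.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from gensim.corpora import Dictionary
from gensim.models import LdaModel


class LdaTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, num_topics=10):
        self.num_topics = num_topics

    def fit(self, X, y=None):
        # X is an iterable of tokenized documents (lists of strings)
        self.dictionary_ = Dictionary(X)
        corpus = [self.dictionary_.doc2bow(doc) for doc in X]
        self.lda_ = LdaModel(corpus, id2word=self.dictionary_,
                             num_topics=self.num_topics)
        return self

    def transform(self, X):
        # return a dense (n_documents, num_topics) matrix of topic proportions
        out = np.zeros((len(X), self.num_topics))
        for i, doc in enumerate(X):
            bow = self.dictionary_.doc2bow(doc)
            for topic_id, prob in self.lda_.get_document_topics(bow, minimum_probability=0.0):
                out[i, topic_id] = prob
        return out


docs = [["good", "film", "great", "acting"], ["boring", "plot", "bad", "film"]]
topic_vectors = LdaTransformer(num_topics=2).fit_transform(docs)
```

Such a transformer can then sit in front of, for example, a LogisticRegression step inside a scikit-learn Pipeline.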

Also see the TODO page of ShortText for the tasks.

Resources: ShortText, Keras, scikit-learn, Gensim for movie plot classification tutorial.

Distributed Word2vec on CPUs on multiple machines

Difficulty: Intermediate.

Background: Gensim contains distributed implementations of several algorithms. The implementations use Pyro4 for network communication and are fairly low-level.

To do: Re-implement the Gensim word2vec algorithm in TensorFlow or Spark in order to enable distributed computation. During training, log the model's statistics in a way that can be displayed in the TensorBoard UI.
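For the Spark route, PySpark's built-in Word2Vec is a useful baseline for API design and scaling behaviour; a minimal sketch, assuming a local or cluster Spark session is available:

```python
# Reference sketch using PySpark's built-in Word2Vec (not gensim); useful as a
# baseline when designing a distributed gensim re-implementation.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("w2v-baseline").getOrCreate()
df = spark.createDataFrame(
    [(["the", "quick", "brown", "fox"],),
     (["jumps", "over", "the", "lazy", "dog"],)],
    ["tokens"])

w2v = Word2Vec(vectorSize=100, minCount=1, inputCol="tokens", outputCol="vectors")
model = w2v.fit(df)
model.getVectors().show()   # one learned vector per vocabulary word
spark.stop()
```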

Resources: TensorFlow word2vec implementation, Gensim word2vec implementation

Word2vec on single/multiple GPUs on the same machine

Difficulty: Hard.

Background: Many frameworks have tried to make word2vec run faster by using the GPU, but most of the time it runs slower on GPU than on CPU, due to large memory requirements and not all operations being placed on the GPU. There are implementations in DL4J, TensorFlow and Keras, but only the BIDMach implementation runs faster than on CPU.

To do: Use the TensorFlow framework to utilise single/multiple GPUs on a single machine.

Make sure that all operations are placed on the GPU to avoid expensive I/O between CPU and GPU (see the placement-logging sketch below). Make sure that the batch size is optimal, so as to use all of the GPU memory and minimize the number of times GPU memory is written to.

If TensorFlow performance is not satisfactory, then use a lower-level GPU framework.

Optionally, make it support multiple nodes, i.e. distributed training.
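One concrete way to verify op placement (a hedged sketch, assuming TensorFlow 1.x): enable device placement logging and confirm that the heavy ops actually land on the GPU:

```python
# TensorFlow 1.x sketch: log device placement so ops silently falling back to the
# CPU (a common cause of slow GPU word2vec) are easy to spot.
import tensorflow as tf

with tf.device("/gpu:0"):
    a = tf.random_uniform([1024, 128])
    b = tf.random_uniform([128, 1024])
    c = tf.matmul(a, b)                 # stands in for the heavy training ops

config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(c)                         # the placement of every op is printed
```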

Resources:

Paper benchmarking BIDMach on GPU: http://people.eecs.berkeley.edu/~jfc/papers/15/PID3922265.pdf. Blog post showing that word2vec on Keras is slower on GPU than on CPU.

New algorithms

Online NNMF

Difficulty: Hard.

Background: Non-negative matrix factorization (NMF) is an algorithm similar to Latent Semantic Analysis/Latent Dirichlet Allocation. It belongs to the family of matrix factorization methods and can be phrased as an online learning algorithm.
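For reference, the classic batch multiplicative-update rules of Lee & Seung are only a few lines of NumPy; the project's job is to turn this into an online, streamed, multicore version (a minimal sketch of the batch variant, not the deliverable):

```python
# Batch NMF via Lee & Seung multiplicative updates: V (documents x terms) ~= W H.
# A minimal reference sketch; the GSoC deliverable would be an online, streamed version.
import numpy as np


def nmf_step(V, W, H, eps=1e-9):
    # one multiplicative update of both factors; eps avoids division by zero
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


rng = np.random.RandomState(0)
V = rng.rand(100, 50)                       # toy non-negative data matrix
W, H = rng.rand(100, 10), rng.rand(10, 50)  # rank-10 factors
for _ in range(200):
    W, H = nmf_step(V, W, H)
print("reconstruction error:", np.linalg.norm(V - W @ H))
```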

To do: Based on the existing online, parallel implementation in libmf, implement NNMF in Python/Cython in Gensim and evaluate it. It must support multiple cores on the same machine. Optionally, also implement a distributed cluster version using TensorFlow.

The implementation must accept data in streamed format (a sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching as possible into low-level (ideally, BLAS) routines.

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Evaluation can use the Lee corpus of human similarity judgements included in gensim, or evaluate in some other way.

Resources: Online NMF. Gensim GitHub issue #132. NMF in scikit-learn (not online, not out-of-core). libmf paper.

Supervised LDA

Difficulty: Easy.

Background: Many users have requested "supervised Latent Dirichlet Allocation" (sLDA) in Gensim.

To do: Implement sLDA in a scalable manner in Gensim and evaluate it. Also implement a distributed cluster version on TensorFlow, and a version that can use multiple cores on the same machine.

The implementation must accept data in streamed format (a sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching as possible into low-level (ideally, BLAS) routines.

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Evaluation can use the Lee corpus of human similarity judgements included in gensim, or evaluate in some other way.

Resources: Wang, Blei & Fei-Fei's sLDA paper.

Ramage et al.'s Labeled LDA.

Jagarlamudi's Seeded LDA.

Implementation of sLDA on David Blei's page.

Implementation of sLDA in Python with Cython.

Gensim github issue #121.

Word2Vec/Doc2Vec: Implement 'Translation Matrix' of 'Exploiting similarities among languages for machine translation'

Difficulty: Easy.

Background: Section 4 of Mikolov, Le, & Sutskever's paper on word2vec for machine translation describes a way to map words between two separate vector models, as in the example of word vectors induced for two different natural languages.

Section 2.2 of 'Skip-Thought Vectors' uses a similar technique to bootstrap a larger vocabulary in their model, from a pre-existing larger word2vec model.

The same technique could be valuable for adapting to drifting word representations, when training over large datasets over long timeframes. Specifically: as new information introduces extra words, and newer examples of word usage, older words may (and probably should) relocate for the model to continue to perform optimally on the training task, on more-recent text. (In a sense, words should rearrange to 'make room' for the new words and examples.) As these changes accumulate, older representations (or cached byproducts) may not be directly comparable to the latest representations – unless a translation-matrix-like adjustment is made. (The specifics of the translation may also indicate areas of interest, where usage or meanings are changing rapidly.)

Implementation work by Georgiana Dinu, linked from the word2vec homepage, may be relevant if license-compatible. (Update: In correspondence, Dinu has given approval to re-use that code in gensim, if it's helpful.)

Implementation with normal equations: in a paper by Andrey Kutuzov, this was successfully used with Gensim to 'translate' between Ukrainian and Russian. Code is available and can easily be integrated into Gensim.

Jason of the jxieeducation.com blog has also run an experiment suggesting the usefulness of this approach, in this case using sklearn's LinearRegression to learn the projection.

The Procrustes matrix alignment example code by Ryan Heuser, based on HistWords by William Hamilton, does something similar and may be of direct use, or serve as a model.
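To make the core technique concrete, here is a minimal sketch of learning a translation matrix by least squares (the normal-equations approach mentioned above), assuming two pre-trained sets of word vectors and a small seed dictionary; the file names and word pairs below are placeholders:

```python
# Minimal translation-matrix sketch: learn a linear map W from a source embedding
# space to a target space by least squares, then translate via nearest neighbours.
# The vector files and seed pairs are placeholders, not real resources.
import numpy as np
from gensim.models import KeyedVectors

src = KeyedVectors.load_word2vec_format("source_vectors.txt")
tgt = KeyedVectors.load_word2vec_format("target_vectors.txt")
seed_pairs = [("one", "uno"), ("two", "dos"), ("dog", "perro")]  # toy seed lexicon

X = np.array([src[a] for a, _ in seed_pairs])   # source vectors of the seed words
Y = np.array([tgt[b] for _, b in seed_pairs])   # their target-language counterparts

# solve min_W ||X W - Y||^2 (equivalently, the normal equations)
W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

def translate(word, topn=5):
    # project the source word into target space and return its nearest neighbours
    return tgt.similar_by_vector(src[word].dot(W), topn=topn)
```

The Procrustes alignment mentioned above additionally constrains the map to be orthogonal, which tends to make the alignment more stable.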