Roadmap (2019)

Radim Řehůřek edited this page Mar 15, 2019 · 5 revisions

(copy-pasted from Roadmap 2018, same priorities – not much progress in 2018)

Intro

At the start of 2018, Gensim is big and diverse. It has become one of the main libraries for NLP and text modelling in Python. We already have a large number of useful models, so we now focus on improving their quality (documentation and code) and performance (multi-core and Cythonized). Gensim targets real tasks with real, large-scale data, not "toy demonstrations".

Gensim objectives in 2018

  1. Clean up the house

    • Remove / fix broken packages (like the summarization subpackage) and obsolete stuff (like examples/dmlcz)
    • Create a new experimental subpackage for interesting but unpolished stuff. Simple rule: if an algorithm can't be applied to large corpora (for example Wikipedia or something of similar size), or if it is not stable (many suspicious bugs, poor code quality), move it to experimental or remove it.
    • Documentation project
      • [WIP] Docstrings for all stuff in gensim
      • New "beginner tutorial chain" (persistent on site and in repository)
      • User-guides for all stuff (sphinx-gallery)
      • New documentation website
      • [WIP] New structure of documentation
  2. Neural Networks for NLP

    The field of neural networks is developing very rapidly (including for NLP). As a downside, there's a lot of hype, publish-or-perish bullshit and unreproducible results. We would like to identify the architectures and models that are actually useful, and implement them in Gensim. Gensim stands for "generate similar", so we're interested in anything that allows us to assess the similarity of two documents (regardless of the metric used, and regardless of whether it uses vectors as an intermediate text representation or not). See also similarity learning.

    This is a more open-ended goal, and having a clear, quantifiable industry use-case will be critical. Examples include the Quora duplicate dataset, QA datasets etc.

    Conditions:

    • Start from a concrete field of application: a motivating example along with an evaluation process.
    • Preference for existing NN libraries such as Keras or PyTorch (don't implement from scratch)
    • Simple Gensim-style API (input streaming support, model serialization, etc)
    • Clear tutorial on why and how to use the model, in addition to technical docstrings.
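
The "Gensim-style API" condition above (streamed input, model serialization) can be illustrated with a minimal sketch. The classes `StreamedCorpus` and `TokenCounter` are hypothetical names invented for this illustration and are not part of Gensim; the toy "model" only counts tokens, and JSON is an assumed serialization format chosen to keep the example dependency-free.

```python
import json
from collections import Counter


class StreamedCorpus:
    """Hypothetical corpus: yields one tokenized document per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass, so the corpus can be iterated
        # repeatedly without holding all documents in memory at once.
        with open(self.path, encoding="utf8") as handle:
            for line in handle:
                yield line.lower().split()


class TokenCounter:
    """Hypothetical toy model: counts tokens from any iterable of documents."""

    def __init__(self, corpus=None):
        self.counts = Counter()
        if corpus is not None:
            self.update(corpus)

    def update(self, corpus):
        # Consume documents one at a time; the corpus may be a list,
        # a generator, or a disk-backed iterable like StreamedCorpus.
        for document in corpus:
            self.counts.update(document)

    def save(self, path):
        # Persist only the learned state, not the training data.
        with open(path, "w", encoding="utf8") as handle:
            json.dump(self.counts, handle)

    @classmethod
    def load(cls, path):
        model = cls()
        with open(path, encoding="utf8") as handle:
            model.counts = Counter(json.load(handle))
        return model
```

Usage would look like `model = TokenCounter(StreamedCorpus("corpus.txt"))`, followed by `model.save("model.json")` and later `TokenCounter.load("model.json")`; both file names are placeholders. The point of the design is that training never requires the whole corpus in RAM, and a trained model can be stored and restored with a single call.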
  3. Discoverability and communication

    Better communication with our users -- share what is happening and why, on a regular and frequent basis. Right now, the development process is opaque: many cool features and incremental improvements are hidden in obscure code comments and notebooks, and are never heard about by the people who'd actually use them or could give us feedback. For example, Gensim includes a cool WMD (Word Mover's Distance) implementation, but it doesn't even come up in Google results for "wmd in python".

    We want to encourage and engage our community more, both in terms of direct contributions (code, documentation, reviews, testing) and spreading the word.

    Possible tools:

    • Twitter @gensim_py
    • Gensim mailing list
    • Our RARE newsletter
    • a clear description of the contribution process (in explicit steps): what we need, what our standards are, where to go to learn more / ask for help
    • other?