Developer page

Michael Penkov edited this page May 14, 2023 · 110 revisions

This document contains guidelines and advice for Gensim contributors. If you're interested in contributing some functionality, please read this first.

Is my contribution a good fit?

Gensim's mission is to provide NLP practitioners (SW engineers, data scientists) with unsupervised learning of document representation and document similarity, especially on very large datasets.

This scope includes topic modelling, vector embeddings, fast document retrieval etc.

There are many worthwhile ideas in NLP not within Gensim's purview: supervised learning, unproven academic algorithms (not robust), slow algorithms (not practical).

If unsure, ask on the Gensim mailing list first; we can discuss your idea there.

How do I contribute?

  • Bug fixes: leave a comment in the corresponding Gensim issue, letting us (and others) know you're working on it. If there's no issue on Github for the bug yet, open one first. Please respect the issue template you'll see there: include all the necessary info, versions, HW, etc.
  • New functionality (new algorithm, module): Get feedback early on the Gensim mailing list, to avoid putting a lot of effort into something that the core maintainers will ultimately reject. What matters to us the most is clear motivation ("Why is this needed? Who needs it?") and a clean, maintainable implementation ("How"; we can help you with that).
  • Implement your contribution using the style guide below, including documentation and testing.
  • Open a Github PR and accept the terms in the Legal section below.

Please be patient. Gensim is a project run by volunteers: we all have our jobs, and Gensim is a hobby. We'll get to your contribution faster if it's articulate and well-motivated.

Code style

Gensim automatically runs CI tests, including for code style (tox -e flake8,flake8-docs):

  1. Strict PEP8, except we allow line length of up to 120 characters instead of PEP8's default of 79, where this improves readability.

    Use 4 spaces for indents, no tabs.

    Use only Python language constructs from officially supported Python versions.

  2. Hanging indent everywhere:

    my_list = [
        1, 2, 3, 4, 5,
        "some", "other", "elements",  # mind the trailing comma!
    ]

    No vertical indent please!

    my_list = [1, 2, 3, 4, 5,                # NO!
               "some", "other", "elements"]  # DISGUSTING!
              
  3. Hanging indent also in overlong function calls and function definitions:

    result = my_long_function(
        length, width, depth,
        name="hello", age=12.3,  # mind the trailing comma!
    )
    
    def my_super_function(
            person_name, another_parameter, third_parameter,
            default=None, last_param=frozenset([1, 2, 3]),  # mind the trailing comma!
        ):
        """Calculate the gizmo for all things.
    
        Parameters
        ----------
        first_parameter : int
            Integer to be factorized; must be >= 1.

        Returns
        -------
        float
            The gizmo of all things.

        """
        return 0.0

  4. Use trailing commas in element enumerations (lists, dicts, function parameters, etc).

    my_dict = {
        'a': 1,
        'b': 2,
        'c': 3,  # mind the trailing comma!
    }

    This and the previous two guidelines improve readability of code diffs, and minimize errors when extending the code in the future.

  5. Use full sentences in docstrings and code comments, including proper punctuation:

    # This is a comment. It's properly capitalized and explains the "WHY?" for a block of
    # code, its invariants, design considerations and alternatives.
    # Feel free to include links to issues/resources where relevant.
    
  6. Docstrings follow the NumPy doc style. Treat them as an overview of the functionality, to anchor a class or method conceptually, and to document its parameters. They are not the place to describe in detail how things work internally.

    Any non-obvious tricks and coding patterns that may confuse a literate Python programmer need a source code comment. Explicit is better than implicit.

  7. Mark gotchas and unresolved problems in code comments, using:

    • FIXME for sections that must be resolved before a release.
      • Internal notes for critical bugs, unfinished stubs, temporary debugging code, logging. Anything internal that shouldn't see the light of day.
      • We grep for FIXMEs before a release, to verify none are left.
    • TODO for things that can be finished later, incrementally. "Wish list" for a technical improvement, not blocking a release, yet not fit for a standalone Github issue.
    • XXX, Note: note to other programmers, or (often) your future self.

    If in doubt, make a new Github ticket and describe it sufficiently well for another person to pick up the problem and deal with it.
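
The claim in guideline 4, that hanging indents with trailing commas keep code diffs small, can be checked with the standard library's difflib. The following is only a sketch: the helper `changed_lines` and the sample strings are made up for illustration and are not part of Gensim.

```python
import difflib


def changed_lines(before, after):
    """Return only the added and removed lines of a unified diff."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(), lineterm="", n=0)
    return [
        line for line in diff
        # Skip the "---"/"+++" file headers; "@@" hunk markers never match "+" or "-".
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Hanging indent with a trailing comma: appending an element adds a single line.
hanging_before = 'my_list = [\n    1, 2, 3,\n    "some", "other",\n]\n'
hanging_after = 'my_list = [\n    1, 2, 3,\n    "some", "other",\n    "elements",\n]\n'

# No trailing comma: the previous last line has to be edited as well.
bare_before = 'my_list = [\n    1, 2, 3,\n    "some", "other"\n]\n'
bare_after = 'my_list = [\n    1, 2, 3,\n    "some", "other",\n    "elements"\n]\n'

print(len(changed_lines(hanging_before, hanging_after)))  # prints 1
print(len(changed_lines(bare_before, bare_after)))  # prints 3
```

With the trailing comma, a reviewer sees exactly the one line that was added. Without it, the diff also touches a line whose content did not really change.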

Documentation

The public API documentation is automatically generated from docstrings, via Sphinx:

pip install ".[docs]"
tox -e compile,docs  # generate new docs version, will be available in docs/src/_build/html
cd docs/src
make upload  # upload new docs to site (need ssh permissions)

Make sure your documentation changes render correctly. If you added a new module, add a new corresponding .rst file under docs/src.
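
Since the API documentation is built from docstrings, here is a minimal sketch of a function documented in the NumPy doc style, with a source comment for the one non-obvious step. The function `sparse_dot` is hypothetical and written only for this illustration; it is not an existing Gensim function.

```python
def sparse_dot(vec1, vec2):
    """Compute the dot product of two sparse vectors.

    Parameters
    ----------
    vec1 : list of (int, float)
        First vector, as (feature_id, weight) pairs.
    vec2 : list of (int, float)
        Second vector, in the same format.

    Returns
    -------
    float
        Sum of the weight products over the feature ids present in both vectors.

    """
    # Index the shorter vector, so that the temporary dictionary stays as small as possible.
    if len(vec2) < len(vec1):
        vec1, vec2 = vec2, vec1
    weights = dict(vec1)
    return sum(
        (weight * weights.get(feature_id, 0.0) for feature_id, weight in vec2),
        0.0,
    )
```

The docstring states what the function does and what its parameters mean; the implementation trick lives in a code comment, in line with guideline 6 of the code style.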

Gensim tutorials and how-tos live in our gallery. See here for how to update the gallery.

Testing

A suite of unit tests is run automatically on each Github pull request.

To run those tests locally on your dev machine, in your own installation of Gensim:

pip install -e .  # compile and install Gensim from the current directory
pip install -e ".[docs,test]"  # additional dependencies in case you also need to rebuild the HTML docs; quoted so that shells such as zsh do not expand the brackets
pytest gensim  # run tests
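
A new contribution should come with tests that pytest can collect. The sketch below assumes a `unittest.TestCase` layout, which pytest picks up as well; the function `simple_tokenize` and the test class are hypothetical and stand in for your own code.

```python
import unittest


def simple_tokenize(text):
    """Hypothetical function under test: lowercase the text and split it on whitespace."""
    return text.lower().split()


class TestSimpleTokenize(unittest.TestCase):
    def test_lowercases_and_splits(self):
        # Repeated whitespace must not produce empty tokens.
        self.assertEqual(simple_tokenize("Hello  World"), ["hello", "world"])

    def test_empty_input(self):
        # An empty document yields no tokens rather than raising.
        self.assertEqual(simple_tokenize(""), [])
```

Saved as a file whose name starts with test_, this can be run on its own by passing the file path to pytest, which is quicker than running the whole suite while you iterate.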

Git flow

  • master branch is stable, HEAD is always the latest release
  • develop branch contains the latest code for the next release.
  • various feature branches and PRs, to be merged into develop after review

For a new feature, branch off develop:

$ git checkout -b myfeature develop

We never squash PRs. PRs are merged with their full history, including the merge commit (for easier reverts). Make sure your commit messages are meaningful and clean.

Legal

By submitting your contribution to Gensim, you agree to assign all rights of your changes to me, Radim Řehůřek. For countries where such assignment is not legally possible (e.g. EU), you agree to grant a permanent, irrevocable, royalty-free license to do as I please with your contribution.

This means I will have the full rights to incorporate, distribute and/or further modify your changes, without any fees or restrictions from you. I am not interested in any legalese (Gensim is free, I have no budget for it), so if this doesn't work for you, I'm sorry but I cannot accept your code.