Google Summer of Code 2014

Josef Perktold edited this page Mar 2, 2014 · 4 revisions

Statsmodels has participated in GSoC for five years under the umbrella of the Python Software Foundation. The focus in previous years has been on adding new models. Our priority for 2014 is to improve the current codebase and to finish and integrate models and functions that are work in progress and still need refactoring, additional unit tests, and missing functionality.

Introduction

Statsmodels is a library for statistics and econometrics written in Python, with some extensions written in Cython. It now contains many of the most commonly used models for estimation, hypothesis tests, and statistical graphs. See our documentation for more information.

Guidelines & requirements

Statsmodels will participate in GSoC 2014 under the umbrella of the Python Software Foundation (subject to approval by the PSF).

PSF student guidelines: http://wiki.python.org/moin/SummerOfCode/Expectations

Advice on writing a proposal (written with the Mailman project in mind, but generally applicable)

The most important requirement that we expect from students is a sufficient background in statistics or econometrics. Students should be comfortable with Python (intermediate level). Knowing how to use Git is also important; this can be learned before the official start of GSoC if needed though.

Advice

Potential candidates should take a look at the guidelines on how to contribute to statsmodels. Making a small enhancement, bugfix, documentation fix, or similar contribution to statsmodels before applying for GSoC is a requirement from the PSF (it does not need to be related to your proposal); it can also help you get an idea of how things would work during GSoC.

Start on your proposal early, post a draft to the mailing list and iterate based on the feedback you receive. This will not only improve the quality of your proposal, but also help you find a suitable mentor.

Ideas

The general objective is to increase unit test coverage and to bring pull requests and higher-priority sandbox code into a condition in which they can be merged. Additional improvements and enhancements can also be added to the current core code. Many improvements will not require a large amount of time; below is a non-exhaustive list of ideas that are mostly larger in terms of the required time. The issues on GitHub will provide a starting point in most cases.

A second set of possible projects is in areas of statistics where statsmodels currently still has major gaps.

Examples:

Close gaps in unit test coverage and fix bugs if necessary: Almost all core code has good functional coverage (verifying correctness) but less common code paths and unusual user inputs are insufficiently tested. Some code on the "fringes" has insufficient test coverage. Some functions need updating for the full integration with and support of pandas data structures.

nonparametric: kernel density estimation and kernel regression: part of it needs additional improvements; part of kernel regression is in the sandbox and needs unit tests, bug fixes, and some missing pieces.

system of equations, simultaneous equations: a previous GSOC project that needs to be updated to the current statsmodels code base, plus missing test coverage, and possibly additional results.

repeated measures anova: rewrite code in pull request to integrate with pandas and conform to statsmodels code structure.

other Pull Requests: ...

Migrate Pandas.stats to statsmodels: see https://github.com/pydata/pandas/issues/6077

stats Power and effect size: Currently the power and sample size calculations mainly provide a low-level interface. We need additional effect size calculations and additional functions that make power and sample size calculations easier to use.

Bootstrap, resampling methods: we have bootstrap methods incorporated in several models, and there are additional examples and scripts inside and outside of statsmodels. statsmodels is still missing a consistent framework, helper functions, and integration of these with existing models.

...

and there are many more areas (see our github Issues, and some SMEPs have related comments)
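As a rough illustration of the kind of helper functions that the effect size and bootstrap ideas above have in mind, the following sketch computes Cohen's d for two independent samples and a percentile bootstrap interval for it. It uses only the Python standard library; the names `cohens_d` and `bootstrap_ci` are hypothetical and are not part of the statsmodels API, and the percentile method is just one of several possible bootstrap intervals.

```python
import random
import statistics


def cohens_d(x, y):
    """Cohen's d for two independent samples, based on the pooled variance."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / pooled_var ** 0.5


def bootstrap_ci(x, y, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(x, y).

    Each group is resampled with replacement, independently of the other.
    """
    rng = random.Random(seed)  # fixed seed so that results are reproducible
    draws = sorted(
        stat(rng.choices(x, k=len(x)), rng.choices(y, k=len(y)))
        for _ in range(n_boot)
    )
    lower = draws[int(n_boot * alpha / 2)]
    upper = draws[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper


# Hypothetical data, for illustration only.
treatment = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
control = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

d_hat = cohens_d(treatment, control)
ci_low, ci_high = bootstrap_ci(treatment, control, cohens_d)
```

A framework inside statsmodels would additionally have to deal with degenerate resamples (for example, zero variance in both groups), with other interval types, and with integration into the existing model and results classes.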

---

In Statistics, some examples for possible projects are

classical multivariate analysis: PCA, factor analysis, and canonical correlation analysis. There are algorithms for some of these in other Python packages, but they either don't provide the full statistical model or don't have the associated statistical results for it.

penalization or regularization approaches for generalized linear or maximum likelihood models: Currently the only model with penalized estimation is L1-penalization for discrete models. We don't want to duplicate the excellent facilities of scikit-learn, but there is a large range of use cases and models.

loglinear models: Both scipy.stats and statsmodels have hypothesis tests for qualitative data and contingency tables; however, there is no systematic approach yet to support this.

multiple imputation: The only missing value handling that is currently available is to drop observations or cases.

---

(part of text used from scipy GSOC page)