
Google Summer of Code 2018

Josef Perktold edited this page Feb 26, 2018 · 7 revisions

Note: This is currently a draft page. Topics have been updated to account for priorities in 2018 but might still change.

Statsmodels has participated in GSoC for nine years under the umbrella of the Python Software Foundation. The focus in previous years has been on adding new models. There are still several areas where statsmodels is missing commonly used models. We also have several models that have been started but still need work to finish them, add unit tests, and integrate them into statsmodels. Finally, there are several areas where existing models can be extended. One important consideration in selecting a project is the background of the student: it is an advantage if the student is familiar with the topic and may also be using it in her or his research.

Introduction

Statsmodels is a library for statistics and econometrics written in Python, with some extensions written in Cython. It now contains many of the most commonly used models for estimation, hypothesis tests, and statistical graphs. See our documentation for more information. The developer pages describe in more detail how to contribute to statsmodels and our workflow for pull requests. Our issues are also on GitHub; they include bug reports, wishlist items, and enhancement plans and ideas.

Guidelines & requirements

We are again planning to participate in GSoC 2018 under the umbrella of the Python Software Foundation.

The PSF getting started page http://python-gsoc.org/#gettingstarted and the student guidelines http://wiki.python.org/moin/SummerOfCode/Expectations provide detailed information about the program, its requirements, and expectations.

The most important requirement that we expect from students is a sufficient background in statistics or econometrics. Students should be comfortable with Python (intermediate level). Knowing how to use Git is also important; this can be learned before the official start of GSoC if needed.

Advice

Potential candidates should take a look at the guidelines on how to contribute to statsmodels. Making a small enhancement, bug fix, documentation fix, etc. to statsmodels before applying for GSoC is a requirement from the PSF (it does not need to be related to your proposal); it can also help you get an idea of how things would work during GSoC.

Start on your proposal early, post a draft to the mailing list and iterate based on the feedback you receive. This will not only improve the quality of your proposal, but also help you find a suitable mentor.

Ideas

We encourage students to propose their own projects, but we also have several areas that are high on our priority list. Our priority list is flexible, and it is important that the topic matches the interest and background of the student.

Note that the difficulty level depends on the student's statistics/econometrics background and familiarity with the current statsmodels code.

common to all projects:

  • domain-specific knowledge: high level of statistics or econometrics knowledge for the specific topic
  • programming language: Python, intermediate level

possible ideas for 2018

(Initial suggestions, to be revised. More details, lower-priority ideas, and old project ideas are listed below.)

One high priority area is to continue our good coverage of time series models, specifically:

  • automatic forecasting
  • Markov-Switching VAR or other statespace models

Other topics with higher priority in 2018 are

  • parametric survival and failure time models
  • Post-estimation Inference and Diagnostics for GLM and discrete models
  • Poisson or GLM estimation of survival or failure time models with grouped or interval censored observations
  • Treatment effect estimation, causality, propensity score methods (needs to be narrowed down to a feasible plan)

possible topics in statespace and time series models

We would like to have one project that continues the development of statespace and related models. This is still a large area, and student and mentor will need to narrow it down to a project proposal that is of interest to both the student and the mentor(s).

Automatic Forecasting

difficulty: easy to intermediate

mentor: Chad Fulton, Josef Perktold

Statsmodels still lacks support for automatic forecasting in statespace and other time series models. We have many of the basic time series models, but automatic specification or model search, and checks for outliers or other properties of time series that are relevant for forecasting, are not generally available.
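As a rough illustration of what automatic specification search means, the sketch below picks the order of an autoregressive model by minimizing AIC. It is plain Python, not statsmodels code: the helper names (`autocovariances`, `ar_innovation_variances`, `select_ar_order`) are hypothetical, the AR fits come from the Yule-Walker equations solved with the Levinson-Durbin recursion, and the criterion is the Gaussian approximation AIC = n*log(sigma2) + 2*p. An actual project would search a much richer model class and add outlier checks.

```python
import math
import random


def autocovariances(x, max_lag):
    """Biased sample autocovariances gamma[0..max_lag] of a sequence."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    return [sum(d[t] * d[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]


def ar_innovation_variances(gamma, max_order):
    """Levinson-Durbin recursion on the Yule-Walker equations.

    Returns sigma2[p], the innovation variance of the AR(p) fit,
    for p = 0..max_order.
    """
    sigma2 = [gamma[0]]
    phi = []  # AR coefficients of the previous order
    for p in range(1, max_order + 1):
        num = gamma[p] - sum(phi[j] * gamma[p - 1 - j] for j in range(p - 1))
        k = num / sigma2[-1]  # partial autocorrelation at lag p
        phi = [phi[j] - k * phi[p - 2 - j] for j in range(p - 1)] + [k]
        sigma2.append(sigma2[-1] * (1.0 - k * k))
    return sigma2


def select_ar_order(x, max_order=5):
    """Return (best_order, aic_list) using AIC = n*log(sigma2) + 2*p."""
    n = len(x)
    sigma2 = ar_innovation_variances(autocovariances(x, max_order), max_order)
    aic = [n * math.log(s2) + 2 * p for p, s2 in enumerate(sigma2)]
    best = min(range(len(aic)), key=aic.__getitem__)
    return best, aic


# Simulated AR(1) series with coefficient 0.7 (illustrative data only).
rng = random.Random(0)
series = [0.0]
for _ in range(499):
    series.append(0.7 * series[-1] + rng.gauss(0.0, 1.0))

best_order, aic_values = select_ar_order(series, max_order=5)
print("selected AR order:", best_order)
```

The search here is a simple loop over candidate orders; the same pattern (fit each candidate, score it, keep the best) would carry over to ARIMA or state space specifications.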

A couple of possible state space topics

difficulty: intermediate to hard

mentor: Chad Fulton, Josef Perktold

  • Markov-Switching VARs: We have the Cython versions of the Hamilton Filter and Smoother, so I think we can do e.g. Krolzig's stuff (I think he develops the EM algorithm for the class he considers).
  • We now have the Cython smoothers, so the EM algorithm is possible in state space models. We could try to add that to various models.
  • Specific models that are missing: ARFIMA, multivariate unobserved components models, more complex cycles, now-casting type models and (non-state-space) MIDAS models
  • I want to get more postestimation results for state space models (e.g. IRF confidence intervals)
  • If someone really liked unobserved components models, a project could be to make a really comprehensive implementation in some way (e.g. Harvey's 1989 book has many more extensions, hypothesis / specification tests, etc. than we have). I think that a really well-developed model could be pretty nice.
  • If someone really liked VARIMA models, there's a bunch to be done there, as far as identification and estimation.
  • Nonlinear / non-Gaussian state space models.
  • We still don't have the framework for linear restrictions (this is pretty easy, and it's not in there because I've never used it myself or seen it used practically)

Non-state-space:

  • We still don't have forecasting or IRFs, etc. for any of the Markov switching models. Also the EM algorithms are not entirely "correct" (so right now they're private and only used for initial values for the usual fit methods).

Post-estimation Inference and diagnostic tests especially for GLM

GLM currently has no analysis of deviance, no analogue of anova_lm, and no similarly convenient method to compare nested models. Diagnostic and specification tests, as well as influence and outlier methods, are only available for OLS and partially for WLS. The third part of diagnostics consists of plots, such as regression or residual plots, that help with visual inspection of the appropriateness of the model specification. Similar functions for GLM or other nonlinear maximum likelihood models are still missing. Some methods are described in the documentation of SAS or other packages. This will for the most part be a collection of functions similar to what is available for OLS.

Preliminary work for outlier and influence measures is in https://github.com/statsmodels/statsmodels/issues/4268, which also contains links to other diagnostic issues.
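To make the idea of an analysis of deviance concrete, here is a minimal sketch that compares two nested Poisson models by their difference in deviance. It deliberately avoids any statsmodels API: the functions `poisson_deviance` and `chi2_sf_1df` are hypothetical helpers, the data are made up, and the example relies on the fact that an intercept-only Poisson model fits the overall mean while a model with a group dummy fits the group means.

```python
import math


def poisson_deviance(y, mu):
    """Poisson deviance 2 * sum(y*log(y/mu) - (y - mu)), with 0*log(0) = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += term - (yi - mi)
    return 2.0 * total


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Illustrative counts for two groups (made-up data).
y = [2, 3, 1, 4, 7, 9, 6, 8]
group = [0, 0, 0, 0, 1, 1, 1, 1]

# Null model: intercept only, so the fitted mean is the overall mean.
mean_all = sum(y) / len(y)
mu_null = [mean_all] * len(y)

# Larger model: intercept plus group dummy, so fitted means are group means.
group_mean = {}
for g in set(group):
    values = [yi for yi, gi in zip(y, group) if gi == g]
    group_mean[g] = sum(values) / len(values)
mu_full = [group_mean[g] for g in group]

dev_null = poisson_deviance(y, mu_null)
dev_full = poisson_deviance(y, mu_full)
lr_stat = dev_null - dev_full      # difference in deviance
p_value = chi2_sf_1df(lr_stat)     # one extra parameter, so 1 df
print(f"deviance difference = {lr_stat:.3f}, p-value = {p_value:.4f}")
```

A convenience function in the spirit of anova_lm would presumably wrap this comparison for a sequence of fitted results and report the deviances, degrees of freedom, and p-values in a table.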

difficulty: easy to intermediate

mentor: Josef Perktold, Kerby Shedden

Add Maximum Likelihood Models for other distributions - Survival and Failure Times

This is a relatively easy project in the sense that it can largely follow the patterns of existing models. A large variety of distributions can be added as maximum likelihood models. This year the priority is parametric survival or failure time models, especially accelerated failure time models and similar. Some examples are the exponential, Weibull, and generalized gamma distributions. References are available in the documentation for the corresponding models in R or Stata. Some preliminary work and references are now in https://github.com/statsmodels/statsmodels/issues/4217 .
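The simplest member of this family, the exponential model with right censoring, shows the ingredients such a model class needs: a log-likelihood that treats failures and censored observations differently, an estimator, and a standard error. The sketch below is plain Python with hypothetical function names and made-up data; it is not the interface of any existing statsmodels class, and Weibull or generalized gamma models would need numerical optimization instead of the closed form used here.

```python
import math


def exponential_loglike(rate, times, events):
    """Log-likelihood of an exponential model with right censoring.

    events[i] is 1 for an observed failure and 0 for a censored time.
    Failures contribute log(rate) - rate*t, censored times contribute -rate*t.
    """
    return sum(d * math.log(rate) - rate * t for t, d in zip(times, events))


def exponential_mle(times, events):
    """Closed-form MLE of the rate: number of failures / total time at risk."""
    return sum(events) / sum(times)


# Illustrative follow-up times and event indicators (made-up data).
times = [2.0, 3.5, 1.2, 6.0, 4.1, 0.7, 5.3, 2.8]
events = [1, 1, 1, 0, 1, 1, 0, 1]

rate_hat = exponential_mle(times, events)
# Observed information is (number of failures) / rate^2,
# which gives the standard error rate / sqrt(number of failures).
se_rate = rate_hat / math.sqrt(sum(events))
print(f"rate = {rate_hat:.4f}, se = {se_rate:.4f}")
```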

Note: New count models have been developed during GSOC 2017.

difficulty: easy to intermediate

mentor: Josef Perktold, Kerby Shedden

Extensions to State Space Models

Statsmodels now includes a general-purpose Kalman filter and state space model.

See the list of possible state space topics above.

difficulty: intermediate to hard

mentor: Chad Fulton, Josef Perktold

Survival Models

Statsmodels includes the Cox proportional hazards model and the Kaplan-Meier estimator. One possible extension would be to extend the Cox proportional hazards model to time-varying explanatory variables, and to add a Poisson or generalized linear model representation that can be used for semi-parametric estimation, e.g. using splines for the baseline hazard.
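The Poisson representation rests on splitting each subject's follow-up time into intervals and recording the exposure and event count per interval. The sketch below shows this data transformation for a piecewise-constant baseline hazard without covariates; the function name `split_follow_up`, the cut points, and the data are all hypothetical choices for illustration. With covariates or a spline baseline, the resulting rows would be passed to a Poisson regression with log exposure as an offset.

```python
def split_follow_up(time, event, cuts):
    """Split one subject's follow-up at the cut points.

    Returns a list of (interval_index, exposure, event_in_interval) rows.
    Interval j covers (cuts[j], cuts[j + 1]]; the last interval is open-ended.
    """
    rows = []
    for j, start in enumerate(cuts):
        if time <= start:
            break
        end = cuts[j + 1] if j + 1 < len(cuts) else float("inf")
        exposure = min(time, end) - start
        failed_here = int(event == 1 and time <= end)
        rows.append((j, exposure, failed_here))
    return rows


# Illustrative follow-up times and event indicators (made-up data).
times = [2.0, 3.5, 1.2, 6.0, 4.1, 0.7, 5.3, 2.8]
events = [1, 1, 1, 0, 1, 1, 0, 1]
cuts = [0.0, 2.0, 4.0]

exposure = [0.0] * len(cuts)
failures = [0] * len(cuts)
for t, d in zip(times, events):
    for j, expo, fail in split_follow_up(t, d, cuts):
        exposure[j] += expo
        failures[j] += fail

# Poisson MLE of a piecewise-constant hazard: failures / exposure per interval.
hazard = [f / e for f, e in zip(failures, exposure)]
for j, h in enumerate(hazard):
    print(f"interval {j}: failures={failures[j]}, hazard={h:.4f}")
```

Splitting preserves the totals: the summed exposure equals the total follow-up time and the summed failures equal the number of observed events, which is a useful unit test for any implementation.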

difficulty: intermediate

mentor: Kerby Shedden, Josef Perktold

Propensity score matching, and treatment effects estimation

High priority but it is a large topic. Needs to be narrowed down depending on interest.

This is another area that is currently missing in statsmodels. There are some projects outside of statsmodels that partially implement it in Python. One possibility is to implement the equivalent of Stata's psmatch or the newer teffects, or of similar packages in R, or GSoC-sized parts of them. PR #2288 has an implementation of the basic parts; related discussion is in issue #858. Some related packages are also available in Python with a compatible license. Other areas are inverse probability weighting, regression discontinuity, and difference-in-differences types of causal inference. (Some parts have draft versions in PRs.)
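As a small illustration of one of these methods, the following sketch computes an inverse-probability-weighted estimate of the average treatment effect. It is not taken from PR #2288 or any existing package: `ipw_ate` is a hypothetical helper, the data are made up, and the propensity score is estimated as the treated fraction within each level of a single binary confounder, standing in for a fitted logit or probit model.

```python
def ipw_ate(y, treat, pscore):
    """Normalized (Hajek) inverse-probability-weighted average treatment effect."""
    w1 = [t / p for t, p in zip(treat, pscore)]
    w0 = [(1 - t) / (1 - p) for t, p in zip(treat, pscore)]
    mean1 = sum(w * yi for w, yi in zip(w1, y)) / sum(w1)
    mean0 = sum(w * yi for w, yi in zip(w0, y)) / sum(w0)
    return mean1 - mean0


# Made-up data: one binary confounder x, treatment is more likely when x = 1,
# and the treatment effect is 2 within each level of x.
x = [0, 0, 0, 0, 1, 1, 1, 1]
treat = [1, 0, 0, 0, 1, 1, 1, 0]
y = [3.0, 1.0, 1.0, 1.0, 6.0, 6.0, 6.0, 4.0]

# Propensity score: treated fraction within each level of x.
frac = {}
for level in set(x):
    t_level = [t for t, xi in zip(treat, x) if xi == level]
    frac[level] = sum(t_level) / len(t_level)
pscore = [frac[xi] for xi in x]

n_treated = sum(treat)
naive = (sum(yi for yi, t in zip(y, treat) if t) / n_treated
         - sum(yi for yi, t in zip(y, treat) if not t) / (len(y) - n_treated))
ate = ipw_ate(y, treat, pscore)
print(f"naive difference = {naive:.2f}, IPW estimate = {ate:.2f}")
```

In this constructed example the naive difference in means is 3.5 because treated units are concentrated where outcomes are higher, while the weighted estimate recovers the within-stratum effect of 2.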

difficulty: intermediate

mentor: Josef Perktold, Kerby Shedden

Panel Data

Dynamic panel data models are an important category of models that are not yet available in Python. The objective is to implement system estimators similar to what is available in other econometrics packages, e.g. xtabond in Stata. References: Arellano and Bond (1991), Arellano and Bover (1995), Blundell and Bond (1998).

Static panel data models are not a priority for 2018. The following is just the current status for reference. Panel data models are still one large category of basic models that is currently missing in statsmodels. There is a pull request for the standard econometrics model (PR #1133). Standard linear panel data models are now largely covered by the linearmodels package https://pypi.python.org/pypi/linearmodels.

difficulty: intermediate

mentor: Kevin Sheppard, Josef Perktold

Other possible projects

bring your own: We are always open to additional topics if the background and interest of the student indicate a high probability of a successful GSoC project.

classical multivariate analysis:

There are algorithms for some of this in other Python packages, but they either don't provide the full statistical model or don't have the associated statistical results for it. This area is currently work in progress and we expect to merge more pull requests before GSoC. However, this is still an area that needs expansion. Update: several new methods, such as MANOVA and factor analysis including rotation, have been added in the last year in at least a basic version. Multivariate OLS is currently a stub version in support of MANOVA and needs to be completed.

other missing standard econometric models:

Several standard econometrics models are not yet available in statsmodels, such as models with endogenous regressors, instrumental variables for nonlinear or non-normal models, selection models or endogenous switching models, and more.

other possible topics

  • sparse matrix support in models
  • Survey methods and adding weighting to GLM, Cox, etc. (unmerged GSOC 2017)
  • BigGLM and related high dimensional/distributed computing approaches to big regression models (not a priority this year)
  • Basic Structural Equations Modeling (SEM) (not a priority this year, too large)
  • ...
  • ...

Cleanup, Refactor and integrate unfinished projects

difficulty: hard for GSOC, requires familiarity with large parts of current statsmodels code.

The general objective is to increase unit test coverage and to bring pull requests and higher-priority code in the sandbox into a condition where they can be merged. Additional improvements and enhancements can also be added to the current core code. Many improvements will not require a large amount of time; below is a non-exhaustive list of ideas that are mostly larger in terms of the required time. The issues on GitHub will provide a starting point in most cases.

Close gaps in unit test coverage and fix bugs if necessary: Almost all core code has good functional coverage (verifying correctness) but less common code paths and unusual user inputs are insufficiently tested. Some code on the "fringes" has insufficient test coverage. Some functions need updating for the full integration with and support of pandas data structures.

system of equations, simultaneous equations: a previous GSOC project that needs to be updated to the current statsmodels code base, plus missing test coverage, and possibly additional results.

Migrate Pandas.stats to statsmodels: see https://github.com/pydata/pandas/issues/6077

Power and effect size: Currently, the power and sample size calculations provide mainly a low-level interface. We need additional effect size calculations and additional functions that make power and sample size calculations easier to use.
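The kind of helper functions meant here can be sketched in a few lines: an effect size computed from data, the power of a test for that effect size, and the sample size needed for a target power. The sketch uses only the Python standard library and a normal approximation for a two-sided two-sample test; the function names are hypothetical and do not correspond to the existing statsmodels power classes.

```python
import math
from statistics import NormalDist, mean, variance


def cohens_d(sample1, sample2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(sample1), len(sample2)
    pooled_var = ((n1 - 1) * variance(sample1)
                  + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    return (mean(sample1) - mean(sample2)) / math.sqrt(pooled_var)


def power_two_sample_z(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample z-test (normal approximation)."""
    norm = NormalDist()
    crit = norm.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * math.sqrt(n_per_group / 2.0)
    return norm.cdf(shift - crit) + norm.cdf(-shift - crit)


def sample_size_two_sample_z(effect_size, power=0.8, alpha=0.05):
    """Smallest n per group that reaches the requested power."""
    if effect_size == 0:
        raise ValueError("effect_size must be nonzero")
    n = 2
    while power_two_sample_z(effect_size, n, alpha) < power:
        n += 1
    return n


d = cohens_d([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
print("Cohen's d:", d)
print("power at n=50 per group, d=0.5:", round(power_two_sample_z(0.5, 50), 3))
print("n per group for 80% power, d=0.5:", sample_size_two_sample_z(0.5))
```

Solving for the sample size by stepping through integers is slow but transparent; a production version would more likely use a root finder and exact t or noncentral distributions.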

Bootstrap, resampling methods: We have bootstrap methods incorporated in several models, and there are additional examples and scripts inside and outside of statsmodels. Statsmodels is still missing a consistent framework, helper functions, and integration of these with existing models.
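A generic helper of the kind that is missing could look roughly like the following percentile bootstrap. This is a standard-library sketch under simple assumptions (independent observations, a statistic that takes a list), and `bootstrap_percentile_ci` is a hypothetical name, not an existing statsmodels function; integration with models would additionally need resampling of rows of the design matrix or of residuals.

```python
import random
from statistics import mean


def bootstrap_percentile_ci(data, statistic, n_boot=2000, level=0.95, seed=None):
    """Percentile bootstrap confidence interval for statistic(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and evaluate the statistic on each resample.
    replicates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = replicates[int(tail * n_boot)]
    upper = replicates[min(n_boot - 1, int((1.0 - tail) * n_boot))]
    return lower, upper


# Illustrative data (made up).
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]
low, high = bootstrap_percentile_ci(data, mean, seed=12345)
print(f"sample mean = {mean(data):.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Passing the statistic as a callable keeps the helper independent of any particular model, which is one possible design for a shared framework.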
