
Ideas for Enhancement Projects

josef-pkt edited this page Apr 28, 2013 · 6 revisions

editorial comment: This is an updated version of Ideas for Google Summer of Code 2012 Projects

Under Pages there are additional drafts for statsmodels enhancement proposals (SMEPs).

The range of areas where coverage in statsmodels is lacking is still pretty wide. So, if a student has a strong preference for a topic, it should be, or at least might be, possible to cover it.

The basic idea is to pick your favorite chapters in an econometrics or statistics book, an R package, a Stata topic, or any other package for statistical analysis, then see what is missing in statsmodels and which parts would be most useful to have first.

Of course, support for a topic will also depend on the availability of a mentor with sufficient expertise to advise.

The following are some ideas. If you are interested in one of the topics, we can also help with additional information.

Current or Recent Projects

Support for formulas and categorical data

Status: implemented and will be released in statsmodels 0.5

Author: Skipper Seabold, based on patsy, Nathaniel Smith's formula package

Convenient support for categorical explanatory variables is still largely lacking in statsmodels. This can follow up on the existing formula implementation of Jonathan and of Nathaniel, and the start of the integration in the statsmodels account on github. The topic is pretty complex and I would recommend it only to someone familiar with the formula framework in R.

Extend linear models to non-linear models

Status: was a GSOC 2012 project, partially finished, Pull Request

Linear_model, robust_linear_model and generalized_linear_model could all take a given non-linear function y = f(x, parameters) instead of the current linear version y = X*beta. Technically this can follow mostly the pattern of the current linear versions, but requires that one gets familiar with all three models.
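To illustrate the idea only (this is not the code from the pull request), a non-linear mean function can be fitted by least squares with `scipy.optimize.least_squares`. The function `mean_func`, the simulated data, and the starting values are assumptions made for this sketch.

```python
import numpy as np
from scipy.optimize import least_squares


def mean_func(x, params):
    """Hypothetical non-linear mean function y = a * exp(b * x)."""
    a, b = params
    return a * np.exp(b * x)


def residuals(params, x, y):
    # Least squares minimizes the sum of squares of y - f(x, params).
    return y - mean_func(x, params)


# Simulated data, for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 200)
true_params = np.array([2.0, 0.5])
y = mean_func(x, true_params) + 0.05 * rng.standard_normal(x.size)

fit = least_squares(residuals, x0=[1.0, 0.1], args=(x, y))
params_hat = fit.x

# Approximate covariance from the Jacobian J at the optimum,
# the non-linear analogue of sigma^2 * (X'X)^{-1} in OLS.
df_resid = x.size - params_hat.size
sigma2 = np.sum(fit.fun ** 2) / df_resid
cov_params = sigma2 * np.linalg.inv(fit.jac.T @ fit.jac)
bse = np.sqrt(np.diag(cov_params))
```

The Jacobian takes the place of the design matrix X, which is why much of the linear model machinery (standard errors, t-tests) can in principle be reused.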

System of Equations

Status: GSOC 2012 project, Pull Request to be merged

Multivariate models, seemingly unrelated regression, simultaneous equation models

Empirical Likelihood

Status: GSOC 2012 project, first part merged, will be released in statsmodels 0.5

Nonparametrics - Kernel Methods

Status: GSOC 2012 project, merged, will be released in statsmodels 0.5, some parts in sandbox

Tobit - Censored or Truncated Regression

Status: Pull Request

Survival Models

Status: work in progress Pull Request, under refactoring by Skipper

Extensions to Robust Regression

Status: WIP Pull Request (#452) by Josef, additional work by Virgile

LTS, ELTS, MM-Estimators

Statistical Tests

Status: partially implemented, some in WIP, others missing

The coverage of statistical hypothesis tests is increasing, but some tests are still missing in statsmodels and scipy.stats, or have only limited options. Results classes for the outcome of statistical tests are also mostly missing, and they need supporting methods (plot, summary, confidence intervals, ...).

Additional support for power and sample size calculations and for effect size calculations has just been started.
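As a minimal sketch of what a power and sample size calculation involves, the following uses the normal approximation for a two-sided one-sample z-test with a standardized effect size. The function names `ztest_power` and `ztest_nobs` are hypothetical and are not the statsmodels API; only the Python standard library is used.

```python
from math import sqrt
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution


def ztest_power(effect_size, nobs, alpha=0.05):
    """Power of a two-sided one-sample z-test (hypothetical helper).

    effect_size is the standardized mean difference (mean / std).
    """
    crit = _norm.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(nobs)
    # Probability of rejecting in the upper tail plus the lower tail.
    return _norm.cdf(shift - crit) + _norm.cdf(-shift - crit)


def ztest_nobs(effect_size, power=0.8, alpha=0.05):
    """Smallest sample size that reaches the requested power."""
    if effect_size == 0:
        raise ValueError("effect_size must be nonzero")
    nobs = 1
    while ztest_power(effect_size, nobs, alpha) < power:
        nobs += 1
    return nobs


nobs_needed = ztest_nobs(0.5, power=0.8, alpha=0.05)
```

With an effect size of zero the power reduces to the significance level `alpha`, which is a quick sanity check for any implementation.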

Ideas and Open Projects

Instrumental Variables and GMM

Generic GMM is mostly implemented in the sandbox, but it has missing pieces. Except for the two-stage least squares case, no specific models that use GMM are implemented. The possible application areas are wide; one possibility that has been popular in recent years would be support for weak instruments.
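For orientation, two-stage least squares is the GMM estimator for the moment condition E[Z'(y - X b)] = 0 with weighting matrix (Z'Z)^{-1}. The following is a minimal numpy sketch under that assumption, not the sandbox implementation; the helper names `ols` and `tsls` and the simulated data are made up for illustration.

```python
import numpy as np


def ols(y, X):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def tsls(y, X, Z):
    """Two-stage least squares (hypothetical helper).

    X: regressors, possibly with endogenous columns
    Z: instruments, including the exogenous columns of X
    """
    # First stage: project the regressors on the instruments.
    X_hat = Z @ ols(X, Z)
    # Second stage: regress y on the projected regressors.
    return ols(y, X_hat)


# Simulated data: x is correlated with the error u, so OLS is
# inconsistent, while z is a valid instrument for x.
rng = np.random.default_rng(12345)
nobs = 5000
z = rng.standard_normal(nobs)
u = rng.standard_normal(nobs)
x = z + 0.8 * u + 0.5 * rng.standard_normal(nobs)
y = 1.0 + 2.0 * x + u

const = np.ones(nobs)
X = np.column_stack([const, x])
Z = np.column_stack([const, z])

beta_ols = ols(y, X)    # slope biased away from 2.0
beta_iv = tsls(y, X, Z)  # slope close to 2.0
```

A full implementation would also need the correct standard errors, which use the residuals y - X b with the original X rather than the second-stage residuals.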

Panel data and mixed effects models

These are models with an additional random component that can be implemented from either a statistics or an econometrics viewpoint. The topic is large, so some selection has to be made.

Vincent has a pull request for the basic panel data model (within, between, and one-random-factor models).
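As a rough sketch of the within (fixed effects) estimator mentioned above, and not the code in the pull request: subtracting group means removes the individual effect, after which the slope can be estimated by OLS on the demeaned data. The helper `demean_by_group` and the simulated panel are assumptions of this example.

```python
import numpy as np


def demean_by_group(a, groups):
    """Subtract from each row of `a` the mean of its group."""
    a = np.asarray(a, dtype=float)
    _, codes = np.unique(groups, return_inverse=True)
    a2d = a.reshape(len(a), -1)
    sums = np.zeros((codes.max() + 1, a2d.shape[1]))
    np.add.at(sums, codes, a2d)  # per-group column sums
    means = sums / np.bincount(codes)[:, None]
    return (a2d - means[codes]).reshape(a.shape)


# Simulated balanced panel: the individual effect alpha is correlated
# with x, so pooled OLS is biased for the slope of 1.5.
rng = np.random.default_rng(2013)
n_groups, n_periods = 200, 10
groups = np.repeat(np.arange(n_groups), n_periods)
alpha = rng.standard_normal(n_groups)[groups]
x = alpha + rng.standard_normal(groups.size)
y = alpha + 1.5 * x + 0.5 * rng.standard_normal(groups.size)

# Pooled OLS with a constant, ignoring the panel structure.
X_pooled = np.column_stack([np.ones_like(x), x])
beta_pooled = np.linalg.lstsq(X_pooled, y, rcond=None)[0][1]

# Within estimator: OLS on group-demeaned data (no constant needed).
x_w = demean_by_group(x, groups)
y_w = demean_by_group(y, groups)
beta_within = (x_w @ y_w) / (x_w @ x_w)
```

The between estimator would instead regress the group means of y on the group means of x.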

Panel data and GMM, or mixed effects models and GEE

These are similar ideas but with different implementations from a statistics or an econometrics viewpoint: estimation and inference based on moment conditions or estimating equations that exploit the panel or longitudinal structure of the data.

Mixed Effects Models for Non-Normal Distributions

Review for Generalized Linear Mixed Models: Dean, C. B., and Jason D. Nielsen. 2007. “Generalized Linear Mixed Models: a Review and Some Extensions.” Lifetime Data Analysis 13 (November 14): 497–512. doi:10.1007/s10985-007-9065-x.

Time Series Analysis: non-linear models

This is a wide range of models where statsmodels is completely lacking. Examples would be threshold models, Markov switching models, ...

Time Series Analysis: Factor models, Factor VAR

Mainly Stock and Watson and follow-up work. It would also be interesting to link this up with some of the variable selection procedures in sklearn, similar to Bai and Ng.

Time Series Analysis: VECM, Cointegration

Extend the current vector_ar models to include the VECM representation and estimation, and the corresponding cointegration estimation.

Time Series Analysis: Bayesian Dynamic Linear Models

Adapt and integrate Wes's DLM code. (JP: I don't know what the status is.)

Time Series Analysis: GARCH

Large parts of univariate GARCH are written and in the sandbox, but they need cleanup, enhancements and verification.

Bootstrap

Statsmodels is missing a systematic framework for bootstrap and other resampling approaches, although some bootstrap code is included in several models and other parts of statsmodels. A basic framework needs to make the tools (iterators) available and tie them in with the various models, or add them to statistical tests. (This might require more familiarity with the model structure in statsmodels.)
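One possible shape for the iterator-based tools, as a sketch only: a generator yields resampling indices, and a thin wrapper applies any statistic to each resample. The names `bootstrap_indices` and `bootstrap_statistic` are hypothetical and do not exist in statsmodels.

```python
import numpy as np


def bootstrap_indices(nobs, n_rep, rng):
    """Yield n_rep index arrays that resample nobs rows with replacement."""
    for _ in range(n_rep):
        yield rng.integers(0, nobs, size=nobs)


def bootstrap_statistic(data, statistic, n_rep=1000, seed=None):
    """Apply `statistic` to each bootstrap resample of `data`."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    return np.array([statistic(data[idx])
                     for idx in bootstrap_indices(len(data), n_rep, rng)])


# Example: bootstrap standard error and percentile interval of the mean.
rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=200)

boot_means = bootstrap_statistic(sample, np.mean, n_rep=2000, seed=0)
boot_se = boot_means.std(ddof=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

# Analytic standard error of the mean, for comparison.
analytic_se = sample.std(ddof=1) / np.sqrt(sample.size)
```

Keeping the index generator separate from the statistic means the same iterator could be reused for models (resampling rows of the data or residuals) and for statistical tests.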

Expand graphics support with matplotlib

Status: slowly increasing, pandas had a GSOC 2012 project

statsmodels includes some plots based on matplotlib, but compared to other statistical packages there are still gaps. An idea would be to implement graphics with a coverage similar to that of other statistical packages, in a user-friendly way.

Matching: Multivariate and Propensity Score Matching with Balance Optimization

Other software packages promise: "Provides functions for multivariate and propensity score matching and for finding optimal balance based on a genetic search algorithm. A variety of univariate and multivariate metrics to determine if balance has been obtained are also provided."

For example:

Other (fill in the details)

two stage models (e.g. Heckman sample selection)

extension to discrete models

non-parametric estimation, extension to kernel regression

....
