SMEP: Nonparametric Estimation (GSOC)

George Panterov (Ralf Gommers) GSOC 2012

status: merged, kernel regression partially unfinished in sandbox

Objective

Develop the nonparametric capabilities of statsmodels, focusing in particular on data-driven bandwidth selection procedures and on conditional and unconditional multivariate probability density and cumulative distribution estimation, and implement popular nonparametric and semiparametric regression models.

Abstract

As we get better at collecting data and as our computational resources continue to increase, nonparametric methods will become more and more appealing to researchers, even though they require more data and are more computationally intensive. Several commercial packages, such as Matlab and Mathematica, can currently handle some nonparametric estimation. In addition, some open-source environments such as R have libraries that can handle some nonparametric estimation [4]. The goal of this project is to develop an open-source, Python-based alternative to these tools within statsmodels (see [1] and [6]), which would make the package even more appealing to practitioners and academics and hopefully make Python the primary choice for computational work.

The main focus of my summer work will be to expand the current nonparametric capabilities of statsmodels [2] [3] in three main directions: develop fully data-driven bandwidth selection methods and improve the existing “rule-of-thumb” methods; make it possible to handle conditional and unconditional multivariate kernel density estimation; and work on popular nonparametric models (see the textbook Nonparametric Econometrics by Qi Li and Jeff Racine, 2007).

Project Schedule

Pre-GSoC Get familiar with the profiling tools for Python, and organize and become familiar with the existing code in the sandbox [3]. Look for tutorials on optimizing the speed of the data-driven bandwidth selection methods.

Week 1 – 2 (May 21 – June 3) Start work on the bandwidth selection methods. Add fully data-driven methods to the current “rule-of-thumb” methods: likelihood cross-validation, least-squares cross-validation, and the Hurvich, Simonoff and Tsai (1998) bandwidth selection method. Introduce several “plug-in” bandwidth selection procedures for some of the more popular distributions. Extend the kernel library to handle categorical variables.
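
To make the difference between a rule-of-thumb bandwidth and a data-driven one concrete, here is a minimal sketch for the univariate case with a Gaussian kernel. It is only an illustration under stated assumptions, not the planned statsmodels API: the function names (rule_of_thumb_bw, loo_log_likelihood, cv_ml_bw) are hypothetical, the 1.06 normal-reference constant is one common rule-of-thumb choice, and a plain grid search stands in for a proper numerical optimizer.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard normal density, used as the kernel function."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


def rule_of_thumb_bw(x):
    """Normal-reference rule of thumb: 1.06 * std * n**(-1/5)."""
    x = np.asarray(x, dtype=float)
    return 1.06 * x.std(ddof=1) * x.size ** (-0.2)


def loo_log_likelihood(x, h):
    """Leave-one-out log likelihood of a Gaussian KDE with bandwidth h."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Pairwise kernel evaluations; zero the diagonal so that each
    # observation is left out of its own density estimate.
    k = gaussian_kernel((x[:, None] - x[None, :]) / h)
    np.fill_diagonal(k, 0.0)
    dens = k.sum(axis=1) / ((n - 1) * h)
    # Guard against log(0) for isolated points at very small h.
    return np.sum(np.log(np.maximum(dens, 1e-300)))


def cv_ml_bw(x, grid):
    """Pick the grid bandwidth that maximizes the leave-one-out likelihood."""
    scores = [loo_log_likelihood(x, h) for h in grid]
    return grid[int(np.argmax(scores))]
```

The pairwise kernel matrix makes each likelihood evaluation O(n^2), which is the main reason the speed of the data-driven methods is a concern in this schedule.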

Week 2 – 4 (June 4 – June 17) Begin work on two major classes: a multivariate unconditional density estimator and a multivariate conditional density estimator. Adapt the existing bandwidth selection procedures to handle multivariate density estimation. Create two more classes that will estimate the cumulative distribution functions in the conditional and unconditional cases.
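
As a rough sketch of what the two density estimators compute, the code below evaluates a product-kernel estimate of an unconditional multivariate density and forms the conditional density as the ratio f(y, x) / f(x). It assumes continuous variables, a Gaussian kernel in every dimension, and bandwidths supplied by the caller; the function names are hypothetical and are not the classes proposed here.

```python
import numpy as np


def _as_2d(a):
    """Return a float array of shape (n_obs, n_vars)."""
    a = np.asarray(a, dtype=float)
    return a[:, None] if a.ndim == 1 else a


def product_gaussian_kde(data, points, bw):
    """Product-kernel density estimate of `data` evaluated at `points`.

    bw is a scalar or an array with one bandwidth per variable.
    """
    data = _as_2d(data)        # (n, d)
    points = _as_2d(points)    # (m, d)
    bw = np.asarray(bw, dtype=float)
    u = (points[:, None, :] - data[None, :, :]) / bw        # (m, n, d)
    k = np.exp(-0.5 * u ** 2) / (np.sqrt(2.0 * np.pi) * bw)
    # Multiply the kernels across variables, then average over observations.
    return k.prod(axis=2).mean(axis=1)


def conditional_density(y_data, x_data, y_points, x_points, bw_y, bw_x):
    """Estimate f(y | x) as the ratio of the joint to the marginal density."""
    joint_data = np.column_stack([y_data, x_data])
    joint_points = np.column_stack([y_points, x_points])
    bw_joint = np.concatenate([np.atleast_1d(bw_y), np.atleast_1d(bw_x)])
    f_joint = product_gaussian_kde(joint_data, joint_points, bw_joint)
    f_x = product_gaussian_kde(x_data, x_points, bw_x)
    return f_joint / f_x
```

Because the same bandwidth for x appears in the numerator and the denominator, the conditional estimate integrates to one over y for any fixed x, which is a convenient sanity check when testing.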

Week 4 – 6 (June 18 – July 1) Develop a class that fits nonparametric regression models of the type y = g(x) + e, where x is multivariate, and that implements the local constant kernel estimator and the local linear kernel estimator proposed by Stone (1977) and Cleveland (1979), with appropriate significance tests and marginal effects.
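
The two estimators of g(x) can be sketched as follows for a single evaluation point x0. This is an illustrative sketch only, assuming continuous regressors, a Gaussian product kernel, and user-supplied bandwidths; the function names are hypothetical. The local constant estimator is a kernel-weighted average of y, while the local linear estimator runs a kernel-weighted least-squares fit around x0, so its slope coefficients can serve as estimates of the marginal effects at x0.

```python
import numpy as np


def _kernel_weights(x_data, x0, bw):
    """Gaussian product-kernel weights of each observation relative to x0."""
    u = (x_data - x0) / bw
    # The normalizing constant cancels in both estimators, so it is omitted.
    return np.exp(-0.5 * np.sum(u ** 2, axis=1))


def local_constant(y, x_data, x0, bw):
    """Local constant estimate of g(x0): kernel-weighted mean of y."""
    w = _kernel_weights(x_data, x0, bw)
    return np.sum(w * y) / np.sum(w)


def local_linear(y, x_data, x0, bw):
    """Local linear estimate of g(x0) and of the gradient of g at x0."""
    w = _kernel_weights(x_data, x0, bw)
    # Regress y on an intercept and (x - x0); the intercept estimates g(x0).
    design = np.column_stack([np.ones(len(y)), x_data - x0])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1:]
```

Here x_data is an array of shape (n_obs, n_vars), and x0 and bw are arrays of length n_vars.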

Week 6 – 8 (July 2 – July 15) Midterm (July 13). The work between week 1 and week 6 will form the backbone of the models to come. Code the appropriate tests for the conditional and unconditional density estimators and the nonparametric regression. Cross-check results with the nonparametric package “np” written for R and make sure all computational methods are working properly [4].

Week 8 – 10 (July 16 – July 29) Begin work on extending the model library. Write two classes that can fit semiparametric Tobit models and semiparametric censored regression models. Write appropriate tests for the models.

Week 10 – 12 (July 30 – August 12) Explore the feasibility of including more advanced models such as nonparametric simultaneous equation models and nonparametric panel data models. Check if there is existing code that overlaps and start the groundwork. These should overlap with the existing capabilities of statsmodels [1]. Begin work on the documentation for the models and start writing tests for the nonparametric models developed in the second half of the summer. Compare results with other existing packages.

Week 12 - (August 13 - ) Polish the code and resolve any remaining issues. Ensure that the documentation is complete.