
SMEP: System of Equations (GSOC)

josef-pkt edited this page Apr 28, 2013 · 2 revisions

Alexandre Crayssac (Josef Perktold) GSOC 2012

status : Pull request, basics finished, needs review and merging (possibly more results methods, statistics)

Objective

Provide the capability to estimate systems of linear equations within statsmodels and provide tools for statistical tests.

Abstract

Statsmodels provides classes and functions for estimating many different statistical models, but it currently has no support for estimating systems of structurally related equations. Since many statistical analyses (e.g., in econometrics and biostatistics) are based on systems of equations, my proposal is to provide the capability to estimate systems of linear equations within the statsmodels module and to provide tools for statistical tests.

Project

The project proposes classes and functions to easily manipulate, estimate and test systems of linear regressions within the statsmodels module. Such software already exists, and I have decided to base the proposed features on the systemfit package for R (see [1]). It has the advantages of being open source and of having been tested on a variety of datasets, which supports its reliability.

We restrict the field of study to systems of equations that contain only linear equations between response and explanatory variables (see [3] for a general discussion). There are two main types of systems of linear equations. The more traditional multivariate linear model does not allow the response variable of one regression equation to appear as a predictor in another equation. In contrast, we speak of a simultaneous equation model (SEM) when this restriction is waived. These structural equations are meant to represent causal relationships among the variables in the model.
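To make the distinction concrete, the two kinds of system can be written for two equations as follows. This is only an illustrative notation: the symbols $y_i$, $X_i$, $\beta_i$, $\gamma_i$ and $u_i$ are assumptions introduced here and do not come from the references.

```latex
% Multivariate linear model: only exogenous regressors X_i on the right-hand side
\begin{aligned}
y_1 &= X_1 \beta_1 + u_1 \\
y_2 &= X_2 \beta_2 + u_2
\end{aligned}

% Simultaneous equation model: each response may appear as a regressor in the other equation
\begin{aligned}
y_1 &= \gamma_1 y_2 + X_1 \beta_1 + u_1 \\
y_2 &= \gamma_2 y_1 + X_2 \beta_2 + u_2
\end{aligned}
```

In the second system $y_2$ is correlated with $u_1$ (and $y_1$ with $u_2$), which is why the endogenous regressors need special treatment.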

First I will focus on estimating systems of equations with purely exogenous regressors, because the computation of the parameters follows from straightforward and general matrix computations. In this case the system of equations can be consistently estimated by ordinary least squares (OLS), weighted least squares (WLS), and seemingly unrelated regression (SUR). If the disturbances are not contemporaneously correlated across equations and have the same variance in each equation, the GLS estimator is equivalent to OLS and is efficient. The WLS estimator allows for different variances of the disturbance terms in the different equations but assumes that the disturbance terms are not contemporaneously correlated. If the disturbances are contemporaneously correlated, a generalized least squares (GLS) estimation leads to an efficient estimator for the coefficients. In this case, the GLS estimator is generally called the "seemingly unrelated regression" (SUR) estimator.
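The SUR idea can be sketched with plain NumPy as a two-step feasible GLS: estimate each equation by OLS, use the residuals to estimate the contemporaneous covariance of the disturbances, then run GLS on the stacked system. This is a minimal illustration under simplifying assumptions (balanced data, no degrees-of-freedom correction, a dense Kronecker product). The function name `sur_fgls` and its signature are hypothetical and are not the statsmodels or systemfit API.

```python
import numpy as np


def sur_fgls(ys, Xs):
    """Two-step feasible GLS for a SUR system (illustrative sketch).

    ys : list of M response vectors, each of length T
    Xs : list of M regressor matrices, each of shape (T, k_i)
    Returns (list of M coefficient vectors, M x M residual covariance).
    """
    M = len(ys)
    T = len(ys[0])

    # Step 1: equation-by-equation OLS residuals, one column per equation.
    resid = np.column_stack([
        y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        for y, X in zip(ys, Xs)
    ])
    sigma = resid.T @ resid / T  # contemporaneous covariance estimate

    # Step 2: GLS on the stacked system with Omega = Sigma kron I_T.
    ks = [X.shape[1] for X in Xs]
    X_big = np.zeros((M * T, sum(ks)))  # block-diagonal regressor matrix
    col = 0
    for i, X in enumerate(Xs):
        X_big[i * T:(i + 1) * T, col:col + ks[i]] = X
        col += ks[i]
    y_big = np.concatenate(ys)

    omega_inv = np.kron(np.linalg.inv(sigma), np.eye(T))
    XtOi = X_big.T @ omega_inv
    beta = np.linalg.solve(XtOi @ X_big, XtOi @ y_big)

    # Split the stacked coefficient vector back into one vector per equation.
    return np.split(beta, np.cumsum(ks)[:-1]), sigma


# Simulated two-equation system with correlated disturbances (made-up data).
rng = np.random.default_rng(0)
T = 200
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
X2 = np.column_stack([np.ones(T), rng.normal(size=T)])
errs = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=T)
y1 = X1 @ np.array([1.0, 2.0]) + errs[:, 0]
y2 = X2 @ np.array([-1.0, 0.5]) + errs[:, 1]

betas, sigma = sur_fgls([y1, y2], [X1, X2])
print(betas)
print(sigma)
```

When every equation has the same regressor matrix, this estimator reduces to equation-by-equation OLS, which is a convenient sanity check for any implementation.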

Then I plan to implement methods for estimating systems with endogenous regressors and for checking identification conditions (rank and order conditions). In this case OLS, WLS and SUR estimates are biased and inconsistent. I will implement the 2SLS, W2SLS and 3SLS methods, which rely on instrumental variables. The two-stage least squares (2SLS) estimator is based on the same assumptions about the disturbance terms as the OLS estimator. The weighted two-stage least squares (W2SLS) estimator allows for different variances of the disturbance terms in the different equations. If the disturbances are contemporaneously correlated, a feasible generalized least squares (FGLS) version of the 2SLS estimation leads to consistent and asymptotically more efficient estimates; this is the three-stage least squares (3SLS) procedure.
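As a minimal sketch of the instrumental-variable step, single-equation 2SLS can be written in NumPy as two successive least-squares fits. The function name `two_stage_least_squares` and the simulated data are hypothetical. The sketch checks only the order condition (at least as many instruments as regressors), not the rank condition, and it does not cover W2SLS or 3SLS.

```python
import numpy as np


def two_stage_least_squares(y, X, Z):
    """2SLS for one equation y = X b + u (illustrative sketch).

    X : (T, k) regressors, some of which may be endogenous
    Z : (T, l) instruments, including the exogenous columns of X
    """
    if Z.shape[1] < X.shape[1]:
        raise ValueError("order condition fails: fewer instruments than regressors")
    # First stage: project every regressor onto the instrument space.
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage: regress y on the fitted regressors.
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]


# Simulated data: x is endogenous because it depends on the disturbance u.
rng = np.random.default_rng(0)
T = 5000
z = rng.normal(size=T)                       # instrument, independent of u
u = rng.normal(size=T)                       # structural disturbance
x = z + 0.8 * u + 0.5 * rng.normal(size=T)   # endogenous regressor
y = 1.0 + 2.0 * x + u

X = np.column_stack([np.ones(T), x])
Z = np.column_stack([np.ones(T), z])

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_2sls = two_stage_least_squares(y, X, Z)
print("OLS: ", beta_ols)    # slope is pushed away from the true value 2
print("2SLS:", beta_2sls)
```

In the exactly identified case used here (as many instruments as regressors), 2SLS coincides with the simple IV estimator that solves `Z.T @ X @ b = Z.T @ y`.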

In addition, I will provide standard regression statistics such as an estimator of the covariance matrix of the estimated coefficients, the covariance matrix of the residuals, and the degrees of freedom. Finally, if time permits, I will implement methods for estimation under linear restrictions on the coefficients, and/or provide classical statistical tests for systems of equations, and/or implement the full information maximum likelihood (FIML) estimator.
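The kind of statistics meant here can be illustrated for a single equation: the coefficient covariance $s^2 (X'X)^{-1}$, the residual degrees of freedom, and a Wald statistic for a linear restriction $R\beta = q$. This is a sketch only. The helper names `ols_with_stats` and `wald_statistic` are hypothetical, and in the system case the covariance would instead be the GLS form $(X'\Omega^{-1}X)^{-1}$.

```python
import numpy as np


def ols_with_stats(y, X):
    """OLS coefficients, their covariance matrix and residual degrees of freedom."""
    T, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    df_resid = T - k
    s2 = resid @ resid / df_resid            # unbiased disturbance variance estimate
    cov_beta = s2 * np.linalg.inv(X.T @ X)   # covariance of the estimated coefficients
    return beta, cov_beta, df_resid


def wald_statistic(beta, cov_beta, R, q):
    """Wald statistic for H0: R @ beta = q, with R of shape (J, k) and q of shape (J,)."""
    diff = R @ beta - q
    return float(diff @ np.linalg.solve(R @ cov_beta @ R.T, diff))


# Made-up data: test whether the slope equals 2.
rng = np.random.default_rng(0)
T = 100
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=T)

beta, cov_beta, df_resid = ols_with_stats(y, X)
W = wald_statistic(beta, cov_beta, R=np.array([[0.0, 1.0]]), q=np.array([2.0]))
print(beta, df_resid, W)
```

Under the null hypothesis the statistic is asymptotically chi-squared with as many degrees of freedom as there are rows in `R`.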

Timeline

I have already set up my development environment (including Python, the required modules and git), and I have made a patch to the existing partial code for estimating systems of equations in statsmodels (see [2]).

Weeks [1,4]

- Data structures: user and internal representation of data (using pandas)
- Examples/unit tests: pick some textbook examples
- Refactoring the existing SUR code
- Add support for OLS, WLS

Weeks [5,11]

- Add linear and cross-equation restrictions for the above models
- Classes for SEM; in particular, we need to maintain a good separation between what is endogenous and what is exogenous
- Framework for specifying instrumental variable models
- Add support for 2SLS, W2SLS and 3SLS
- Add support for LIML/FIML

Weeks [12,13]

- Implementing some statistical tests related to systems
- Testing and improving documentation
- If time: Refactoring SVAR to take advantage of the systems of equations code
- If time: linear and cross-equation restrictions for SEM models

References

[1] http://cran.r-project.org/web/packages/systemfit/vignettes/systemfit.pdf

[2] https://github.com/statsmodels/statsmodels/pull/198

[3] Russell Davidson and James G. MacKinnon (2004). Econometric Theory and Methods. Oxford University Press, chapter 12