
SMEP D: Naming Convention

Josef Perktold edited this page May 23, 2014 · 8 revisions


some preliminary notes on naming convention

general rules

  • qualifiers are post-fix, e.g. resid_pearson rather than pearson_resid: this keeps related names together in alphabetical lists such as code completion and help

current rules

  • nobs, n_xxx, k_vars, k_xxx for counts, e.g. n_rows, k_params, k_exog

open issues

Not standardized yet across models. In several cases we already have identical names, but we should still standardize their meaning.

params

params are the parameters that are used in optimization, and for which post-estimation inference (bse, cov_params, tests on parameters) is directly available. Traditionally, these were coefficients in the linear part of the model. In several models we now have extra parameters besides the linear coefficients. Do we use a specific name for the linear coefficients, beta, mean_params, params_mean, ... ? Also, the names of the extra parameters are model specific.

tsa models have params with several components, but they are model specific: ar, ma, trend, exog.
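One way to picture the question, as a sketch only: a flat params vector whose leading entries are the coefficients of the linear part and whose trailing entries are model-specific extra parameters. The helper `split_params` and the argument `k_mean` are hypothetical names chosen for illustration, not statsmodels API.

```python
def split_params(params, k_mean):
    """Split a flat parameter list into (mean part, extra part).

    Hypothetical helper: assumes the first k_mean entries are the
    coefficients of the linear part and the rest are extra parameters.
    """
    params = list(params)
    if not 0 <= k_mean <= len(params):
        raise ValueError("k_mean must be between 0 and len(params)")
    return params[:k_mean], params[k_mean:]


# e.g. two linear coefficients followed by one dispersion parameter
params_mean, params_extra = split_params([0.5, -1.2, 0.8], k_mean=2)
```

Whatever names are chosen, keeping the split in one place would let post-estimation code refer to the linear part without knowing the model-specific layout.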

fittedvalues, resid

see https://github.com/statsmodels/statsmodels/issues/411 and maybe others

My (JP) view

  • fittedvalues = E[y | X]

  • resid = y - E[y | X] should be in all models where it applies; tsa might have conditional E[y_t | X_t, past data] instead (?). resid might not always have a useful statistical interpretation, and we removed it or left it undefined in several models, but it's also useful as an intermediate variable.

  • resid_pearson = (y - E[y | X]) / E(sigma | X)

    maybe unclear: does E(sigma | X) include scale? Relevant for WLS, overdispersed Poisson
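A small worked sketch of these definitions for a Poisson-type mean, where the variance is proportional to the mean. The function name `residuals` and the optional `scale` argument are assumptions for illustration; whether scale belongs in the denominator is exactly the open question.

```python
import math


def residuals(y, fitted_values, scale=1.0):
    """Return (resid, resid_pearson) for a Poisson-type variance function.

    Assumes Var(y | X) = scale * mu, so the standard deviation used in the
    Pearson residual is sqrt(scale * mu). With scale=1 this is plain Poisson.
    """
    resid = [yi - mu for yi, mu in zip(y, fitted_values)]
    resid_pearson = [
        r / math.sqrt(scale * mu) for r, mu in zip(resid, fitted_values)
    ]
    return resid, resid_pearson


y = [3, 0, 6]
mu = [4.0, 1.0, 4.0]            # stand-in for E[y | X]
resid, resid_pearson = residuals(y, mu)
```

Passing `scale=4.0` halves every Pearson residual, which shows how much the interpretation depends on the answer.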

Note: if we want to deprecate fittedvalues without breaking backwards compatibility, then I'd like to use fitted_values as the new name. (don't be stingy on underlines)
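A backwards-compatible rename could be sketched with a property that forwards the old attribute name to the new one and warns. `Results` here is a stand-in class, not the statsmodels results class, and using `DeprecationWarning` is an assumption.

```python
import warnings


class Results:
    """Stand-in results class illustrating an attribute rename."""

    def __init__(self, fitted_values):
        self.fitted_values = fitted_values   # new, underscored name

    @property
    def fittedvalues(self):
        # old name keeps working but points users to the new one
        warnings.warn(
            "fittedvalues is deprecated, use fitted_values",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.fitted_values


res = Results([1.0, 2.0])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old = res.fittedvalues      # still works, but records a warning
```

Existing user code keeps running, and the warning gives a migration path before the old name is removed.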

other resid_xxx are model specific.

counting number of parameters

relevant for df_model, df_resid, aic, bic, k_params, k_exog, ... See issue #1624 for some comments on this: extra parameters in NegativeBinomial, counting scale or not.

still to come: nobs versus n_rows versus weights_sum if we have frequency weights
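To make the stakes concrete, here is a sketch using the standard definitions aic = -2 llf + 2 k and bic = -2 llf + k log(nobs). The numbers are invented for illustration, and the function names are not statsmodels API.

```python
import math


def aic(llf, k_params):
    """Akaike information criterion."""
    return -2.0 * llf + 2.0 * k_params


def bic(llf, k_params, nobs):
    """Bayesian (Schwarz) information criterion."""
    return -2.0 * llf + k_params * math.log(nobs)


llf = -120.0       # made-up log-likelihood
k_exog = 3         # linear coefficients
k_extra = 1        # e.g. a dispersion or scale parameter

# counting the extra parameter shifts aic by exactly 2 per parameter
aic_without = aic(llf, k_exog)
aic_with = aic(llf, k_exog + k_extra)

# with frequency weights, the row count and the weight sum differ,
# so bic depends on which of the two is treated as nobs
freq_weights = [1, 3, 2, 4]
n_rows = len(freq_weights)
weights_sum = sum(freq_weights)
bic_rows = bic(llf, k_exog, n_rows)
bic_weights = bic(llf, k_exog, weights_sum)
```

Neither choice is wrong in itself, but results are only comparable across models if every model counts the same way.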

robust covariance estimation

cov_type, ... currently inconsistent and still missing in several models

Fixed Parameters

e.g. https://github.com/statsmodels/statsmodels/pull/1398/files#r11633744

Two cases:

  • Hardcoded choice: A parameter can be either fixed or not fixed, but users cannot freely choose which parameter to fix during estimation. This is partially an internal tool. Examples: scale in RLM, df in TModel, shape parameters in GLM (not implemented, example negative binomial). Related: GLSAR can estimate with a fixed AR(1) coefficient or estimate it, but it currently uses different methods.
  • Arbitrary fixed parameters: Any of the params can be fixed at a specific value by users. Essentially, we need a mask or index plus the values of the fixed parameters. No consistent pattern yet.
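The "mask plus values" idea for arbitrary fixed parameters can be sketched as follows. The function `expand_params` and its argument names are hypothetical; no such pattern is established in the library yet.

```python
def expand_params(params_free, fixed_mask, fixed_values):
    """Merge free and fixed parameters into the full parameter list.

    fixed_mask   : list of bool, True where the parameter is held fixed
    fixed_values : values for the fixed positions, in order
    params_free  : values for the remaining positions, in order
    """
    if sum(fixed_mask) != len(fixed_values):
        raise ValueError("fixed_values must match the True entries of fixed_mask")
    if len(fixed_mask) - sum(fixed_mask) != len(params_free):
        raise ValueError("params_free must match the False entries of fixed_mask")
    free = iter(params_free)
    fixed = iter(fixed_values)
    return [next(fixed) if is_fixed else next(free) for is_fixed in fixed_mask]


# fix the second of three parameters at 1.0, optimize over the other two
full = expand_params([0.3, -0.7], [False, True, False], [1.0])
```

An optimizer would then work only on `params_free`, while the likelihood is always evaluated at the expanded vector.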

Linear Constraints

(from post to mailing list) We never got good names for the matrix and vector in the constraint.

The general form in econometrics notation is R beta = r

  • in t_test and f_test methods we use r_matrix and q_matrix (R params = q, to avoid capitalized variables)
  • RLS in the sandbox uses constraint and param
  • GEE in master uses lhs and rhs
  • my stochastic restriction in TheilGLS also uses r_matrix and q_matrix

I'm not really a fan of lhs and rhs, because I sometimes write it reversed and it's formal or positional instead of descriptive. But I'm not a big fan of single letter names either, r_matrix and q_matrix are just single letters made verbose.

I would like something descriptive like constraint_matrix, constraint_value, but I don't manage to come up with anything better or shorter.

Stata and statsmodels formula users won't care much, because it's (supposed to be) supported as algebraic formula string "x1 + x2 = 1"
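For reference, this is how the formula string "x1 + x2 = 1" corresponds to the matrix form R params = q, written with the descriptive names proposed above. The function `constraint_residual` is a hypothetical helper, and the parameter ordering (const, x1, x2) is an assumption.

```python
def constraint_residual(constraint_matrix, constraint_value, params):
    """Return R params - q, row by row; all zeros means the constraint holds."""
    return [
        sum(r_ij * p_j for r_ij, p_j in zip(row, params)) - q_i
        for row, q_i in zip(constraint_matrix, constraint_value)
    ]


# "x1 + x2 = 1" for params ordered as (const, x1, x2)
constraint_matrix = [[0.0, 1.0, 1.0]]
constraint_value = [1.0]

# first parameter vector satisfies the constraint, second violates it
ok = constraint_residual(constraint_matrix, constraint_value, [2.0, 0.25, 0.75])
bad = constraint_residual(constraint_matrix, constraint_value, [2.0, 0.5, 0.75])
```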

Stata uses "Cns constraints matrix" to store the matrix, Cns is the variable name. It doesn't seem to store the q_matrix in an accessible way.

It's all Greek to me

things I can never keep apart, even if I looked it up already 10 times:

  • mu versus eta in GLM
  • psi and rho in RLM
  • alpha versus lambda in penalized, fit_regularized
  • alpha for significance level sounds ok, but could also change

Graphics

We have a strict convention on how to create plots (ax argument and returning fig), but details of the argument list and names are still inconsistent.
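The "ax argument and returning fig" convention can be sketched like this. `plot_resid` is a hypothetical function written for illustration, and it assumes matplotlib is available; only the handling of `ax` and the return value reflect the convention.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend for the example
import matplotlib.pyplot as plt


def plot_resid(resid, ax=None):
    """Plot residuals against their index.

    If ax is None a new figure is created, otherwise the plot is drawn
    on the given axes. The figure is returned in both cases.
    """
    if ax is None:
        fig, ax = plt.subplots()
    else:
        fig = ax.figure
    ax.plot(range(len(resid)), resid, "o")
    ax.axhline(0.0, color="k", lw=1)
    ax.set_xlabel("observation")
    ax.set_ylabel("resid")
    return fig


# standalone use: the function creates its own figure
fig_new = plot_resid([0.5, -1.0, 0.2])

# embedded use: draw into axes the caller already owns
fig_existing, ax_existing = plt.subplots()
fig_same = plot_resid([0.1, 0.3], ax=ax_existing)
```

Returning the figure in both cases lets callers save or further customize the plot without caring who created it.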

see https://github.com/statsmodels/statsmodels/issues/487#issuecomment-39671549 for function and method names.

Others

  • `optim_poptions`: move optimization keyword arguments into a dict ?