ENH: add CovDetMCD and det for regression #9227
Hello @josef-pkt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2024-05-12 21:03:40 UTC |
Trying out RLMDetS and MM with the hbk dataset.
DetS with the current starting sets excludes the good influential points 10:14, in addition to the outliers and bad influential points 0:10.
MM gets the good influential points back.
resid scaled for DetS
R includes the first step of BACON in the r6pack starting sets. This is essentially the same as the initial step in
The BACON article has a version 1 base set that uses the standard Mahalanobis distance with mean and Pearson covariance.
Aside: in my current version, my 6pack (or 4pack right now) differs from the R version in around 4 to 7 indices, but r6pack also excludes all influential points (1:14), including the good influential points.
I'm giving up on trying to match the R Det starting sets exactly. Plus, I will keep some of the extra starting sets in
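For reference, the version 1 base set idea mentioned above can be sketched as follows. This is only a minimal illustration with plain numpy, not the code in this PR and not the R implementation; the function name `maha_base_set` and the choice of `h` are assumptions made for the example.

```python
import numpy as np


def maha_base_set(x, h):
    """Indices of the h observations with the smallest classical
    Mahalanobis distance (mean and Pearson covariance).

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    center = x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    diff = x - center
    # squared distances d_i^2 = diff_i' cov^{-1} diff_i
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return np.sort(np.argsort(d2)[:h])


rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
x[:5] += 10  # a few shifted observations
base = maha_base_set(x, h=25)
```

Because mean and covariance are not robust, a base set of this kind is only a starting point and can suffer from masking when contamination is heavy.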
More local search: in the hbk example, using include_endog in the starting Mahalanobis distances improves the estimate; the M-scale is lower than with exog only.
By default we could use both types of starting sets, exog only and [endog, exog]. Then I need trivariate option for
Problem: what if we have only one regressor besides the constant?
It breaks in my current (uncommitted) version, e.g. in the spearmanr starting set.
Example: the Hertzsprung-Russell diagram in the notebook https://www.statsmodels.org/dev/examples/notebooks/generated/robust_models_1.html
This needs to be fixed for RLMDetS, and maybe special cased or raise an exception in covariance.DetMCD and similar.
It looks like my current RLMDet models need at least 3 slope variables in the starting set computation.
Problem with scipy (I'm using '1.7.3'): spearmanr raises an exception, returns 1 value, or returns a matrix, depending on the number of columns.
I guess I can fix the case for k = 2, e.g. in spearmanr.
Update: I replaced scipy spearmanr with the new stats.covariance.corr_rank, which always returns a 2-dim matrix.
The case k=1 without special casing now fails in the cov_iter starting set computation.
For this case I could get two starting sets
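A rank correlation that always returns a 2-dim matrix can be sketched as the Pearson correlation of the column-wise ranks. The helper name `corr_rank_2d` is hypothetical and this is not the `corr_rank` function referred to above, just an illustration of the shape handling for k = 1, 2, 3.

```python
import numpy as np
from scipy import stats


def corr_rank_2d(x):
    """Spearman rank correlation, always returned as a (k, k) array.

    Hypothetical helper: Pearson correlation of column-wise ranks.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    # rank each column separately (average ranks for ties)
    ranks = np.apply_along_axis(stats.rankdata, 0, x)
    # corrcoef returns a scalar for a single column, so force 2-dim
    return np.atleast_2d(np.corrcoef(ranks, rowvar=False))


rng = np.random.default_rng(0)
x = rng.normal(size=(30, 3))
shapes = [corr_rank_2d(x[:, :k]).shape for k in (1, 2, 3)]
```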
Even in the k=1 case, i.e. only 1 exog slope variable, we could still include endog to get to k=2 in the dataset for finding starting sets.
Another case: k=0, i.e. constant-only regression.
Possible problems with 3 or more clusters: that's not really the use case for "robust" including "resistant"; we assume that our model is for the central data.
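A minimal sketch of how the data for the starting sets could be assembled in the k=0 and k=1 cases, assuming the constant has already been dropped from the slope columns. The function `start_data` and its arguments are hypothetical names for this illustration, not the PR's API.

```python
import numpy as np


def start_data(endog, exog_slopes=None, include_endog=True):
    """Columns used for finding starting sets (hypothetical helper).

    exog_slopes holds the regressors without the constant; it may be
    None or have zero columns in the constant-only case.
    """
    cols = []
    if include_endog:
        cols.append(np.asarray(endog, dtype=float).reshape(-1, 1))
    if exog_slopes is not None:
        ex = np.asarray(exog_slopes, dtype=float)
        if ex.ndim == 1:
            ex = ex[:, None]
        if ex.shape[1] > 0:
            cols.append(ex)
    if not cols:
        raise ValueError("no columns available for starting sets")
    return np.column_stack(cols)


rng = np.random.default_rng(0)
y = rng.normal(size=20)
x1 = rng.normal(size=20)
d_k1 = start_data(y, x1)    # k=1: endog plus one slope variable
d_k0 = start_data(y, None)  # k=0: endog only
```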
… for 0 or 1 start_exog
Problem for citing RLMDetS and RLMDetMM: I can't find a reference for them.
All the articles that I used are for multivariate location and scatter, and I can't find an article directly for the regression version.
There is an article (*) for multivariate regression based on MCD scatter matrices in the linear moment conditions.
Maybe I saw detxxx regression in comments in articles, but I need to go through all the articles again.
(*) Rousseeuw, Peter J., Stefan Van Aelst, Katrien Van Driessen, and Jose Agulló. 2004. “Robust Multivariate Regression.” Technometrics 46 (3): 293–305.
Notebook draft for S- and MM-regression, based on the current (uncommitted) code.
Strange: the "raw" cell with the results from R does not show up in the gist.
I still have problems with norm class versus norm instance as argument to functions and models.
In-place modification of a user-provided norm instance to adjust the tuning parameter sounds much too fragile.
Current problem: what should the
If string or class, then I need
Related problem: RLMDetSMM needs two instances of the norm with different tuning parameters. So I need to be able to create a new instance with the same keyword arguments but different tuning.
One possibility is to add clone or copy to create new instances that can be modified.
Solution for now: I'm currently only working with TukeyBiweight for DetS and DetMM. Other redescending norms can be added once the design has mostly settled.
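The clone/copy idea could look roughly like the sketch below. It assumes the tuning parameter is stored as a plain attribute (here `c`); `BiweightStandIn` and `clone_with_tuning` are hypothetical names for this illustration, not existing statsmodels API.

```python
import copy


class BiweightStandIn:
    """Stand-in for a norm class with one tuning parameter ``c``."""

    def __init__(self, c=4.685):
        self.c = c


def clone_with_tuning(norm, **tuning):
    """Return a shallow copy of ``norm`` with updated tuning attributes.

    The user-provided instance is left unchanged.
    """
    new = copy.copy(norm)
    for name, value in tuning.items():
        if not hasattr(new, name):
            raise AttributeError(f"norm has no tuning attribute {name!r}")
        setattr(new, name, value)
    return new


norm_mean = BiweightStandIn(c=4.685)                 # high-efficiency tuning
norm_scale = clone_with_tuning(norm_mean, c=1.547)   # high-breakdown tuning
```

This avoids modifying the user's instance in place, at the cost of requiring every norm to keep its tuning parameters as copyable attributes.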
Oops, the unit tests for tools fail.
41b4bb2 to bb0a83c
Some notes on CovM, similar to enhanced RLM (with M-scale; used by, and delegated to from, the S-estimator):
I might need a method option
Problem: I still don't know how to compute the efficiency in the multivariate case to get the tuning parameter for a desired efficiency.
Given this CovM, CovS only needs to handle starting sets, and CovMM is just a call to CovM with fixed scale.
@@ -1,6 +1,6 @@
import numpy as np

# TODO: add plots to weighting functions for online docs.
from . import tools as rtools
Check notice
Code scanning / CodeQL
Cyclic import Note
statsmodels.robust.tools
def tuning_s_cov(norm, k_vars, breakdown_point=0.5, limits=()):
    """Tuning parameter for multivariate S-estimator given breakdown point.
    """
    from .norms import TukeyBiweight  # avoid circular import
Check notice
Code scanning / CodeQL
Cyclic import Note
statsmodels.robust.norms
@@ -28,10 +29,16 @@
import numpy as np
from scipy import stats, linalg
from scipy.linalg.lapack import dtrtri
from .scale import mad
from .scale import mad, qn_scale, _scale_iter
Check notice
Code scanning / CodeQL
Unused import Note
Import of '_scale_iter' is not used.
Aside: rrcov CovSest and CovMMest only allow for biweight and Rocke norms.
Currently I'm only working with biweight, but intend to add norm options.
Oops: Tyler constrained M-estimation requires scale > scale_S (not scale < scale_S, as I initially thought and used).
rho is an increasing function.
However, the weights are a decreasing function, w(|u| / s).
However, CM has a different objective from my MM scale adjustment.
Maybe: to make my MM scaling theoretically clean, I could just re-estimate the MM-estimate with a new S-estimator start given by the MM parameter estimates. That would be an extra detour step as a guarantee for the theoretical assertion on maintaining the breakdown point.
(All because I don't know what the breakdown point is if rho_mean and rho_scale have different breakdown points. Huber has an article, but not directly with the result, and/or I don't understand enough to figure it out.)
Kudraszow, Nadia L., and Ricardo A. Maronna. 2011. “Estimates of MM Type for the Multivariate Linear Model.” Journal of Multivariate Analysis 102 (9): 1280–92. https://doi.org/10.1016/j.jmva.2011.04.011.
The article has a Table 1 with tuning parameters for the breakdown point, and Table 1 is the same as what I have for bp=0.5 and k=3.
CovMMest has c = 6.09626 if 95% efficiency for shape is requested (default).
(Maronna uses rho normalized to max rho = 1. That does not affect the rho function, but it's in terms of u/c.)
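To make the normalization remark concrete, here is a small sketch of the Tukey biweight rho in both scalings, together with the weight function. These are standalone numpy functions written for this comment (the function names are assumptions), using the c value quoted above only as an example input.

```python
import numpy as np


def rho_biweight(u, c):
    """Biweight rho with maximum c**2 / 6, constant for |u| >= c."""
    z = np.clip(np.abs(np.asarray(u, dtype=float)) / c, 0, 1)
    return c**2 / 6 * (1 - (1 - z**2) ** 3)


def rho_biweight_normalized(u, c):
    """Biweight rho normalized to max rho = 1, a function of u / c."""
    z = np.clip(np.abs(np.asarray(u, dtype=float)) / c, 0, 1)
    return 1 - (1 - z**2) ** 3


def weights_biweight(u, c):
    """Biweight weights, decreasing in |u| and zero for |u| >= c."""
    z = np.clip(np.abs(np.asarray(u, dtype=float)) / c, 0, 1)
    return (1 - z**2) ** 2


c = 6.09626
u = np.linspace(0, 8, 81)
rho = rho_biweight(u, c)
rho_n = rho_biweight_normalized(u, c)
w = weights_biweight(u, c)
```

The two rho versions differ only by the constant factor c**2 / 6, so they lead to the same weights; rho is nondecreasing in |u| while the weights are nonincreasing.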
CovMCD is unfinished and not working correctly yet (it does not replicate R); I got distracted by CovM, CovS, CovMM.
Methods take data, data is not in
Update: better to change the signature of the maha function; I make the same mistake in interactive work as well.
My current uncommitted CovDetMCD now produces the same results as rrcov.
""" | ||
|
||
x = self.data | ||
nobs, k_vars = x.shape |
Check warning
Code scanning / CodeQL
Variable defined multiple times Warning
redefined
""" | ||
|
||
x = self.data | ||
nobs, k_vars = x.shape |
Check warning
Code scanning / CodeQL
Variable defined multiple times Warning
redefined
A bit strange:
One possible reason: currently I only iterate the one best start up to a large maxiter. The DetMCD article uses the 2 best starts.
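Iterating a start to convergence can be sketched with plain concentration steps (C-steps), and several starts can be compared by the log-determinant of the resulting covariance. This is a generic illustration, not the CovDetMCD code in this PR; `c_steps` and `best_of_starts` are hypothetical names, and reweighting and consistency corrections are left out.

```python
import numpy as np


def _maha2(x, center, cov):
    """Squared Mahalanobis distances of the rows of x."""
    diff = x - center
    return np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)


def c_steps(x, start_idx, h, maxiter=100):
    """Iterate concentration steps from a starting subset."""
    x = np.asarray(x, dtype=float)
    idx = np.sort(np.asarray(start_idx))
    for _ in range(maxiter):
        center = x[idx].mean(axis=0)
        cov = np.atleast_2d(np.cov(x[idx], rowvar=False))
        # keep the h observations closest to the current estimate
        new_idx = np.sort(np.argsort(_maha2(x, center, cov))[:h])
        if np.array_equal(new_idx, idx):
            break
        idx = new_idx
    # recompute so that center and cov always match the final subset
    center = x[idx].mean(axis=0)
    cov = np.atleast_2d(np.cov(x[idx], rowvar=False))
    return center, cov, idx


def best_of_starts(x, starts, h, maxiter=100):
    """Run C-steps from each start, keep the smallest log-determinant."""
    results = [c_steps(x, s, h, maxiter=maxiter) for s in starts]
    return min(results, key=lambda res: np.linalg.slogdet(res[1])[1])


rng = np.random.default_rng(1)
x = rng.normal(size=(60, 3))
x[:8] += 6
h = 32
starts = [np.arange(h), np.arange(60 - h, 60)]
center, cov, idx = best_of_starts(x, starts, h)
```

Each C-step cannot increase the determinant of the subset covariance, so iterating more than one of the best starts only costs extra computation and can end in a different local optimum.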
The plan is to add CovDetMCD and CovDetS for covariance, and RLMDetS and RLMDetMM for regression.
code for starting sets and CovDetXxx is in robust.covariance
CovDetMCD
no sixpack starting sets yet
The results (cov estimates) are in the same neighborhood as R robustbase/rrcov, but still different. I don't see where the difference comes from, and I am not sure I understand what R is doing. Currently I'm only comparing "raw" mcd estimates.
The example dataset is hbk.