
ENH: add CovDetMCD and det for regression #9227

Open, wants to merge 10 commits into base: main
Conversation

josef-pkt
Member

plan is to add CovDetMCD, CovDetS for covariance and RLMDetS and RLMDetMM for regression.

code for starting sets and CovDetXxx is in robust.covariance

CovDetMCD
no sixpack starting sets yet
The results (cov estimates) are in the same neighborhood as R robustbase/rrcov, but still differ. I don't see where the difference comes from; I'm not sure I understand what R is doing. Currently I'm only comparing "raw" MCD estimates.
example dataset is hbk

@pep8speaks

pep8speaks commented Apr 23, 2024

Hello @josef-pkt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 32:1: F401 '.scale.qn_scale' imported but unused
Line 32:1: F401 '.scale._scale_iter' imported but unused
Line 1428:21: E124 closing bracket does not match visual indentation

Line 922:13: E129 visually indented line with same indent as next logical line

Line 16:42: E271 multiple spaces after keyword

Line 297:9: E306 expected 1 blank line before a nested definition, found 0

Comment last updated at 2024-05-12 21:03:40 UTC

@josef-pkt
Member Author

trying out RLMDetS and MM with the hbk dataset

DetS with the current starts excludes the good influential points 10:14, besides the outliers and bad influential points 0:10.
(S in R robustbase also includes the good influential points 10:14 in the S-estimator. I'm currently not able to get those, because I don't allow x-outlier/influential points in the starting set, and I'm not getting a slope that would catch the 10:14 points.)

MM gets the good influential points back.
And using RLM with a high-efficiency biweight mean norm and a high-breakdown m-scale norm, with the same S-estimator start_params, produces results very similar to MM:
res_ms = RLM.from_formula("Y ~ X1 + X2 + X3", dta_hbk, M=norm_m).fit(start_params=res.params, scale_est=mscale_biw)
instead of a fixed scale as in MM.
(I don't remember whether this is the "restricted M" estimator.)

resid scaled for DetS (figure omitted)

resid scaled for MM using DetS as start_params (figure omitted)

@josef-pkt
Member Author

R includes the first step of BACON in the r6pack starting sets.
This seems to use a Mahalanobis distance based on the median, and the Euclidean distance.
We apply it to robustly z-scored data.

This is essentially the same as the initial step in _cov_starting.

The BACON article has a version-1 base set that uses the standard maha distance with mean and Pearson covariance.
Billor, Nedret, Ali S. Hadi, and Paul F. Velleman. 2000. “BACON: Blocked Adaptive Computationally Efficient Outlier Nominators.” Computational Statistics & Data Analysis 34 (3): 279–98. https://doi.org/10.1016/S0167-9473(99)00101-2.
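The median-based first step can be sketched in a few lines of numpy. This is only an illustration of the idea (robust z-scores, then Euclidean norm, keep the h closest points), not the statsmodels or robustbase implementation; the function name is made up.

```python
import numpy as np

def bacon_initial_subset(x, h):
    """Initial h-subset in the spirit of BACON's first step (version 2):
    Euclidean distances from the coordinatewise median on robustly
    z-scored data. Numpy-only sketch, not the statsmodels code."""
    x = np.asarray(x)
    med = np.median(x, axis=0)
    # MAD, scaled for consistency at the normal distribution
    mad = np.median(np.abs(x - med), axis=0) / 0.6745
    z = (x - med) / mad
    # norm of robust z-scores, i.e. maha distance from the median
    # with a diagonal robust scatter matrix
    d = np.sqrt((z ** 2).sum(axis=1))
    return np.argsort(d)[:h]
```

The subset then seeds the concentration steps like the other deterministic starts.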

aside:
r6pack includes some kind of reweighting step (that I'm not able to replicate).
I don't see a reference for it, and I have not seen the reweighting step in the original DetXxx articles.

In my current version my 6-pack (or 4-pack right now) differs from the R version in around 4 to 7 indices, but r6pack also excludes all influential points (1:14), including the good influential points.

I'm giving up trying to match R Det starting sets exactly. Plus I will keep some of the extra starting sets in _cov_starting.

@josef-pkt
Member Author

more local search:

in the hbk example, include_endog in the starting maha distances improves the estimate; the M-scale is lower than with exog only.

mod_smm = RLMDetSMM(dta_hbk["Y"], exog_df, include_endog=True)
res_smm = mod_smm.fit(h=40)

By default we could use both types of starting sets, exog only and [endog, exog]. Then I need a three-way option for the start data: "exog", "endog", "both".
This will increase the chance of finding an (almost) global optimum and still be much cheaper than FastXxx.
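Selecting the start data could look roughly like the sketch below. The function name and the `start_data` keyword are hypothetical, not the PR's API; it only illustrates the three-way option.

```python
import numpy as np

def build_start_data(endog, exog, start_data="exog"):
    """Data matrix used for starting-set maha distances.

    start_data is a hypothetical three-way option:
    "exog", "endog", or "both" (i.e. [endog, exog])."""
    endog = np.asarray(endog).reshape(-1, 1)
    exog = np.asarray(exog)
    if start_data == "exog":
        return exog
    elif start_data == "endog":
        return endog
    elif start_data == "both":
        return np.column_stack([endog, exog])
    raise ValueError("start_data must be 'exog', 'endog' or 'both'")
```

Running the start computation once per option and pooling the candidate subsets would give the larger pool of starts.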

@josef-pkt
Member Author

josef-pkt commented Apr 25, 2024

Problem: What if we have only one regressor besides constant?

It breaks in my current (uncommitted) version, e.g. in the spearmanr starting set.
In general, the sixpack has several starting sets that are correlation only, i.e. the diagonal is 1. spearmanr in my scipy version raises an exception.
In other cases (I don't remember where) I had returned only the corr coefficient instead of the 2 by 2 cov matrix if we have only two variables.

Example Hertzsprung Russel diagram in notebook https://www.statsmodels.org/dev/examples/notebooks/generated/robust_models_1.html
which is also used in R robustbase lmrob docstring

This needs to be fixed for RLMDetS, and maybe special-cased or made to raise an exception in covariance.DetMCD and similar.

It looks like my current RLMDet models need at least 3 slope variables in the starting set computation.

problem with scipy, I'm using '1.7.3':

spearmanr raises an exception, or returns 1 value or a correlation matrix, depending on the number of columns

stats.spearmanr(np.random.randn(50, 3))
SpearmanrResult(correlation=array([[ 1.        , -0.11308523,  0.29392557],
       [-0.11308523,  1.        ,  0.13939976],
       [ 0.29392557,  0.13939976,  1.        ]]), pvalue=array([[0.        , 0.43426076, 0.03828331],
       [0.43426076, 0.        , 0.33429506],
       [0.03828331, 0.33429506, 0.        ]]))

stats.spearmanr(np.random.randn(50, 2))
SpearmanrResult(correlation=-0.03442977190876351, pvalue=0.8123713972775561)

stats.spearmanr(np.random.randn(50, 1))
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_2228\3897072665.py in <module>
----> 1 stats.spearmanr(np.random.randn(50, 1))

~\Downloads\WPy64-39100\python-3.9.10.amd64\lib\site-packages\scipy\stats\stats.py in spearmanr(a, b, axis, nan_policy, alternative)
   4479 
   4480     if axisout == 0:
-> 4481         if (a[:, 0][0] == a[:, 0]).all() or (a[:, 1][0] == a[:, 1]).all():
   4482             # If an input is constant, the correlation coefficient
   4483             # is not defined.

IndexError: index 1 is out of bounds for axis 1 with size 1

I guess I can fix the case k = 2, e.g. in spearmanr,
and add special code for k = 1, e.g. the h-set closest to the median, smallest abs(x - median(x)); scale does not matter.

update: I replaced scipy's spearmanr with the new stats.covariance.corr_rank, which always returns a 2-dim matrix.
The case k=2 now works.
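The idea behind an always-2-dim rank correlation can be sketched with scipy's rankdata plus numpy's corrcoef; this is only an illustration of the behavior, not the statsmodels corr_rank implementation.

```python
import numpy as np
from scipy import stats

def corr_rank(data):
    """Spearman rank correlation as an always 2-dim matrix.

    Rank each column, take the Pearson correlation of the ranks, and
    force a 2-dim result, so k=1 and k=2 return (1, 1) and (2, 2)
    matrices instead of a scalar or an exception."""
    data = np.asarray(data)
    ranks = np.apply_along_axis(stats.rankdata, 0, data)
    corr = np.corrcoef(ranks, rowvar=False)
    return np.atleast_2d(corr)
```

For k >= 3 this agrees with scipy's spearmanr correlation matrix.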

Case k=1 without special casing now fails in cov_iter starting set computation

For this case I could get two starting sets

  • smallest abs(x - median(x)) (metric trimming)
  • and equal tail trimming: take h // 2 observations from both sides of median, i.e. quantiles, adjusted for count=h.
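The two candidate h-subsets for a single variable can be sketched as index sets; the function is illustrative only (the names are made up, and ties and edge cases are ignored).

```python
import numpy as np

def starting_sets_k1(x, h):
    """Two candidate h-subsets for one variable (k=1), as index arrays:
    metric trimming around the median, and equal-tail trimming."""
    x = np.asarray(x)
    med = np.median(x)
    # metric trimming: the h observations closest to the median
    idx_metric = np.argsort(np.abs(x - med))[:h]
    # equal-tail trimming: a symmetric block of h observations
    # around the median in sort order (h // 2 from each side)
    idx_sorted = np.argsort(x)
    lo = (len(x) - h) // 2
    idx_tail = idx_sorted[lo:lo + h]
    return idx_metric, idx_tail
```

Both subsets exclude points far from the median; they differ for skewed data.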

Even in the k=1 case, i.e. only 1 exog slope variable, we could still include endog to get to k=2 in the dataset for finding starting sets.

another case: k=0, i.e. constant-only regression
What do we do?
There are no exog outliers.
Any "outlier" ranking would be arbitrary and could not depend on exog values.
But we still have multiple local optima with redescending estimators, e.g. 3 clusters of y where the middle cluster is small.
What's the S-estimator for the mean?
standard S-estimator starting with the median (and possibly quantiles) and qn_scale (or mad, scale_tau, ...)
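As a stand-in for the univariate estimator discussed above, here is a minimal redescending M-estimate of location with a median start and mad scale (the scale is held fixed rather than jointly estimated, so this is not a full S-estimator; all names are illustrative).

```python
import numpy as np

def biweight_location(x, c=4.685, maxiter=50, tol=1e-8):
    """Redescending (Tukey biweight) location M-estimate.

    Median start, scale fixed at the normalized mad; a sketch of the
    robust mean discussed for the k=0 case, not statsmodels code."""
    x = np.asarray(x)
    mu = np.median(x)
    scale = np.median(np.abs(x - mu)) / 0.6745
    for _ in range(maxiter):
        z = (x - mu) / (c * scale)
        # biweight weights, zero beyond the tuning constant
        w = np.where(np.abs(z) < 1, (1 - z ** 2) ** 2, 0.0)
        mu_new = (w * x).sum() / w.sum()
        if abs(mu_new - mu) < tol * scale:
            mu = mu_new
            break
        mu = mu_new
    return mu
```

Because the weights redescend to zero, a cluster of gross outliers gets weight zero and does not pull the estimate.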

Possible problems with 3 or more clusters:
What if the median is not in the main cluster?
E.g. two outer x and/or y clusters with the "outlier" cluster in the middle.

That's not really the use case for "robust" (including "resistant"); we assume that our model is for the central data.
Identifying and estimating several clusters is a different issue, e.g. mixture models, cluster analysis, ...,
e.g. if no cluster has 50% of the observations.

@josef-pkt
Member Author

josef-pkt commented Apr 25, 2024

(faking MM by adding two random exog) now replaced

This includes MM results of the model with 1 slope parameter, but including endog in the start maha computation

(figure omitted)

@josef-pkt
Member Author

problem for citing RLMDetS and RLMDetMM

I don't find a reference for it. All the printed articles that I used are for multivariate location and scatter, and I don't find an article directly for the regression version.
DetR package also does not have any citation directly related to deterministic starts in regression.
There is a conference abstract that includes DetLTS, but I don't find an article for it. https://ww2.amstat.org/meetings/JSM/2010/onlineprogram/AbstractDetails.cfm?abstractid=307805

There is an article (*) for multivariate regression based on MCD scatter matrices in the linear moment conditions.
(Article is older than DetMCD).

Maybe I saw detxxx regression in comments in articles, but I need to go through all the articles again.

(*) Rousseeuw, Peter J., Stefan Van Aelst, Katrien Van Driessen, and Jose Agulló. 2004. “Robust Multivariate Regression.” Technometrics 46 (3): 293–305.

@josef-pkt
Member Author

josef-pkt commented May 1, 2024

notebook draft for S- and MM-regression, based on current (uncommitted) code
https://gist.github.com/josef-pkt/dd18a8ec5e968a3e4c426c98a8aa43d0

strange, the "raw" cell with the results from R does not show up in the gist
fixed now by converting raw to markdown cell

@josef-pkt
Member Author

I still have problems with norm class versus norm instance as argument to function and model __init__.
continuation of #9186 (comment)

In-place modification of a user-provided norm instance to adjust the tuning parameter seems much too fragile.
But if I don't have an instance as an argument, then I need the extra keyword parameters.

current problem: what should the norm be in RLMDetS, string, class or instance?

    def __init__(self, endog, exog, norm=None, breakdown_point=0.5,
                 col_indices=None, include_endog=False):

If string or class, then I need kwds_norm
If instance, then I need to set the tuning parameter, if breakdown_point is not None.

related problem: RLMDetSMM needs two instances of the norm with different tuning parameters. So, I need to be able to create a new instance with the same kwd arguments but different tuning.

    def __init__(self, endog, exog, norm=None, efficiency=0.95, 
                 breakdown_point=0.5, col_indices=None, include_endog=False):

One possibility is to add a clone or copy method to create new instances that can be modified.
This would add even more code just to avoid kwds_norm.
But there is also still the problem that tuning parameters have inconsistent naming across norms, i.e. we don't know generically the keyword name(s) for the norm parameters. :(

solution for now:
I will add an inplace option (returning a modified clone when inplace=False) to _set_tuning_param. This method has all the norm-specific information, so we don't need a generic copy/clone:
def _set_tuning_param(self, c, inplace=False)
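The clone-versus-inplace pattern can be sketched with a toy norm class. MyNorm is a stand-in, not a statsmodels class; the real norms store their tuning constants under norm-specific attribute names, which is exactly why the method, not a generic copy, does the work.

```python
import copy

class MyNorm:
    """Toy norm illustrating _set_tuning_param(c, inplace=False)."""

    def __init__(self, c=1.547):
        self.c = c

    def _set_tuning_param(self, c, inplace=False):
        if inplace:
            # modify this instance (current backwards-compatible default)
            self.c = c
            return self
        # clone: shallow-copy keeps other attributes, only tuning changes
        new = copy.copy(self)
        new.c = c
        return new
```

Usage for the two tunings needed by an SMM-style estimator, without touching the user's instance:

```python
norm_s = MyNorm(1.547)                      # high breakdown point
norm_mm = norm_s._set_tuning_param(4.685)   # clone with high efficiency
```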

I'm currently only working with TukeyBiweight for DetS and DetMM. Other redescending norms can be added once the design has mostly settled.

@josef-pkt
Member Author

oops, unit tests for tools fail.

inplace=False as default is backwards incompatible with robust.tools.
So for now I will use the default inplace=True (amending the last commit and force pushing).
However, this should change back to False once all norm classes have the keyword and can do inplace=False.

@josef-pkt
Member Author

some notes on CovM, similar to an enhanced RLM (with M-scale; used as the delegation target for the S-estimator)

  • weight and scale options, similar to M-estimators that I have already in functions, however several versions how mean and scatter norms or weights are linked
    • for S-estimator, common norm: weight function for mean and shape and norm for scale (what I have in draft for usage by CovDetS)
    • M-estimator with common norm, objective function, e.g. Kent, Tyler 1991, uniqueness for weakly redescending norms, t(df>=1) has unique scatter
    • M-estimator of shape/scatter: scale not estimated, e.g. fixed to 1
    • shape estimation with "ad-hoc" scaling, e.g. median maha
  • general M-estimation: norm for mean and scatter/scale can differ.

I might need a method option, method="S", "M", "MM", with maybe an extra option for the scaling of the scatter matrix if it is not implied by the method.

Problem: I still don't know how to compute the efficiency in the multivariate case to get the tuning parameter for a desired efficiency.
Also, which efficiency: mean or shape, or even cov (which also depends on the scale estimator)?

Given this CovM, CovS only needs to handle starting sets, and CovMM is just a call to CovM with fixed scale,
the same pattern as for RLM, RLMDetS and RLMDetMM.
minimal code, but not tuned for computational efficiency and speed.

@@ -1,6 +1,6 @@
import numpy as np

# TODO: add plots to weighting functions for online docs.
from . import tools as rtools

Code scanning / CodeQL note: Import of module statsmodels.robust.tools begins an import cycle.
def tuning_s_cov(norm, k_vars, breakdown_point=0.5, limits=()):
"""Tuning parameter for multivariate S-estimator given breakdown point.
"""
from .norms import TukeyBiweight # avoid circular import

Code scanning / CodeQL note: Import of module statsmodels.robust.norms begins an import cycle.
@@ -28,10 +29,16 @@
import numpy as np
from scipy import stats, linalg
from scipy.linalg.lapack import dtrtri
from .scale import mad
from .scale import mad, qn_scale, _scale_iter

Code scanning / CodeQL note: Imports of 'qn_scale' and '_scale_iter' are not used.
@josef-pkt
Member Author

aside: rrcov CovSest and CovMMest only allow the biweight and Rocke norms.
The CovMMest docstring does not mention which norm, but I guess it is implied by the Sest options.

currently I'm only working with biweight, but intend to add norm options.

@josef-pkt
Member Author

oops: Tyler constrained M-estimation requires scale > scale_S (not scale < scale_S, as I initially thought and used)

rho is an increasing function:
increasing the scale lowers rho(u / s), so a lower rho value satisfies the inequality constraint rho(u / s) <= s_0

However, the weights are a decreasing function w(|u| / s):
increasing the scale increases the weights and becomes more inclusive;
lowering the scale lowers the weights and becomes more restrictive
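The monotonicity argument can be checked numerically with a self-contained Tukey biweight rho and weight function (written out here rather than using the statsmodels norm classes):

```python
import numpy as np

def rho_biweight(z, c=1.547):
    # Tukey biweight rho, bounded above by c**2 / 6
    z = np.abs(z)
    return np.where(z <= c,
                    c**2 / 6 * (1 - (1 - (z / c) ** 2) ** 3),
                    c**2 / 6)

def w_biweight(z, c=1.547):
    # weight function w(z) = psi(z) / z, decreasing in |z|
    z = np.abs(z)
    return np.where(z <= c, (1 - (z / c) ** 2) ** 2, 0.0)

u = 1.0
scales = np.array([0.5, 1.0, 2.0, 4.0])
rhos = rho_biweight(u / scales)
weights = w_biweight(u / scales)
# increasing the scale lowers rho(u / s) ...
assert np.all(np.diff(rhos) <= 0)
# ... and raises the weights, i.e. becomes more inclusive
assert np.all(np.diff(weights) >= 0)
```

With u fixed, growing s shrinks |u / s|, which moves rho down and the weight up, matching the verbal argument.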

However, CM has a different objective from my MM scale adjustment.
CM wants to increase efficiency, i.e. reduce weights and include more points.
MM with a lower scale wants to find a better local solution without giving up the constraint that we do not increase the scale beyond the fixed scale_0 of MM, which would have guaranteed the breakdown point.

maybe: to make my MM scaling theoretically clean, I could just re-estimate the MM-estimate with a new S-estimator start given by the MM parameter estimates. An extra detour step as a guarantee to get the theoretical assertion on maintaining the breakdown point.

(all because I don't know what the breakdown point is if rho_mean and rho_scale have different breakdown points. Huber has an article but not directly with the result, and/or I don't understand enough to figure it out.)

@josef-pkt
Member Author

Kudraszow, Nadia L., and Ricardo A. Maronna. 2011. “Estimates of MM Type for the Multivariate Linear Model.” Journal of Multivariate Analysis 102 (9): 1280–92. https://doi.org/10.1016/j.jmva.2011.04.011.

has in table 1 the tuning parameter for a given breakdown point, and
in table 2 the tuning parameter for a given efficiency of the mean parameter estimates.
The formula for the ARE of the mean is in eq. 6.6, which has 3 expectations, i.e. 3 integrals.

table 1 is the same as I have for bp=0.5 and k=3;
however, for 95% efficiency at k=3, KM have c=5.48 in table 2 while rrcov CovMMest uses 5.49024917207033

CovMMest has c = 6.09626 if 95% efficiency for shape is requested (default)

(Maronna uses rho normalized to max rho = 1. That does not affect the rho function, but it's in terms of u/c.)

@josef-pkt
Member Author

josef-pkt commented May 11, 2024

CovMCD is unfinished and not working correctly yet (it does not replicate R); I got distracted by CovM, CovS, CovMM.
The CovMCD methods are just copy-pasted functions from my experimental code.

The methods take data; data is not in __init__. We need that at least for the helper methods, to compute the iteration for different starts and subsets of the data.
The fit method could use the data provided by __init__.
This way CovMCD would be like a standard model and like CovM, and we hide the computation with multiple starts in private methods.
In contrast, RLMDetS just delegates the computation to RLM and creates a new RLM instance for every start. For MCD we don't have a different model class to delegate to.
CovDetS will be able to delegate to CovM.

update
stupid mistake: the call to the maha function in CovDetMCD is wrong. The mahalanobis function does not take the mean as an argument:
the call is d = mahalanobis(x - mean, cov) instead of
d = mahalanobis(x, mean, cov)

better to change the signature of the maha function; I make the same mistake in interactive work too.
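A signature with an explicit mean argument could look like the sketch below; this is only an illustration of the proposed change, not the statsmodels helper (which expects already-centered data).

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Squared Mahalanobis distances with an explicit mean argument.

    Sketch of the signature change discussed above; centering happens
    inside, so x is passed uncentered."""
    xc = np.atleast_2d(x) - mean
    cov_inv = np.linalg.inv(cov)
    # row-wise quadratic form xc @ cov_inv @ xc.T, diagonal only
    return (xc @ cov_inv * xc).sum(axis=1)
```

With an identity cov this reduces to squared Euclidean distances from the mean, which makes the centering convention easy to verify interactively.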

My current uncommitted CovDetMCD now produces the same results as rrcov
CovMcd(x = hbk, raw.only = TRUE, nsamp = "deterministic", use.correction=FALSE)


if ((cov - cov_new)**2).mean() < tol:
    cov = cov_new
    converged = True

Code scanning / CodeQL note: Variable converged is not used.
"""

x = self.data
nobs, k_vars = x.shape

Check warning

Code scanning / CodeQL

Variable defined multiple times Warning

This assignment to 'nobs' is unnecessary as it is
redefined
before this value is used.
"""

x = self.data
nobs, k_vars = x.shape

Check warning

Code scanning / CodeQL

Variable defined multiple times Warning

This assignment to 'k_vars' is unnecessary as it is
redefined
before this value is used.
@josef-pkt
Member Author

a bit strange:
the default maxiter in the refinement step for CovDetMCD was hardcoded at 100.
If I fix it and reduce maxiter_step to the recommended 2, then a unit test fails.
It looks like my current implementation requires a large maxiter for every starting set.

one possible reason: currently I only iterate the single best start to large maxiter, while the DetMCD article uses the 2 best starts.
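The two-stage refinement being discussed can be sketched as follows. This is a simplified illustration with made-up names (and it keeps only the single best start, as in the current implementation, rather than the 2 best the DetMCD article recommends):

```python
import numpy as np

def c_step(x, idx, h):
    """One MCD concentration step: fit mean/cov on the subset,
    then keep the h observations with smallest maha distance."""
    sub = x[idx]
    mean = sub.mean(0)
    cov = np.cov(sub, rowvar=False)
    xc = x - mean
    d = (xc @ np.linalg.inv(cov) * xc).sum(1)
    return np.argsort(d)[:h]

def refine_starts(x, starts, h, maxiter_step=2, maxiter=100):
    """Short-refine every start, fully iterate only the best one."""
    refined = []
    for idx in starts:
        for _ in range(maxiter_step):
            idx = c_step(x, idx, h)
        det = np.linalg.det(np.cov(x[idx], rowvar=False))
        refined.append((det, idx))
    # best start = smallest covariance determinant after short refinement
    _, idx = min(refined, key=lambda t: t[0])
    for _ in range(maxiter):
        idx_new = c_step(x, idx, h)
        if np.array_equal(np.sort(idx_new), np.sort(idx)):
            break  # subset unchanged: converged
        idx = idx_new
    return idx
```

If maxiter_step=2 is not enough in practice, that suggests the short refinement is not separating the starts well, which fits the observation above.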
