
ENH: add CovDetMCD and det for regression #9227

Open, wants to merge 10 commits into base: main
Conversation

josef-pkt
Member

plan is to add CovDetMCD, CovDetS for covariance and RLMDetS and RLMDetMM for regression.

code for starting sets and CovDetXxx is in robust.covariance

CovDetMCD
no sixpack starting sets yet
The results (cov estimates) are in the same neighborhood as R robustbase/rrcov, but still differ. I don't see where the difference comes from; I'm not sure I understand what R is doing. Currently I'm only comparing "raw" MCD estimates.
example dataset is hbk

@pep8speaks

pep8speaks commented Apr 23, 2024

Hello @josef-pkt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 32:1: F401 '.scale.qn_scale' imported but unused
Line 32:1: F401 '.scale._scale_iter' imported but unused
Line 1428:21: E124 closing bracket does not match visual indentation

Line 922:13: E129 visually indented line with same indent as next logical line

Line 16:42: E271 multiple spaces after keyword

Line 297:9: E306 expected 1 blank line before a nested definition, found 0

Comment last updated at 2024-05-12 21:03:40 UTC

@josef-pkt
Member Author

trying out RLMDetS and MM with the hbk dataset

DetS with the current starts excludes the good influential points 10:14, besides the outliers and bad influential points 0:10.
(S in R robustbase also includes the good influential points 10:14 in the S-estimator. I'm currently not able to get those, because I don't allow x-outlier/influential points in the starting set, and I'm not getting a slope that would catch the 10:14 points.)

MM gets the good influential points back.
And using RLM with a high-efficiency biweight mean norm and a high-breakdown m-scale norm, with the same S-estimator start_params, produces results very similar to MM:
res_ms = RLM.from_formula("Y ~ X1 + X2 + X3", dta_hbk, M=norm_m).fit(start_params=res.params, scale_est=mscale_biw)
instead of a fixed scale as in MM.
(I don't remember whether this is the "restricted M" estimator.)

resid scaled for DetS (figure omitted)

resid scaled for MM using DetS as start_params (figure omitted)

@josef-pkt
Member Author

R includes the first step of BACON in the r6pack starting sets.
This seems to use a Mahalanobis distance based on the median, and the Euclidean distance.
We apply it to robustly z-scored data.

This is essentially the same as the initial step in _cov_starting.

The BACON article has a version-1 base set that uses the standard maha distance with mean and Pearson covariance.
Billor, Nedret, Ali S. Hadi, and Paul F. Velleman. 2000. “BACON: Blocked Adaptive Computationally Efficient Outlier Nominators.” Computational Statistics & Data Analysis 34 (3): 279–98. https://doi.org/10.1016/S0167-9473(99)00101-2.
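The median-based first step can be sketched in a few lines of numpy. This is only an illustration of the idea (robust z-scores, then Euclidean norm, keep the h closest points), not the statsmodels or robustbase implementation; the function name is made up.

```python
import numpy as np

def bacon_initial_subset(x, h):
    """Initial h-subset in the spirit of BACON's first step (version 2):
    Euclidean distances from the coordinatewise median on robustly
    z-scored data. Numpy-only sketch, not the statsmodels code."""
    x = np.asarray(x)
    med = np.median(x, axis=0)
    # MAD, scaled for consistency at the normal distribution
    mad = np.median(np.abs(x - med), axis=0) / 0.6745
    z = (x - med) / mad
    # norm of robust z-scores, i.e. maha distance from the median
    # with a diagonal robust scatter matrix
    d = np.sqrt((z ** 2).sum(axis=1))
    return np.argsort(d)[:h]
```

The subset then seeds the concentration steps like the other deterministic starts.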

aside:
r6pack includes some kind of reweighting step (that I'm not able to replicate).
I don't see a reference for it, and I have not seen the reweighting step in the original DetXxx articles.

In my current version my 6-pack (or 4-pack right now) differs from the R version in around 4 to 7 indices, but r6pack also excludes all influential points (1:14), including the good influential points.

I'm giving up trying to match R Det starting sets exactly. Plus I will keep some of the extra starting sets in _cov_starting.

@josef-pkt
Member Author

more local search:

in the hbk example, include_endog in the starting maha distances improves the estimate; the M-scale is lower than with exog only.

mod_smm = RLMDetSMM(dta_hbk["Y"], exog_df, include_endog=True)
res_smm = mod_smm.fit(h=40)

By default we could use both types of starting sets, exog only and [endog, exog]. Then I need a three-way option for the start data: "exog", "endog", "both".
This will increase the chance of finding an (almost) global optimum and still be much cheaper than FastXxx.
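Selecting the start data could look roughly like the sketch below. The function name and the `start_data` keyword are hypothetical, not the PR's API; it only illustrates the three-way option.

```python
import numpy as np

def build_start_data(endog, exog, start_data="exog"):
    """Data matrix used for starting-set maha distances.

    start_data is a hypothetical three-way option:
    "exog", "endog", or "both" (i.e. [endog, exog])."""
    endog = np.asarray(endog).reshape(-1, 1)
    exog = np.asarray(exog)
    if start_data == "exog":
        return exog
    elif start_data == "endog":
        return endog
    elif start_data == "both":
        return np.column_stack([endog, exog])
    raise ValueError("start_data must be 'exog', 'endog' or 'both'")
```

Running the start computation once per option and pooling the candidate subsets would give the larger pool of starts.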

@josef-pkt
Member Author

josef-pkt commented Apr 25, 2024

Problem: What if we have only one regressor besides constant?

It breaks in my current (uncommitted) version, e.g. in the spearmanr starting set.
In general, the sixpack has several starting sets that are correlation only, i.e. the diagonal is 1. spearmanr in my scipy version raises an exception.
In other cases (I don't remember where) I had returned only the corr coefficient instead of the 2 by 2 cov matrix if we have only two variables.

Example Hertzsprung Russel diagram in notebook https://www.statsmodels.org/dev/examples/notebooks/generated/robust_models_1.html
which is also used in R robustbase lmrob docstring

This needs to be fixed for RLMDetS, and maybe special-cased or made to raise an exception in covariance.DetMCD and similar.

It looks like my current RLMDet models need at least 3 slope variables in the starting set computation.

problem with scipy, I'm using '1.7.3':

spearmanr raises an exception, or returns 1 value or a correlation matrix, depending on the number of columns

stats.spearmanr(np.random.randn(50, 3))
SpearmanrResult(correlation=array([[ 1.        , -0.11308523,  0.29392557],
       [-0.11308523,  1.        ,  0.13939976],
       [ 0.29392557,  0.13939976,  1.        ]]), pvalue=array([[0.        , 0.43426076, 0.03828331],
       [0.43426076, 0.        , 0.33429506],
       [0.03828331, 0.33429506, 0.        ]]))

stats.spearmanr(np.random.randn(50, 2))
SpearmanrResult(correlation=-0.03442977190876351, pvalue=0.8123713972775561)

stats.spearmanr(np.random.randn(50, 1))
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_2228\3897072665.py in <module>
----> 1 stats.spearmanr(np.random.randn(50, 1))

~\Downloads\WPy64-39100\python-3.9.10.amd64\lib\site-packages\scipy\stats\stats.py in spearmanr(a, b, axis, nan_policy, alternative)
   4479 
   4480     if axisout == 0:
-> 4481         if (a[:, 0][0] == a[:, 0]).all() or (a[:, 1][0] == a[:, 1]).all():
   4482             # If an input is constant, the correlation coefficient
   4483             # is not defined.

IndexError: index 1 is out of bounds for axis 1 with size 1

I guess I can fix the case k = 2, e.g. in spearmanr,
and add special code for k = 1, e.g. the h-set closest to the median, smallest abs(x - median(x)); scale does not matter.

update: I replaced scipy's spearmanr with the new stats.covariance.corr_rank, which always returns a 2-dim matrix.
The case k=2 now works.
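The idea behind an always-2-dim rank correlation can be sketched with scipy's rankdata plus numpy's corrcoef; this is only an illustration of the behavior, not the statsmodels corr_rank implementation.

```python
import numpy as np
from scipy import stats

def corr_rank(data):
    """Spearman rank correlation as an always 2-dim matrix.

    Rank each column, take the Pearson correlation of the ranks, and
    force a 2-dim result, so k=1 and k=2 return (1, 1) and (2, 2)
    matrices instead of a scalar or an exception."""
    data = np.asarray(data)
    ranks = np.apply_along_axis(stats.rankdata, 0, data)
    corr = np.corrcoef(ranks, rowvar=False)
    return np.atleast_2d(corr)
```

For k >= 3 this agrees with scipy's spearmanr correlation matrix.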

Case k=1 without special casing now fails in cov_iter starting set computation

For this case I could get two starting sets

  • smallest abs(x - median(x)) (metric trimming)
  • and equal tail trimming: take h // 2 observations from both sides of median, i.e. quantiles, adjusted for count=h.
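The two candidate h-subsets for a single variable can be sketched as index sets; the function is illustrative only (the names are made up, and ties and edge cases are ignored).

```python
import numpy as np

def starting_sets_k1(x, h):
    """Two candidate h-subsets for one variable (k=1), as index arrays:
    metric trimming around the median, and equal-tail trimming."""
    x = np.asarray(x)
    med = np.median(x)
    # metric trimming: the h observations closest to the median
    idx_metric = np.argsort(np.abs(x - med))[:h]
    # equal-tail trimming: a symmetric block of h observations
    # around the median in sort order (h // 2 from each side)
    idx_sorted = np.argsort(x)
    lo = (len(x) - h) // 2
    idx_tail = idx_sorted[lo:lo + h]
    return idx_metric, idx_tail
```

Both subsets exclude points far from the median; they differ for skewed data.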

Even in the k=1 case, i.e. only 1 exog slope variable, we could still include endog to get to k=2 in the dataset for finding starting sets.

another case: k=0, i.e. constant-only regression
What do we do?
There are no exog outliers.
Any "outlier" ranking would be arbitrary and could not depend on exog values.
But we still have multiple local optima with redescending estimators, e.g. 3 clusters of y where the middle cluster is small.
What's the S-estimator for the mean?
standard S-estimator starting with the median (and possibly quantiles) and qn_scale (or mad, scale_tau, ...)
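As a stand-in for the univariate estimator discussed above, here is a minimal redescending M-estimate of location with a median start and mad scale (the scale is held fixed rather than jointly estimated, so this is not a full S-estimator; all names are illustrative).

```python
import numpy as np

def biweight_location(x, c=4.685, maxiter=50, tol=1e-8):
    """Redescending (Tukey biweight) location M-estimate.

    Median start, scale fixed at the normalized mad; a sketch of the
    robust mean discussed for the k=0 case, not statsmodels code."""
    x = np.asarray(x)
    mu = np.median(x)
    scale = np.median(np.abs(x - mu)) / 0.6745
    for _ in range(maxiter):
        z = (x - mu) / (c * scale)
        # biweight weights, zero beyond the tuning constant
        w = np.where(np.abs(z) < 1, (1 - z ** 2) ** 2, 0.0)
        mu_new = (w * x).sum() / w.sum()
        if abs(mu_new - mu) < tol * scale:
            mu = mu_new
            break
        mu = mu_new
    return mu
```

Because the weights redescend to zero, a cluster of gross outliers gets weight zero and does not pull the estimate.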

Possible problems with 3 or more clusters:
What if the median is not in the main cluster?
E.g. two outer x and/or y clusters with the "outlier" cluster in the middle.

That's not really the use case for "robust" (including "resistant"); we assume that our model is for the central data.
Identifying and estimating several clusters is a different issue, e.g. mixture models, cluster analysis, ...,
e.g. if no cluster has 50% of the observations.

@josef-pkt
Member Author

josef-pkt commented Apr 25, 2024

(faking MM by adding two random exog) now replaced

This includes MM results of the model with 1 slope parameter, but including endog in the start maha computation

(figure omitted)

@josef-pkt
Member Author

problem for citing RLMDetS and RLMDetMM

I don't find a reference for it. All the printed articles that I used are for multivariate location and scatter, and I don't find an article directly for the regression version.
DetR package also does not have any citation directly related to deterministic starts in regression.
There is a conference abstract that includes DetLTS, but I don't find an article for it. https://ww2.amstat.org/meetings/JSM/2010/onlineprogram/AbstractDetails.cfm?abstractid=307805

There is an article (*) for multivariate regression based on MCD scatter matrices in the linear moment conditions.
(Article is older than DetMCD).

Maybe I saw detxxx regression in comments in articles, but I need to go through all the articles again.

(*) Rousseeuw, Peter J., Stefan Van Aelst, Katrien Van Driessen, and Jose Agulló. 2004. “Robust Multivariate Regression.” Technometrics 46 (3): 293–305.

@josef-pkt
Member Author

josef-pkt commented May 1, 2024

notebook draft for S- and MM-regression, based on current (uncommitted) code
https://gist.github.com/josef-pkt/dd18a8ec5e968a3e4c426c98a8aa43d0

strange, the "raw" cell with the results from R does not show up in the gist
fixed now by converting raw to markdown cell

@josef-pkt
Member Author

I still have problems with norm class versus norm instance as argument to function and model __init__.
continuation of #9186 (comment)

In-place modification of a user-provided norm instance to adjust the tuning parameter seems much too fragile.
But if I don't have an instance as an argument, then I need the extra keyword parameters.

current problem: what should the norm be in RLMDetS, string, class or instance?

    def __init__(self, endog, exog, norm=None, breakdown_point=0.5,
                 col_indices=None, include_endog=False):

If string or class, then I need kwds_norm
If instance, then I need to set the tuning parameter, if breakdown_point is not None.

related problem: RLMDetSMM needs two instances of the norm with different tuning parameters. So, I need to be able to create a new instance with the same kwd arguments but different tuning.

    def __init__(self, endog, exog, norm=None, efficiency=0.95, 
                 breakdown_point=0.5, col_indices=None, include_endog=False):

One possibility is to add a clone or copy method to create new instances that can be modified.
This would add even more code just to avoid kwds_norm.
But there is also still the problem that tuning parameters have inconsistent naming across norms, i.e. we don't know generically the keyword name(s) for the norm parameters. :(

solution for now:
I will add an inplace option (returning a modified clone when inplace=False) to _set_tuning_param. This method has all the norm-specific information, so we don't need a generic copy/clone:
def _set_tuning_param(self, c, inplace=False)
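The clone-versus-inplace pattern can be sketched with a toy norm class. MyNorm is a stand-in, not a statsmodels class; the real norms store their tuning constants under norm-specific attribute names, which is exactly why the method, not a generic copy, does the work.

```python
import copy

class MyNorm:
    """Toy norm illustrating _set_tuning_param(c, inplace=False)."""

    def __init__(self, c=1.547):
        self.c = c

    def _set_tuning_param(self, c, inplace=False):
        if inplace:
            # modify this instance (current backwards-compatible default)
            self.c = c
            return self
        # clone: shallow-copy keeps other attributes, only tuning changes
        new = copy.copy(self)
        new.c = c
        return new
```

Usage for the two tunings needed by an SMM-style estimator, without touching the user's instance:

```python
norm_s = MyNorm(1.547)                      # high breakdown point
norm_mm = norm_s._set_tuning_param(4.685)   # clone with high efficiency
```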

I'm currently only working with TukeyBiweight for DetS and DetMM. Other redescending norms can be added once the design has mostly settled.

@josef-pkt
Member Author

oops, unit tests for tools fail.

inplace=False as default is backwards incompatible with robust.tools.
So for now I will use the default inplace=True (amending the last commit and force pushing).
However, this should change back to False once all norm classes have the keyword and can do inplace=False.

@josef-pkt
Member Author

some notes on CovM, similar to an enhanced RLM (with M-scale; used as the delegation target for the S-estimator)

  • weight and scale options, similar to M-estimators that I have already in functions, however several versions how mean and scatter norms or weights are linked
    • for S-estimator, common norm: weight function for mean and shape and norm for scale (what I have in draft for usage by CovDetS)
    • M-estimator with common norm, objective function, e.g. Kent, Tyler 1991, uniqueness for weakly redescending norms, t(df>=1) has unique scatter
    • M-estimator of shape/scatter: scale not estimated, e.g. fixed to 1
    • shape estimation with "ad-hoc" scaling, e.g. median maha
  • general M-estimation: norm for mean and scatter/scale can differ.

I might need a method option, method="S", "M", "MM", with maybe an extra option for the scaling of the scatter matrix if it is not implied by the method.

Problem: I still don't know how to compute the efficiency in the multivariate case to get the tuning parameter for a desired efficiency.
Also, which efficiency: mean or shape, or even cov (which also depends on the scale estimator)?

Given this CovM, CovS only needs to handle starting sets, and CovMM is just a call to CovM with fixed scale,
the same pattern as for RLM, RLMDetS and RLMDetMM.
minimal code, but not tuned for computational efficiency and speed.

@@ -1,6 +1,6 @@
import numpy as np

# TODO: add plots to weighting functions for online docs.
from . import tools as rtools

Code scanning / CodeQL note: Import of module statsmodels.robust.tools begins an import cycle.
def tuning_s_cov(norm, k_vars, breakdown_point=0.5, limits=()):
"""Tuning parameter for multivariate S-estimator given breakdown point.
"""
from .norms import TukeyBiweight # avoid circular import

Code scanning / CodeQL note: Import of module statsmodels.robust.norms begins an import cycle.
@@ -28,10 +29,16 @@
import numpy as np
from scipy import stats, linalg
from scipy.linalg.lapack import dtrtri
from .scale import mad
from .scale import mad, qn_scale, _scale_iter

Code scanning / CodeQL note: Imports of 'qn_scale' and '_scale_iter' are not used.
@josef-pkt
Member Author

aside: rrcov CovSest and CovMMest only allow the biweight and Rocke norms.
The CovMMest docstring does not mention which norm, but I guess it is implied by the Sest options.

currently I'm only working with biweight, but intend to add norm options.

@josef-pkt
Member Author

oops: Tyler constrained M-estimation requires scale > scale_S (not scale < scale_S, as I initially thought and used)

rho is an increasing function:
increasing the scale lowers rho(u / s), so a lower rho value satisfies the inequality constraint rho(u / s) <= s_0

However, the weights are a decreasing function w(|u| / s):
increasing the scale increases the weights and becomes more inclusive;
lowering the scale lowers the weights and becomes more restrictive
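The monotonicity argument can be checked numerically with a self-contained Tukey biweight rho and weight function (written out here rather than using the statsmodels norm classes):

```python
import numpy as np

def rho_biweight(z, c=1.547):
    # Tukey biweight rho, bounded above by c**2 / 6
    z = np.abs(z)
    return np.where(z <= c,
                    c**2 / 6 * (1 - (1 - (z / c) ** 2) ** 3),
                    c**2 / 6)

def w_biweight(z, c=1.547):
    # weight function w(z) = psi(z) / z, decreasing in |z|
    z = np.abs(z)
    return np.where(z <= c, (1 - (z / c) ** 2) ** 2, 0.0)

u = 1.0
scales = np.array([0.5, 1.0, 2.0, 4.0])
rhos = rho_biweight(u / scales)
weights = w_biweight(u / scales)
# increasing the scale lowers rho(u / s) ...
assert np.all(np.diff(rhos) <= 0)
# ... and raises the weights, i.e. becomes more inclusive
assert np.all(np.diff(weights) >= 0)
```

With u fixed, growing s shrinks |u / s|, which moves rho down and the weight up, matching the verbal argument.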

However, CM has a different objective from my MM scale adjustment.
CM wants to increase efficiency, i.e. reduce weights and include more points.
MM with a lower scale wants to find a better local solution without giving up the constraint that we do not increase the scale beyond the fixed scale_0 of MM, which would have guaranteed the breakdown point.

maybe: to make my MM scaling theoretically clean, I could just re-estimate the MM-estimate with a new S-estimator start given by the MM parameter estimates. An extra detour step as a guarantee to get the theoretical assertion on maintaining the breakdown point.

(all because I don't know what the breakdown point is if rho_mean and rho_scale have different breakdown points. Huber has an article but not directly with the result, and/or I don't understand enough to figure it out.)

@josef-pkt
Member Author

Kudraszow, Nadia L., and Ricardo A. Maronna. 2011. “Estimates of MM Type for the Multivariate Linear Model.” Journal of Multivariate Analysis 102 (9): 1280–92. https://doi.org/10.1016/j.jmva.2011.04.011.

has in table 1 the tuning parameter for a given breakdown point, and
in table 2 the tuning parameter for a given efficiency of the mean parameter estimates.
The formula for the ARE of the mean is in eq. 6.6, which has 3 expectations, i.e. 3 integrals.

table 1 is the same as I have for bp=0.5 and k=3;
however, for 95% efficiency at k=3, KM have c=5.48 in table 2 while rrcov CovMMest uses 5.49024917207033

CovMMest has c = 6.09626 if 95% efficiency for shape is requested (default)

(Maronna uses rho normalized to max rho = 1. That does not affect the rho function, but it's in terms of u/c.)

@josef-pkt
Member Author

josef-pkt commented May 11, 2024

CovMCD is unfinished and not working correctly yet (it does not replicate R); I got distracted by CovM, CovS, CovMM.
The CovMCD methods are just copy-pasted functions from my experimental code.

The methods take data; data is not in __init__. We need that at least for the helper methods, to compute the iteration for different starts and subsets of the data.
The fit method could use the data provided by __init__.
This way CovMCD would be like a standard model and like CovM, and we hide the computation with multiple starts in private methods.
In contrast, RLMDetS just delegates the computation to RLM and creates a new RLM instance for every start. For MCD we don't have a different model class to delegate to.
CovDetS will be able to delegate to CovM.

update
stupid mistake: the call to the maha function in CovDetMCD is wrong. The mahalanobis function does not take the mean as an argument:
the call is d = mahalanobis(x - mean, cov) instead of
d = mahalanobis(x, mean, cov)

better to change the signature of the maha function; I make the same mistake in interactive work too.
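A signature with an explicit mean argument could look like the sketch below; this is only an illustration of the proposed change, not the statsmodels helper (which expects already-centered data).

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Squared Mahalanobis distances with an explicit mean argument.

    Sketch of the signature change discussed above; centering happens
    inside, so x is passed uncentered."""
    xc = np.atleast_2d(x) - mean
    cov_inv = np.linalg.inv(cov)
    # row-wise quadratic form xc @ cov_inv @ xc.T, diagonal only
    return (xc @ cov_inv * xc).sum(axis=1)
```

With an identity cov this reduces to squared Euclidean distances from the mean, which makes the centering convention easy to verify interactively.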

My current uncommitted CovDetMCD now produces the same results as rrcov
CovMcd(x = hbk, raw.only = TRUE, nsamp = "deterministic", use.correction=FALSE)


if ((cov - cov_new)**2).mean() < tol:
    cov = cov_new
    converged = True

Code scanning / CodeQL note: Variable converged is not used.
"""

x = self.data
nobs, k_vars = x.shape

Check warning

Code scanning / CodeQL

Variable defined multiple times Warning

This assignment to 'nobs' is unnecessary as it is
redefined
before this value is used.
"""

x = self.data
nobs, k_vars = x.shape

Check warning

Code scanning / CodeQL

Variable defined multiple times Warning

This assignment to 'k_vars' is unnecessary as it is
redefined
before this value is used.
@josef-pkt
Member Author

a bit strange:
the default maxiter in the refinement step for CovDetMCD was hardcoded at 100.
If I fix it and reduce maxiter_step to the recommended 2, then a unit test fails.
It looks like my current implementation requires a large maxiter for every starting set.

one possible reason: currently I only iterate the single best start to large maxiter, while the DetMCD article uses the 2 best starts.
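The two-stage refinement being discussed can be sketched as follows. This is a simplified illustration with made-up names (and it keeps only the single best start, as in the current implementation, rather than the 2 best the DetMCD article recommends):

```python
import numpy as np

def c_step(x, idx, h):
    """One MCD concentration step: fit mean/cov on the subset,
    then keep the h observations with smallest maha distance."""
    sub = x[idx]
    mean = sub.mean(0)
    cov = np.cov(sub, rowvar=False)
    xc = x - mean
    d = (xc @ np.linalg.inv(cov) * xc).sum(1)
    return np.argsort(d)[:h]

def refine_starts(x, starts, h, maxiter_step=2, maxiter=100):
    """Short-refine every start, fully iterate only the best one."""
    refined = []
    for idx in starts:
        for _ in range(maxiter_step):
            idx = c_step(x, idx, h)
        det = np.linalg.det(np.cov(x[idx], rowvar=False))
        refined.append((det, idx))
    # best start = smallest covariance determinant after short refinement
    _, idx = min(refined, key=lambda t: t[0])
    for _ in range(maxiter):
        idx_new = c_step(x, idx, h)
        if np.array_equal(np.sort(idx_new), np.sort(idx)):
            break  # subset unchanged: converged
        idx = idx_new
    return idx
```

If maxiter_step=2 is not enough in practice, that suggests the short refinement is not separating the starts well, which fits the observation above.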
