ENH/BUG: fixed scale in RLM, float scale_est #9211

Closed
josef-pkt opened this issue Apr 16, 2024 · 3 comments

Comments

@josef-pkt
Member

I thought we already had an option for a fixed scale in RLM, e.g. for MM-estimation with a preliminary scale estimate.
However, I cannot find a way to do this with the fit options.
I can't find an issue or a PR for it either. (So I don't know why I had the impression that it works.)

The workaround I found is to set the scale attribute directly
and use fit options so that the scale is neither estimated nor updated:

mod = RLM.from_formula("Y ~ X1 + X2 + X3", dta_hbk.loc[inliers], M=norm_s)
mod.scale = 0.7963592114
res = mod.fit(scale_est=0.7963592114, init=0.79, update_scale=False)
print(res.summary())

As verification that this works:
I get the same parameter estimates as robustbase
res_ss = lmrob.S(res_s$x, hbk[["Y"]], control = lmrob.control(nRes = 20))
if I use the fixed scale (and start_params) from R in RLM.
(However, I'm not getting the same results as the robustbase S-estimator if I simultaneously update the scale with biweight rho.)
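
To make the fixed-scale idea concrete, here is a minimal NumPy sketch of IRLS with Tukey biweight weights where the scale is supplied by the caller and never re-estimated. This is not the RLM code; `biweight_weights` and `irls_fixed_scale` are hypothetical names, and c=4.685 is just the usual biweight tuning constant, assumed here for illustration.

```python
import numpy as np


def biweight_weights(u, c=4.685):
    """Tukey biweight IRLS weights for scaled residuals u (zero beyond c)."""
    w = (1.0 - (u / c) ** 2) ** 2
    return np.where(np.abs(u) <= c, w, 0.0)


def irls_fixed_scale(y, exog, scale, start_params, maxiter=50, tol=1e-10):
    """M-estimate of the regression params with the scale held fixed.

    Hypothetical helper: `scale` is used as given in every iteration,
    which mimics fitting with update_scale=False.
    """
    params = np.asarray(start_params, dtype=float)
    for _ in range(maxiter):
        resid = y - exog @ params
        # weighted least squares step via sqrt-weights
        sw = np.sqrt(biweight_weights(resid / scale))
        new, *_ = np.linalg.lstsq(exog * sw[:, None], y * sw, rcond=None)
        if np.max(np.abs(new - params)) < tol:
            return new
        params = new
    return params


# toy data: exact line y = 1 + 2 x with one gross outlier in y
x = np.arange(10.0)
y = 1.0 + 2.0 * x
y[-1] = 100.0
exog = np.column_stack([np.ones_like(x), x])
params = irls_fixed_scale(y, exog, scale=1.0, start_params=[1.0, 2.0])
print(params)
```

With a redescending norm like the biweight, the result depends on the start_params, which is why "good" start values matter in this setting.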

I'm not sure yet which option we want to use to fix the scale.
The init docstring does not match the code: init is only used for the initial scale estimate and does not affect the start params.

        init : str
            Specifies method for the initial estimates of the parameters.
            Default is None, which means that the least squares estimate
            is used.  Currently it is the only available choice.

(context: I'm trying to see whether we can reuse RLM as S-estimator for given "good" start_params. related to issue #9171 and PR #9210)

@josef-pkt
Member Author

josef-pkt commented Apr 16, 2024

one possibility
deprecate init and replace it by init_scale
then scale_est could be ignored if update_scale=False

The alternative I thought of initially
use scale_est=myscale and then ignore init (or set to False) and automatically set update_scale to False.
or just let _estimate_scale always return the fixed scale.

Aside: Can we get rid of the self.scale attribute and make it a temp variable inside fit?
(one less "state variable")

another possibility
In my current PR, I allow scale_est to be a callable.
So we could just make a dummy function that always returns the same scale.
scale_est = lambda *args, **kwargs: s
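
A slightly more explicit version of the dummy callable, as a sketch: the thread does not say which arguments RLM passes to a callable scale_est, so the function below accepts and ignores anything. `make_fixed_scale` is a hypothetical helper name.

```python
def make_fixed_scale(s):
    """Return a callable that ignores its arguments and always returns s."""
    s = float(s)

    def fixed_scale(*args, **kwargs):
        # whatever residuals or options are passed in, the scale stays fixed
        return s

    return fixed_scale


scale_est = make_fixed_scale(0.7963592114)
print(scale_est([1.0, -2.0, 3.0]))
```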

@josef-pkt
Member Author

josef-pkt commented Apr 18, 2024

I'm adding start_scale keyword in analogy to start_params.

Note: init (and other parts) seems to have been designed following rlm in R's MASS package.
Its docstring:

init (optional) initial values for the coefficients OR a method to find initial values
OR the result of a fit with a coef component. Known methods are "ls" (the
default) for an initial least-squares fit using weights w*weights, and "lts" for
an unweighted least-trimmed squares fit with 200 samples.

What we could do is add this option to make RLM optionally a proper MM estimator with some 1st step init estimation.

However, I think this gets too messy, with many required options (as e.g. in robustbase lmrob.control), and we would have circular model calls (e.g. the S-estimator calls RLM, and the MM-estimator calls the S-estimator).

So keep RLM as is and use it as a helper model for the RobustMM, RobustS, RobustDetS, RobustDetMM, ... RobustXxx, or RLMXxx.
The names without qualifier would be for univariate endog linear model, and then add qualifiers for nonlinear and multivariate endog.

In that case we can deprecate init, and the user has enough options with start_params, start_scale, update_scale and scale_est.

One possible refactoring would be to make scale_est an RLM __init__ option instead of a fit option.
In RLM, or any robust method, the estimation of the mean parameters is always linked to the scale estimate, in contrast to OLS or GLM.

update
One possible simple option for init would be deterministic OLS based on one subset of "inliers", i.e. no search, just use one subset based on the Mahalanobis distance of x or of the combined [y, x].
(aside:
If exog includes constant, then we need to handle that in the maha distance computation.
If exog includes dummy or categorical variables, we need to check for a full rank subsample. :( or we use pinv that has params=0 for columns of zeros, but regularization of perfect collinearity with const if dummy column is all ones.
)
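
A rough NumPy sketch of that deterministic start, under several assumptions: classical mean and covariance are used for the Mahalanobis distance (a robust location/scatter could be swapped in), the distance is computed on x only, x is passed without the constant column and the constant is added for OLS afterwards, and `inlier_ols` and `frac` are hypothetical names.

```python
import numpy as np


def inlier_ols(y, x, frac=0.75):
    """OLS on the fraction of rows with the smallest Mahalanobis distance of x.

    x must not contain the constant column; it is added for the OLS step.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    diff = x - x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    # pinv instead of inv so that a singular covariance does not raise
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.pinv(cov), diff)
    h = int(np.ceil(frac * len(y)))
    keep = np.sort(np.argsort(d2)[:h])
    exog = np.column_stack([np.ones(h), x[keep]])
    # lstsq returns the minimum-norm solution if the subsample is rank deficient
    params, *_ = np.linalg.lstsq(exog, y[keep], rcond=None)
    return params, keep


# toy data: exact line y = 1 + 2 x plus one bad leverage point
x = np.append(np.arange(9.0), 50.0)
y = 1.0 + 2.0 * x
y[-1] = -30.0
start_params, keep = inlier_ols(y, x, frac=0.8)
print(start_params, keep)
```

The resulting params could then serve as start_params for a fixed-scale or S-type fit; the full-rank check for dummy columns mentioned in the aside is not handled beyond what lstsq does.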

@josef-pkt
Member Author

After the fixed scale options in #9210, we still have a problem with scale_est, which defaults to "mad" even when the scale is fixed.

cheap fix: set scale_est to "fixed" if update_scale is False and start_scale is not None.
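
The cheap fix amounts to a small piece of option resolution. A sketch of the logic, not the code in the PR; `resolve_scale_est` is a hypothetical helper and the keyword defaults are assumed for illustration:

```python
def resolve_scale_est(scale_est="mad", update_scale=True, start_scale=None):
    """Return the effective scale_est option.

    If the user supplies start_scale and turns off scale updating,
    the scale is effectively fixed, so report "fixed" instead of "mad".
    """
    if not update_scale and start_scale is not None:
        return "fixed"
    return scale_est


print(resolve_scale_est(update_scale=False, start_scale=0.7963592114))
print(resolve_scale_est(update_scale=True, start_scale=0.7963592114))
```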

alternative: try to streamline the scale options; we now have several of those: scale_est, init, update_scale and start_scale.
init can be deprecated, which leaves 3 arguments.
If update_scale is False and (new) start_scale is None, then the scale is fixed at the initial "mad" estimate.
Is this a relevant use case? Seems unlikely.

However, we might still want to be able to use start_scale with other scale_est methods, e.g. to avoid the initial mad scale,
e.g. scale_est=HuberScale or MScale, update_scale=True and start_scale=my_number.
Then we still need all 3 options.
I have not checked whether the current code actually allows us to avoid the initial mad scale.

For now, I'm adding the "cheap fix" in #9227
