Reconsider penalty scaling for SLOPE #11

Open

jolars opened this issue Jul 7, 2020 · 5 comments
Labels: discussion, help wanted

Comments

jolars (Owner) commented Jul 7, 2020

In SLOPE version 0.3.0 and above, the penalty in the SLOPE objective is scaled depending on the type of scaling used in the call to SLOPE(). The behavior is as follows (a small code sketch after the list illustrates the rule):

  • for scaling = "l1", no scaling is applied
  • for scaling = "l2", the penalty is scaled with sqrt(n)
  • for scaling = "sd", the penalty is scaled with n`.

There are advantages and disadvantages of doing this kind of scaling, and I think a discussion is warranted regarding what the correct behavior should be.

Pros

  • Regularization strength is independent of the number of observations, which means that the same level of regularization is applied across, for instance, differently sized resamples in cross-validation, or when a model tuned on one data set is refit on a test data set (see the sketch after this list).
  • Scaling the penalty is standard practice in many implementations of l1-regularized models, such as glmnet, ncvreg, and biglasso.
  • Having the regularization strength independent of the number of observations means that the model can still control for misspecification as n becomes large.
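
To illustrate the first point above, here is a rough, self-contained sketch showing why a fixed penalty becomes relatively weaker as n grows, whereas a penalty scaled with n keeps the penalty-to-loss ratio roughly constant. This is a toy lasso-style example with a single coefficient and alpha = lambda = 1; nothing here is taken from the package:

```r
set.seed(1)

# Ratio of the penalty to the data-fit term in a toy one-predictor problem,
# both evaluated at the true coefficient beta = 2, with alpha = lambda = 1.
penalty_to_loss <- function(n, scale_penalty = FALSE) {
  x <- rnorm(n)
  y <- 2 * x + rnorm(n)
  loss <- 0.5 * sum((y - 2 * x)^2)              # squared-error term at beta = 2
  penalty <- (if (scale_penalty) n else 1) * 2  # alpha * lambda * |beta|, optionally scaled by n
  penalty / loss
}

sapply(c(100, 10000), penalty_to_loss)                       # shrinks roughly like 1/n
sapply(c(100, 10000), penalty_to_loss, scale_penalty = TRUE) # stays roughly constant
```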

Cons

  • The fact that the penalty scaling differs depending on the type of standardization can be confusing.
  • Overfitting becomes less and less of an issue as n becomes larger, so it makes sense to decrease the regularization strength as n grows.
  • The model definition is now somewhat different from the definitions used in almost all publications, which also means that the interpretation of the alpha parameter as the variance in the orthogonal X case is lost (see the equations after this list).
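
To make the last point explicit, here is my reading of the two formulations: the first is the SLOPE objective as it usually appears in the literature, and the second inserts a sample-size factor c(n), equal to 1, sqrt(n), or n depending on the scaling argument, as described above.

```latex
% Unscaled (publication) form of the SLOPE objective:
\min_{\beta} \; \frac{1}{2} \lVert y - X\beta \rVert_2^2
  + \alpha \sum_{j=1}^{p} \lambda_j \lvert \beta \rvert_{(j)}

% Scaled form, as I understand the behavior in SLOPE >= 0.3.0,
% with c(n) \in \{1, \sqrt{n}, n\} depending on the scaling argument:
\min_{\beta} \; \frac{1}{2} \lVert y - X\beta \rVert_2^2
  + \alpha \, c(n) \sum_{j=1}^{p} \lambda_j \lvert \beta \rvert_{(j)}
```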

Possible solutions

Whichever way we go with this, I think we should keep the other option available as a toggle, i.e. add an argument along the lines of penalty_scaling to turn penalty scaling on or off, or even to provide a more fine-grained type of penalty scaling. That way, it would be possible to achieve either behavior, which means that this discussion is really about what the default should be.
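
For instance, such a call could hypothetically look like the sketch below; note that the penalty_scaling argument and its values are purely illustrative and do not exist in the current API:

```r
library(SLOPE)

x <- matrix(rnorm(100 * 10), 100, 10)
y <- rnorm(100)

# Hypothetical: decouple the penalty scaling from the scaling argument.
# Neither the argument name nor its values exist in the package today.
fit <- SLOPE(x, y, scaling = "sd", penalty_scaling = "none")
```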

Thoughts? Ideas?

References

Hastie et al. (2015) mention that scaling with n is "useful for cross-validation" and makes lambda values comparable across samples of different sizes, but otherwise do not seem to discuss it.

  • Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical learning with sparsity: The lasso and generalizations (1st ed.). Chapman and Hall/CRC.

scikit-learn has a brief article covering these things here: https://scikit-learn.org/stable/auto_examples/svm/plot_svm_scale_c.html

jolars added the help wanted and discussion labels on Jul 7, 2020
JonasWallin (Collaborator) commented:

As a default, I would use the same as glmnet?
I agree that it should definitely be an option.
Could you put in some references to what people are doing in different places?

jolars (Owner, Author) commented Jul 7, 2020

> As a default, I would use the same as glmnet?
> I agree that it should definitely be an option.
> Could you put in some references to what people are doing in different places?

I updated the post with a couple of references, but I'm having a hard time finding references on this.

JonasWallin (Collaborator) commented:

Could you start an Overleaf document for this also?
We should write down the equations so that we can have a clearer discussion about them.
Furthermore, the naming should refer to the scaling, not the loss function, in my opinion.
I.e., 'l1' should be 'none'; then, if we have an 'l1' loss implemented, we would say that the default there is 'none'.

jolars (Owner, Author) commented Jul 8, 2020

> Could you start an Overleaf document for this also?
> We should write down the equations so that we can have a clearer discussion about them.

Yes, absolutely.

> Furthermore, the naming should refer to the scaling, not the loss function, in my opinion.
> I.e., 'l1' should be 'none'; then, if we have an 'l1' loss implemented, we would say that the default there is 'none'.

I'm not exactly sure what you mean here.

JonasWallin (Collaborator) commented:

> I'm not exactly sure what you mean here.

scaling = "l1", no scaling is applied.
The scaling is should not be named after lose function so rather.
scaling = 'none'.
