
Feature Requests #54

Open
azev77 opened this issue Mar 2, 2020 · 3 comments
azev77 commented Mar 2, 2020

@tlienart

  1. All Subset (Best K Subset):
    I've got parsimonious code for All Subset Regression that might have a nice home in this package, and I'd like to share it.
    While it's much faster than R's leaps package, I would include a strong warning to users:
    this optimization problem is decidedly non-convex (it's combinatorial), so avoid it for p > 20.

  2. Approximate All Subset (MIO):
    As you probably know,
    Bertsimas et al. (2016) show how to use mixed-integer optimization (MIO) to approximately train all-subset regression for p = 1000 in minutes,
    and Bertsimas et al. (2019) do this for p = 100k in minutes.
    The corresponding Julia package is SubsetSelection.jl.
    Also see the forthcoming Julia package MIP-Boost.
    It would be really awesome if these linear models were featured in MLJ.

  3. Relaxed Lasso:
    Hastie et al. (2017) respond to the Bertsimas et al. work, finding that the relaxed Lasso matches the Lasso in low-SNR scenarios and beats All Subset in high-SNR scenarios.
    glmnet now features the relaxed Lasso.
    Would this be difficult to add to MLJLinearModels? (Maybe we can work on it together? See the relaxed-Lasso sketch after this list.)

  4. Linear Interactions:
    There are some cool new packages for interactions in linear models.
    I find their out-of-sample fit is competitive with non-linear models (xgboost etc., no joke).
    R: sprintr & hierNet
    Python: HierScale

  5. Nonconvex penalties (MCP, SCAD):
    https://github.com/joshday/SparseRegression.jl/blob/master/README.md

  6. Forward selection/stepwise regression:
    (There is Julia code on Stack Overflow, as well as textbook code; see the forward-selection sketch after this list.)

  7. Ensembles of Penalized Linear Models:
    Christidis et al. (2019): paper; old code in EnsembleEN, new code in SplitReg.
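
Re item 3: to make the relaxed-Lasso idea concrete, here is a minimal sketch of the relaxation step. It assumes you already have a lasso solution β̂ from somewhere (e.g. MLJLinearModels); the function name and signature are hypothetical, purely for illustration.

using LinearAlgebra

# Relaxed-Lasso blend (hypothetical helper, not an existing API):
# refit OLS on the lasso's active set A = {j : β̂ⱼ ≠ 0}, then blend
# β(γ) = γ·β̂_lasso + (1-γ)·β̂_LS, with γ ∈ [0, 1] a second hyperparameter;
# γ = 1 recovers the lasso fit, γ = 0 the unpenalized refit on A.
function relax_lasso(β̂::AbstractVector, X::AbstractMatrix, y::AbstractVector, γ::Real)
    A = findall(!iszero, β̂)      # active set selected by the lasso
    β_ls = zeros(length(β̂))
    β_ls[A] = X[:, A] \ y        # OLS refit restricted to the active set
    return γ .* β̂ .+ (1 - γ) .* β_ls
end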
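Re item 6: a self-contained greedy forward-selection sketch in the same spirit (all names here are mine, illustration only; a real implementation would pick the stopping point by cross-validation):

# Greedy forward selection (illustrative sketch, not package code):
# at each step, add the column that most reduces the training SSE,
# stopping after kmax predictors have been selected (assumes kmax ≤ p).
function forward_select(X, y, kmax)
    p = size(X, 2)
    selected = Int[]
    for _ in 1:kmax
        best_j, best_sse = 0, Inf
        for j in setdiff(1:p, selected)
            cols = vcat(selected, j)
            b = X[:, cols] \ y
            ê = X[:, cols]*b - y
            if ê'ê < best_sse
                best_j, best_sse = j, ê'ê
            end
        end
        push!(selected, best_j)
    end
    return selected
end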


tlienart (Collaborator) commented Mar 4, 2020

It would be fantastic to have someone else helping me with this package and growing some (possibly experimental) functionality.

I'm committed to maintaining the package, but if you would like to join the team, that would be great! The things you're mentioning would definitely be very nice additions.


azev77 (Author) commented Mar 10, 2020

Here is the simplest Best Subset regression:

using Combinatorics;  # to generate all subsets
# g() trains an OLS model on every subset of columns and records its test SSE
@inline function g(X, y, train, test, n_ops, subs, res)
    @inbounds for m = 1:n_ops
        b  = X[train, subs[m]] \ y[train];    # OLS fit on the training rows
        ê  = X[test,  subs[m]]*b - y[test];   # oos residuals ê = ŷ - y
        res[m] = ê'ê;                         # test-set SSE
    end
    return res
end
# f() trains using all columns of X
@inline function f(X, y, train, test)
    b  = X[train, :] \ y[train];
    ê  = X[test,  :]*b - y[test];   # oos residuals ê = ŷ - y
    sse = ê'ê;                      # SSE
    return sse
end

# Test drive on the Boston housing data
using MLJ;
X, y = @load_boston;
Xnames = keys(X);                        # column names
X = hcat(X...);                          # convert the table to a matrix
train, test = partition(eachindex(y), .7, rng=333);
#
p = size(X, 2);                          # p is the only hyperparameter
n_ops = 2^p - 1;  res = (1e+5)*ones(n_ops);
subs = collect(powerset(1:p, 1));        # all non-empty subsets of 1:p
@time f(X, y, train, test)
@time g(X, y, train, test, n_ops, subs, res)
extrema(res)
subs[argmin(res)]                        # variables of the best linear model
Xnames[ subs[argmin(res)] ]

#using BenchmarkTools
#@benchmark f(X, y, train, test)
#@benchmark g(X, y, train, test, n_ops, subs, res)

Of course this needs to be cross-validated etc. (see the k-fold sketch below).
This is a valuable learning tool to remind students that the model with the most variables won't necessarily have the best out-of-sample predictive power...
Also, this is a helluva lot slower in R.

The other valuable lesson from this is that it's impossible to train all models.
The Boston data is a toy example with p = 12 (over 4000 models) that can be done in seconds.
But the Ames housing data, with p = 79 (2^79 models), would take lifetimes.
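
On the cross-validation point, a minimal k-fold wrapper around g() might look like the sketch below. The fold-splitting and the function name are my own, not MLJ's resampling API, which would be the proper tool here.

using Random
# k-fold CV over all subsets (sketch): accumulate each subset's
# out-of-sample SSE across folds, then pick the subset with the
# smallest average. Uses g() from above.
function cv_all_subsets(X, y, subs; k = 5, rng = MersenneTwister(333))
    n = length(y)
    idx = shuffle(rng, collect(1:n))
    cv_sse = zeros(length(subs))
    for i in 1:k
        test  = idx[i:k:n]              # every k-th shuffled index → fold i
        train = setdiff(idx, test)
        res   = zeros(length(subs))
        g(X, y, train, test, length(subs), subs, res)
        cv_sse .+= res ./ k
    end
    return subs[argmin(cv_sse)], cv_sse
end

# e.g. best_sub, scores = cv_all_subsets(X, y, subs)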

tlienart (Collaborator) commented:

Thanks a lot for this! I'm a bit swamped at the moment but will try to look into this soon. And if anything it can make a nice tutorial too!

ablaom added this to tracking/discussion/metaissues/misc in General on Jun 17, 2022