
Feature request for sequential feature selector: random subsets at each step #47

Open
sergeyf opened this issue Apr 23, 2016 · 2 comments · May be fixed by #650

Comments

sergeyf commented Apr 23, 2016

Hello,

First of all: thanks for the great package! I have gotten a lot of good use out of it, especially the sequential feature selection.

SFS becomes problematic as the number of features d increases, since the complexity grows as O(d^2). I have found that one way to deal with this is to evaluate a random subset of the remaining features at each step instead of trying all of them. If the random subset has size k, the complexity drops to O(dk).

Take an example of sequential forward selection with d=1000 and k=25.

During the first step, we can either try all 1000 univariate models or pick a random subset of 25 univariate models, and then take the best of them. It makes sense to try them all so as to start with a good baseline.

During the second step, instead of trying 999 bivariate models, we try only 25 of them.

Then 25 instead of 998 trivariate models. And so on until we have 25 left, at which point we revert to trying them all.
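
Here is a rough sketch of what I have in mind (a hypothetical helper, not mlxtend's actual API; it assumes a scikit-learn-compatible estimator and uses cross-validation as the selection criterion):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score


def random_subset_sfs(estimator, X, y, n_features, k=25, cv=5, random_state=None):
    """Greedy forward selection that scores only a random subset of k
    candidate features at each step (hypothetical sketch, not mlxtend code)."""
    rng = np.random.RandomState(random_state)
    remaining = list(range(X.shape[1]))
    selected = []
    while len(selected) < n_features:
        if not selected or len(remaining) <= k:
            # First step and final steps: evaluate every remaining feature.
            candidates = remaining
        else:
            # Otherwise, evaluate only a random subset of k candidates.
            candidates = list(rng.choice(remaining, size=k, replace=False))
        best_score, best_feat = -np.inf, None
        for feat in candidates:
            cols = selected + [int(feat)]
            score = cross_val_score(clone(estimator), X[:, cols], y, cv=cv).mean()
            if score > best_score:
                best_score, best_feat = score, int(feat)
        selected.append(best_feat)
        remaining.remove(best_feat)
    return selected
```

With d=1000, k=25, and n_features selected, this evaluates roughly 1000 + 25·(n_features − 1) models instead of the ~d·n_features of plain SFS.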

If you're interested in some empirical results, I wrote a blog post about this a while back: http://blog.explainmydata.com/2012/07/speeding-up-greedy-feature-selection.html

This would be a great feature to have!

rasbt commented Jul 4, 2016

Hi, Sergey,
thanks for the suggestion, and I totally agree with you here: SFS is darn expensive (O(n^2)) when it comes to high-dimensional datasets. The random subsampling sounds like a neat idea, and it shouldn't be too hard to add it to the existing implementation. I am a bit busy at the moment, but I will add it to my todo list for when I am back from SciPy :)!

Btw. it reminds me a bit of Bergstra & Bengio's paper on random hyperparameter search (Random Search for Hyper-Parameter Optimization; http://www.jmlr.org/papers/v13/bergstra12a.html), i.e., that with 60 random draws, at least one lands in the top-5% interval around the true maximum with 95% probability. In other words, if each draw has a 5% chance to fall in that interval, we need
1 - (1 - 0.05)^n > 0.95, which holds for n >= 60.
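
As a quick sanity check of that back-of-the-envelope bound (my own arithmetic, not from the paper):

```python
# Smallest n such that at least one of n independent draws lands in the
# top-5% interval with probability >= 0.95, i.e. 1 - (1 - 0.05)**n >= 0.95.
import math

n_min = math.ceil(math.log(1 - 0.95) / math.log(1 - 0.05))
print(n_min)           # 59, so n >= 60 comfortably satisfies the bound
print(1 - 0.95 ** 60)  # ~0.954
```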

@rasbt rasbt added the easy label Jul 4, 2016

sergeyf commented Jul 4, 2016

Sounds great! I've been using this in my own mini-implementation of SFS and it seems to work well.

@rasbt rasbt added the stat479 label Sep 27, 2019
@xliu833 xliu833 linked a pull request Dec 22, 2019 that will close this issue
5 tasks
@rasbt rasbt removed the stat479 label Jul 30, 2020