
Feature request for sequential feature selector: random subsets at each step #47

Open
sergeyf opened this issue Apr 23, 2016 · 2 comments · May be fixed by #650

Comments

sergeyf commented Apr 23, 2016

Hello,

First of all: thanks for the great package! I have gotten a lot of good use out of it, especially the sequential feature selection.

SFS becomes problematic as the number of features d increases, since the complexity grows as O(d^2). I have found that one way to deal with this is to evaluate a random subset of the remaining features at each step instead of trying all of them. If the random subset has size k, the complexity drops to O(dk).

Take an example of sequential forward selection with d=1000 and k=25.

During the first step, we can either try all 1000 univariate models or pick a random subset of 25 univariate models, and then take the best of them. It makes sense to try them all so as to start with a good baseline.

During the second step, instead of trying 999 bivariate models, we try only 25 of them.

Then 25 instead of 998 trivariate models. And so on until we have 25 left, at which point we revert to trying them all.
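
Here is a rough sketch of what I have in mind (a hypothetical helper, not mlxtend's actual API; it assumes a scikit-learn-compatible estimator and uses cross-validation as the selection criterion):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score


def random_subset_sfs(estimator, X, y, n_features, k=25, cv=5, random_state=None):
    """Greedy forward selection that scores only a random subset of k
    candidate features at each step (hypothetical sketch, not mlxtend code)."""
    rng = np.random.RandomState(random_state)
    remaining = list(range(X.shape[1]))
    selected = []
    while len(selected) < n_features:
        if not selected or len(remaining) <= k:
            # First step and final steps: evaluate every remaining feature.
            candidates = remaining
        else:
            # Otherwise, evaluate only a random subset of k candidates.
            candidates = list(rng.choice(remaining, size=k, replace=False))
        best_score, best_feat = -np.inf, None
        for feat in candidates:
            cols = selected + [int(feat)]
            score = cross_val_score(clone(estimator), X[:, cols], y, cv=cv).mean()
            if score > best_score:
                best_score, best_feat = score, int(feat)
        selected.append(best_feat)
        remaining.remove(best_feat)
    return selected
```

With d=1000, k=25, and n_features selected, this evaluates roughly 1000 + 25·(n_features − 1) models instead of the ~d·n_features of plain SFS.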

If you're interested in some empirical results, I wrote a blog post about this a while back: http://blog.explainmydata.com/2012/07/speeding-up-greedy-feature-selection.html

This would be a great feature to have!

rasbt commented Jul 4, 2016

Hi, Sergey,
thanks for the suggestion, and I totally agree with you here: SFS is darn expensive (O(n^2)) when it comes to high-dimensional datasets. The random subsampling sounds like a neat idea, and it shouldn't be too hard to add it to the existing implementation. I am a bit busy at the moment, but I will add it to my todo list for when I am back from SciPy :)!

Btw. it reminds me a bit of Bergstra & Bengio's paper on random hyperparameter search (Random Search for Hyper-Parameter Optimization; http://www.jmlr.org/papers/v13/bergstra12a.html), i.e., that with 60 random draws, at least one lands in the top-5% interval around the true maximum with 95% probability. In other words, if each draw has a 5% chance to fall in that interval, we need
1 - (1 - 0.05)^n > 0.95, which holds for n >= 60.
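
As a quick sanity check of that back-of-the-envelope bound (my own arithmetic, not from the paper):

```python
# Smallest n such that at least one of n independent draws lands in the
# top-5% interval with probability >= 0.95, i.e. 1 - (1 - 0.05)**n >= 0.95.
import math

n_min = math.ceil(math.log(1 - 0.95) / math.log(1 - 0.05))
print(n_min)           # 59, so n >= 60 comfortably satisfies the bound
print(1 - 0.95 ** 60)  # ~0.954
```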

@rasbt rasbt added the easy label Jul 4, 2016

sergeyf commented Jul 4, 2016

Sounds great! I've been using this in my own mini-implementation of SFS and it seems to work well.

@rasbt rasbt added the stat479 label Sep 27, 2019
@xliu833 xliu833 linked a pull request Dec 22, 2019 that will close this issue
5 tasks
@rasbt rasbt removed the stat479 label Jul 30, 2020