
SequentialFeatureSelection Early Stopping Criterion #886

Open · wants to merge 6 commits into master
Conversation

@aldder aldder commented Feb 2, 2022

Description

According to several published studies (e.g., https://www.researchgate.net/publication/220804144_Overfitting_in_Wrapper-Based_Feature_Subset_Selection_The_Harder_You_Try_the_Worse_it_Gets), wrapper-based feature selection methods suffer from overfitting as the number of explored states increases.
One way to reduce this overfitting is an automatic stopping criterion (early stopping, as is well known for neural networks).
In this PR I have implemented an early stopping criterion for the SequentialFeatureSelector class.

One parameter has been added during instantiation of the object:

early_stop_rounds : int (default: 0)
    Enables the early stopping criterion when > 0; this value determines
    the number of iterations after which, if no performance improvement
    has been seen, execution is stopped.
    Used only when `k_features == 'best'` or `k_features == 'parsimonious'`.

Code Example:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

np.random.seed(0)
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# add some noise in order to have features to discard
X_iris_with_noise = np.concatenate(
    (X_iris,
    np.random.randn(X_iris.shape[0], X_iris.shape[1])),
    axis=1)

knn = KNeighborsClassifier()
sfs = SFS(
    estimator=knn,
    k_features='best',
    forward=True,
    early_stop_rounds=0,
    verbose=0)

sfs.fit(X_iris_with_noise, y_iris)
plot_sfs(sfs.get_metric_dict());

[Figure 1: SFS performance plot without early stopping]

sfs = SFS(
    estimator=knn,
    k_features='best',
    forward=True,
    early_stop_rounds=2,
    verbose=0)

sfs.fit(X_iris_with_noise, y_iris)
plot_sfs(sfs.get_metric_dict());

... Performances not improved for 2 rounds. Stopping now!

[Figure 2: SFS performance plot stopping early after 2 rounds without improvement]

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

@rasbt (Owner) commented Feb 2, 2022

Thanks for the PR! I agree that overfitting can become an issue. Currently, there is the option

  • k_features="parsimonious"

which will select the smallest feature set within 1 standard error of the best feature set, which helps with this.
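The "1 standard error" rule mentioned above can be illustrated with a small sketch (illustrative only; the function name and dictionary layout are assumptions modeled on `get_metric_dict()`, not mlxtend's actual implementation):

```python
# Hypothetical sketch of the "parsimonious" rule: pick the smallest
# feature-subset size whose score is within one standard error of the
# best subset's score.

def parsimonious_k(metric_dict):
    # metric_dict maps subset size k -> {'avg_score': ..., 'std_err': ...}
    best_k = max(metric_dict, key=lambda k: metric_dict[k]['avg_score'])
    threshold = (metric_dict[best_k]['avg_score']
                 - metric_dict[best_k]['std_err'])
    # smallest k whose average score clears the threshold
    return min(k for k, m in metric_dict.items()
               if m['avg_score'] >= threshold)

scores = {
    1: {'avg_score': 0.90, 'std_err': 0.02},
    2: {'avg_score': 0.95, 'std_err': 0.02},
    3: {'avg_score': 0.96, 'std_err': 0.02},
}
print(parsimonious_k(scores))  # -> 2, since 0.95 >= 0.96 - 0.02
```

Here the 3-feature subset scores best (0.96), but the 2-feature subset is within one standard error of it, so the smaller subset wins.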

I like adding an early_stop option, but I have a few suggestions / concerns regarding the API:

1)

I think that the two parameters

early_stop and early_stop_rounds can be consolidated into a single one. E.g.,

                 if self.early_stop and k != k_to_select:
                     if k_score <= best_score:

could be

                 if self.early_stop_rounds and k != k_to_select:
                     if k_score <= best_score:

What I mean is instead of having

  • early_stop : bool (default: False)
  • early_stop_rounds : int (default 3)

this could be simplified to

  • early_stop_rounds : int (default 0)

2)

The second concern I have is that if a user selects e.g., k_features=(1, 3), early_stop_rounds=3, it's not necessarily guaranteed that there will be 1 to 3 features, which can be confusing.

I wonder if it makes sense to allow early_stopping_rounds only for k_features='best' and k_features='parsimonious', which both explore the whole feature subset size space.

E.g.,

  • if k_features='best' is selected with early_stop_rounds=0, it will evaluate and select the best feature subset in the range 1 to m.
  • if k_features='best' is selected with early_stop_rounds=2, it will evaluate and select the best feature subset in the range 1 to m but it may stop early, which means for forward selection, it may not explore the whole feature subset sizes up to m.
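The behavior in the two bullets above can be sketched as a patience counter over forward-selection rounds (a minimal illustration with assumed names, not the PR's actual code):

```python
# Sketch of the early-stopping check: after each forward-selection
# round, stop once the best score has failed to improve for
# `early_stop_rounds` consecutive rounds. With early_stop_rounds=0
# the loop always explores all subset sizes up to m.

def run_selection(round_scores, early_stop_rounds):
    best_score = float('-inf')
    rounds_without_improvement = 0
    explored = []  # subset sizes actually evaluated
    for k, score in enumerate(round_scores, start=1):
        explored.append(k)
        if score > best_score:
            best_score = score
            rounds_without_improvement = 0
        elif early_stop_rounds:
            rounds_without_improvement += 1
            if rounds_without_improvement >= early_stop_rounds:
                break  # "Performances not improved ... Stopping now!"
    return explored

# CV scores for subset sizes 1..6; no improvement after size 3,
# so with early_stop_rounds=2 exploration stops at size 5.
print(run_selection([0.80, 0.92, 0.95, 0.95, 0.94, 0.93],
                    early_stop_rounds=2))  # -> [1, 2, 3, 4, 5]
```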

If a user selects e.g., k_features=3 and early_stop_rounds=2 we could

  • a) raise an error saying f"Early stopping is not supported with fixed feature subset sizes."
  • b) raise a warning saying f"Early stopping may lead to feature subset sizes that are different from k_features={k_features}."

What are your thoughts?

@aldder (Author) commented Feb 2, 2022

Thanks for your suggestions; I agree with your points.
I will edit this PR with the following:

  • have only one parameter early_stop_rounds : int >= 0
  • enable early stopping only if early_stop_rounds > 0 and k_features in ['best', 'parsimonious']
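The two agreed-upon points could be enforced with a single validation helper (a hypothetical sketch; the function name and error wording are assumptions based on the discussion, not the PR's final code):

```python
# Sketch of the consolidated parameter validation: one integer
# parameter, early stopping honored only for 'best'/'parsimonious'.

def validate_early_stop(k_features, early_stop_rounds):
    if not isinstance(early_stop_rounds, int) or early_stop_rounds < 0:
        raise ValueError('early_stop_rounds should be an integer '
                         'greater than or equal to 0. '
                         'Got %s' % early_stop_rounds)
    if early_stop_rounds and k_features not in ('best', 'parsimonious'):
        raise ValueError('Early stopping is not supported with '
                         'fixed feature subset sizes.')

validate_early_stop('best', 2)   # OK: early stopping enabled
validate_early_stop(3, 0)        # OK: early stopping disabled
try:
    validate_early_stop(3, 2)    # fixed subset size + early stopping
except ValueError as e:
    print(e)
```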

@rasbt (Owner) commented Feb 2, 2022

Sounds great! Looking forward to it!

@pep8speaks commented Feb 2, 2022

Hello @aldder! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-02-03 10:37:22 UTC

if not isinstance(early_stop_rounds, int) or early_stop_rounds < 0:
    raise ValueError('Number of early stopping rounds should be '
                     'an integer value greater than or equal to 0. '
                     'Got %d' % early_stop_rounds)
@rasbt (Owner) commented:
I think ...Got %d' % early_stop_rounds might not work if early_stop_rounds is not an integer. Maybe it's better to replace it with

...Got %s' % early_stop_rounds

@aldder (Author) replied:

Right!

# early stop
if self.early_stop_rounds \
        and k != k_to_select \
        and self.k_features in {'best', 'parsimonious'}:
@rasbt (Owner) commented:
Instead of the check here, I would change it to raising a ValueError at the top of the function if self.k_features is not in {'best', 'parsimonious'} and self.early_stop_rounds is set. This way the user is aware, and we only have to perform the check once.

@aldder (Author) commented Feb 3, 2022:

@rasbt Do you prefer having this check on top of fit function or during initialization?

@rasbt (Owner) replied:
Ahh, totally lost track and missed your comment! Sorry! Regarding your question, I think fit might be better to keep it more consistent with scikit-learn behavior.

aldder added a commit to aldder/mlxtend that referenced this pull request Feb 4, 2022
@jimmy927 commented Oct 2, 2023

What is the status on this?

I use k_features="parsimonious" in my model, but it continues to add more and more features even after it is obvious the model will not improve, and in the end it selects one of the very early models anyway.

I think this PR could cut my runtime from 10 days to hours ;-)

@rasbt (Owner) commented Oct 14, 2023

Thanks for the ping; I need to look into this at some point -- sorry, I haven't had a chance recently due to too many other commitments. Unfortunately, I currently don't have a timeline for this.
