HistGradientBoosting avoid data shuffling when early_stopping activated #25460

Open

aldder opened this issue Jan 23, 2023 · 7 comments

Labels
module:ensemble, Needs Decision - Include Feature

Comments


aldder commented Jan 23, 2023

Hello, it would be useful if HistGradientBoostingRegressor and HistGradientBoostingClassifier had the ability to avoid data shuffling when using the early_stopping and validation_fraction parameters, since maintaining the data order is a basic requirement when working with time series.

https://github.com/scikit-learn/scikit-learn/blob/98cf537f5/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py#L427

            if sample_weight is None:
                X_train, X_val, y_train, y_val = train_test_split(
                    X,
                    y,
                    test_size=self.validation_fraction,
                    stratify=stratify,
                    random_state=self._random_seed,
                )
                sample_weight_train = sample_weight_val = None
            else:
                # TODO: incorporate sample_weight in sampling here, as well as
                # stratify
                (
                    X_train,
                    X_val,
                    y_train,
                    y_val,
                    sample_weight_train,
                    sample_weight_val,
                ) = train_test_split(
                    X,
                    y,
                    sample_weight,
                    test_size=self.validation_fraction,
                    stratify=stratify,
                    random_state=self._random_seed,
                )

Describe your proposed solution

It would be sufficient to add an additional parameter that controls whether or not to shuffle the data; a sketch is given below.
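A minimal sketch of what this could look like from the user side, assuming a hypothetical shuffle parameter on the estimator (it does not exist in scikit-learn at the time of writing); with shuffle=False the last validation_fraction of the rows would be held out in their original order:

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor

    # Toy time-ordered data: the row order carries the temporal structure.
    rng = np.random.RandomState(0)
    X = np.arange(1_000, dtype=np.float64).reshape(-1, 1)
    y = np.sin(X.ravel() / 50) + rng.normal(scale=0.1, size=1_000)

    # `shuffle` is the hypothetical new parameter; everything else exists today.
    model = HistGradientBoostingRegressor(
        early_stopping=True,
        validation_fraction=0.1,  # with shuffle=False this would be the last 10% of rows
        shuffle=False,            # hypothetical: keep row order for the internal split
        random_state=0,
    )
    model.fit(X, y)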

Describe alternatives you've considered, if relevant

No response

Additional context

No response

aldder added the Needs Triage and New Feature labels on Jan 23, 2023
@glemaitre (Member)

Indeed, it could even be considered a methodological bug, since the obtained model will not be viable. I am not sure what the best approach is to solve the issue. Adding a shuffle option is one, but I recall that we thought about early stopping when playing with callbacks (ping @jeremiedbb). Through the callbacks, we could provide a given train-test split designed for the application at hand.

glemaitre added the Bug label and removed the New Feature and Needs Triage labels on Jan 23, 2023

shamzos commented Jan 24, 2023

Can I work on this one?

@glemaitre (Member)

@shamzos As said in my previous message, it is not clear to me what the way forward is. I would wait for comments from @ogrisel and @jeremiedbb.

ogrisel (Member) commented Jan 24, 2023

The callback API discussed by @glemaitre is being drafted in #22000. This work is paused because it is indeed quite complex to get the API right: it must allow enough flexibility in the early-stopping data splits while still being intuitive to use, in particular with nested cross-validation for model selection and evaluation.

@lorentzenchr (Member)

Why not just add the parameters shuffle and stratify to the estimators and pass them on to train_test_split?
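For reference, train_test_split itself already supports shuffle: with shuffle=False it simply takes the last test_size fraction of the rows in order (and stratify must then be None), so forwarding such parameters would cover the time-series case. A small self-contained illustration:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(10).reshape(-1, 1)
    y = np.arange(10)

    # shuffle=False keeps the original order: the validation set is the tail.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, shuffle=False  # stratify must stay None when shuffle=False
    )
    print(y_val)  # [8 9]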

@ArturoAmorQ (Member)

I will submit a PR to introduce a shuffle parameter. In that case we can choose to automatically set stratify=None when shuffle=False.

@lorentzenchr (Member)

There is a related discussion in #18748, where the conclusion is to add X_val and y_val as arguments of fit. This avoids adding all the splitter options to HGBT, where they don't belong.
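A rough sketch of that alternative, assuming hypothetical X_val / y_val parameters of fit as discussed in #18748 (not implemented at the time of writing); the user builds the validation split themselves, which trivially preserves time order:

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor

    # User-controlled split: hold out the last block of a time-ordered dataset.
    X = np.arange(1_000, dtype=np.float64).reshape(-1, 1)
    y = np.sin(X.ravel() / 50)
    X_train, X_val = X[:800], X[800:]
    y_train, y_val = y[:800], y[800:]

    model = HistGradientBoostingRegressor(early_stopping=True)
    # Hypothetical signature; fit does not accept these arguments today.
    model.fit(X_train, y_train, X_val=X_val, y_val=y_val)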
