HistGradientBoosting avoid data shuffling when early_stopping activated #25460

Open

aldder opened this issue Jan 23, 2023 · 7 comments

Labels
module:ensemble, Needs Decision - Include Feature

Comments


aldder commented Jan 23, 2023

Hello, it would be useful if HistGradientBoostingRegressor and HistGradientBoostingClassifier had the ability to avoid data shuffling when using the early_stopping and validation_fraction parameters, since maintaining the data order is a basic requirement when working with time series.

https://github.com/scikit-learn/scikit-learn/blob/98cf537f5/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py#L427

            if sample_weight is None:
                X_train, X_val, y_train, y_val = train_test_split(
                    X,
                    y,
                    test_size=self.validation_fraction,
                    stratify=stratify,
                    random_state=self._random_seed,
                )
                sample_weight_train = sample_weight_val = None
            else:
                # TODO: incorporate sample_weight in sampling here, as well as
                # stratify
                (
                    X_train,
                    X_val,
                    y_train,
                    y_val,
                    sample_weight_train,
                    sample_weight_val,
                ) = train_test_split(
                    X,
                    y,
                    sample_weight,
                    test_size=self.validation_fraction,
                    stratify=stratify,
                    random_state=self._random_seed,
                )

Describe your proposed solution

It would be sufficient to add an additional parameter that controls whether or not to shuffle the data; a sketch is given below.
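A minimal sketch of what this could look like from the user side, assuming a hypothetical shuffle parameter on the estimator (it does not exist in scikit-learn at the time of writing); with shuffle=False the last validation_fraction of the rows would be held out in their original order:

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor

    # Toy time-ordered data: the row order carries the temporal structure.
    rng = np.random.RandomState(0)
    X = np.arange(1_000, dtype=np.float64).reshape(-1, 1)
    y = np.sin(X.ravel() / 50) + rng.normal(scale=0.1, size=1_000)

    # `shuffle` is the hypothetical new parameter; everything else exists today.
    model = HistGradientBoostingRegressor(
        early_stopping=True,
        validation_fraction=0.1,  # with shuffle=False this would be the last 10% of rows
        shuffle=False,            # hypothetical: keep row order for the internal split
        random_state=0,
    )
    model.fit(X, y)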

Describe alternatives you've considered, if relevant

No response

Additional context

No response

aldder added the Needs Triage and New Feature labels on Jan 23, 2023
@glemaitre (Member)

Indeed, it could even be considered a methodological bug, since the obtained model will not be viable. I am not sure what the best approach is to solve the issue. Adding a shuffle option is one, but I recall that we thought about early stopping when playing with callbacks (ping @jeremiedbb). Through the callbacks, we could provide a given train-test split designed for the application at hand.

glemaitre added the Bug label and removed the New Feature and Needs Triage labels on Jan 23, 2023

shamzos commented Jan 24, 2023

Can I work on this one?

@glemaitre (Member)

@shamzos As said in my previous message, it is not clear to me what the way forward is. I would wait for comments from @ogrisel and @jeremiedbb.

ogrisel (Member) commented Jan 24, 2023

The callback API discussed by @glemaitre is being drafted in #22000. This work is paused because it is indeed quite complex to get the API right: it must allow enough flexibility in the early-stopping data splits while still being intuitive to use, in particular with nested cross-validation for model selection and evaluation.

@lorentzenchr (Member)

Why not just add the parameters shuffle and stratify to the estimators and pass them on to train_test_split?
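For reference, train_test_split itself already supports shuffle: with shuffle=False it simply takes the last test_size fraction of the rows in order (and stratify must then be None), so forwarding such parameters would cover the time-series case. A small self-contained illustration:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(10).reshape(-1, 1)
    y = np.arange(10)

    # shuffle=False keeps the original order: the validation set is the tail.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, shuffle=False  # stratify must stay None when shuffle=False
    )
    print(y_val)  # [8 9]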

@ArturoAmorQ (Member)

I will submit a PR to introduce a shuffle parameter. In that case we can choose to automatically set stratify=None when shuffle=False.

@lorentzenchr (Member)

There is a related discussion in #18748, where the conclusion is to add X_val and y_val as arguments of fit. This avoids adding all the splitter options to HGBT, where they don't belong.
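A rough sketch of that alternative, assuming hypothetical X_val / y_val parameters of fit as discussed in #18748 (not implemented at the time of writing); the user builds the validation split themselves, which trivially preserves time order:

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor

    # User-controlled split: hold out the last block of a time-ordered dataset.
    X = np.arange(1_000, dtype=np.float64).reshape(-1, 1)
    y = np.sin(X.ravel() / 50)
    X_train, X_val = X[:800], X[800:]
    y_train, y_val = y[:800], y[800:]

    model = HistGradientBoostingRegressor(early_stopping=True)
    # Hypothetical signature; fit does not accept these arguments today.
    model.fit(X_train, y_train, X_val=X_val, y_val=y_val)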
