
AutoML: use separate CV split for ensembling #1746

Closed
angela97lin opened this issue Jan 26, 2021 · 4 comments · Fixed by #1814
Labels
enhancement An improvement to an existing feature. performance Issues tracking performance improvements.

Comments


angela97lin commented Jan 26, 2021

#1732 updates our stacked ensembling pipeline to use the same data splitter that is used in AutoML. However, @rpeck noted that we may not want to do this. We continued with #1732 because we believed it was still an improvement over our current approach (the scikit-learn default).

This issue tracks long-term updates we may want to make to our data splitter for stacking in AutoML.


Update: while continuing to work on #1732, we ran into a conundrum in the interaction between stacking and AutoML that made us revisit whether it was really a good idea to use the same data split for the stacking ensemble as we use in AutoML. We decided it was not: as Raymond had pointed out, we probably want a separate CV split for our ensemble. (@dsherry also mentioned a good nugget of info: cross-validating the ensemble keeps it from putting excessive weight on the more complex models--please correct me if I've paraphrased incorrectly 😂.)

Rather than continuing that work, we should use this issue to discuss updating AutoML for stacking. Specifically, we should create a separate CV split for the stacked ensembling, similar to what we currently have in place for binary-threshold tuning.
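For illustration only, the separate split could be carved off up front, much like the binary-threshold-tuning holdout. The variable names and the 20% fraction below are hypothetical, not evalml's actual API:

```python
# Hypothetical sketch: reserve part of the training data for the stacked
# ensemble before the main AutoML search runs.
from sklearn.model_selection import train_test_split

X_search, X_ensemble, y_search, y_ensemble = train_test_split(
    X, y, test_size=0.2, random_state=0  # X, y: the data handed to AutoML
)

# X_search / y_search feed the normal AutoML CV loop;
# X_ensemble / y_ensemble are held back for cross-validating the stacked ensemble.
```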


dsherry commented Jan 27, 2021

Plan: if stacking is enabled, we'll create a separate split which can be passed to stacked ensembling for CV.

It could be neat to look into using the out-of-sample predictions (the validation splits from the original CV) as the data passed to stacking. However, I suggest we start with the simpler approach of just creating a separate split when stacking is enabled.
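A minimal sketch of what the out-of-sample-predictions idea could look like with plain scikit-learn pieces (illustrative only; `X` and `y` are assumed to be a binary classification dataset):

```python
# Build the meta-learner's training data from out-of-fold predictions of the
# base models, then refit the base models on all of the data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

base_models = [RandomForestClassifier(random_state=0), LogisticRegression(max_iter=1000)]

# One column per base model: its out-of-fold positive-class probability for every row of X.
oof_preds = np.column_stack(
    [cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1] for m in base_models]
)

meta_learner = LogisticRegression()
meta_learner.fit(oof_preds, y)  # the meta-learner only ever sees out-of-fold predictions

for m in base_models:
    m.fit(X, y)  # base models are retrained on the full training data
```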

RE our discussion, some supporting evidence for why we should withhold a separate split which stacked ensembling can use to perform CV:

  • "By using the cross-validated [predictions,] stacking avoids giving unfairly high weight to models with higher complexity." AKA overfitting
  • "The most common approach to preparing the training dataset for the meta-model is via k-fold cross-validation of the base models, where the out-of-fold predictions are used as the basis for the training dataset for the meta-model. The training data for the meta-model may also include the inputs to the base models, e.g. input elements of the training data. This can provide an additional context to the meta-model as to how to best combine the predictions from the meta-model. Once the training dataset is prepared for the meta-model, the meta-model can be trained in isolation on this dataset, and the base-models can be trained on the entire original training dataset." -- blog post
  • "It is important that the meta-learner is trained on a separate dataset to the examples used to train the level 0 models to avoid overfitting." -- another blog post
  • Original paper abstract which discusses how stacked ensembling can be viewed as a generalization of cross-validation
  • I also found this to be a good read.

@rpeck FYI, after some spinning we are following your suggestion 😆

@angela97lin angela97lin changed the title CV split for stacking ensemble in AutoML Create separate CV split for stacking ensemble in AutoML Jan 27, 2021
@dsherry dsherry changed the title Create separate CV split for stacking ensemble in AutoML AutoML: use separate CV split for ensembling Jan 28, 2021
@dsherry dsherry added the performance Issues tracking performance improvements. label Jan 28, 2021
@dsherry dsherry added this to the Sprint 2021 Feb A milestone Jan 28, 2021
bchen1116 commented

@dsherry @rpeck @angela97lin I started looking at this issue, but sklearn's StackingClassifier and StackingRegressor classes already use internal cross-validation when training the model in order to prevent overfitting. That appears to be the same problem we are trying to solve with this issue, so it may already be resolved. I don't think we'll need a separate CV split for training/validating the stacked ensembling methods, but what do you all think?

(screenshot: scikit-learn documentation describing stacking's internal cross-validation)
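For reference, here's a minimal usage sketch of scikit-learn's stacking API and its `cv` argument (again assuming a binary classification `X`, `y`); the final estimator is fit on out-of-fold predictions of the base estimators:

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # internal cross-validation: the final estimator is trained on
           # the base estimators' out-of-fold predictions
)
stack.fit(X, y)
```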

bchen1116 commented

After discussion with @dsherry, here's the idea we want to proceed with:


dsherry commented Feb 25, 2021

Plan discussed with @bchen1116. This issue tracks:

  • Create a separate split for training the metalearner for ensemble pipelines
  • Continue to use the sklearn implementation for stacked ensembling

Separate performance enhancement: better support for small data. Don't create a separate split for ensembling. Instead, use out-of-sample pipeline predictions from the normal CV (from across all the CV folds) to train the ensembler. #1898

Another separate performance enhancement: train the pipelines and the metalearner on different data. #1897
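Putting the plan's two bullets together, a rough sketch (hypothetical names: ensemble_pipelines, X_ensemble, and y_ensemble are placeholders, not evalml's actual API):

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

# ensemble_pipelines: hypothetical list of (name, estimator) pairs produced by
# the AutoML search, to be combined in the ensemble.
stack = StackingClassifier(
    estimators=ensemble_pipelines,
    final_estimator=LogisticRegression(),  # the metalearner
    cv=3,
)

# X_ensemble / y_ensemble: the separate split withheld for ensembling; sklearn's
# internal CV runs on this split when training the metalearner.
stack.fit(X_ensemble, y_ensemble)
```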
