
AutoML: use separate CV split for ensembling #1746

Closed
angela97lin opened this issue Jan 26, 2021 · 4 comments · Fixed by #1814
Labels
enhancement An improvement to an existing feature. performance Issues tracking performance improvements.

Comments


angela97lin commented Jan 26, 2021

#1732 updates our stacked ensembling pipeline to use the same data splitter that is used in AutoML. However, @rpeck noted that we may not want to do this. We continued with #1732 because we believed it was still an improvement over our current approach (the scikit-learn default).

This issue tracks long-term updates we may want to make to our data splitter for stacking in AutoML.


Update: while continuing to work on #1732, we ran into a conundrum in the interaction between stacking and AutoML that made us revisit whether it was really a good idea to use the same data split for the stacking ensemble as we use in AutoML. We decided it was not: as Raymond had pointed out, we probably want a separate CV split for our ensemble. (@dsherry also mentioned a good nugget of info: cross-validating the ensemble keeps it from putting excessive weight on the more complex models--please correct me if I've paraphrased incorrectly 😂.)

Rather than continuing that work, we should use this issue to discuss updating AutoML for stacking. Specifically, we should create a separate CV split for the stacked ensembling, similar to what we currently have in place for binary-threshold tuning.
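For illustration only, the separate split could be carved off up front, much like the binary-threshold-tuning holdout. The variable names and the 20% fraction below are hypothetical, not evalml's actual API:

```python
# Hypothetical sketch: reserve part of the training data for the stacked
# ensemble before the main AutoML search runs.
from sklearn.model_selection import train_test_split

X_search, X_ensemble, y_search, y_ensemble = train_test_split(
    X, y, test_size=0.2, random_state=0  # X, y: the data handed to AutoML
)

# X_search / y_search feed the normal AutoML CV loop;
# X_ensemble / y_ensemble are held back for cross-validating the stacked ensemble.
```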


dsherry commented Jan 27, 2021

Plan: if stacking is enabled, we'll create a separate split which can be passed to stacked ensembling for CV.

It could be neat to look into using the out-of-sample predictions (the validation splits from the original CV) as the data passed to stacking. However, I suggest we start with the simpler approach of just creating a separate split when stacking is enabled.
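A minimal sketch of what the out-of-sample-predictions idea could look like with plain scikit-learn pieces (illustrative only; `X` and `y` are assumed to be a binary classification dataset):

```python
# Build the meta-learner's training data from out-of-fold predictions of the
# base models, then refit the base models on all of the data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

base_models = [RandomForestClassifier(random_state=0), LogisticRegression(max_iter=1000)]

# One column per base model: its out-of-fold positive-class probability for every row of X.
oof_preds = np.column_stack(
    [cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1] for m in base_models]
)

meta_learner = LogisticRegression()
meta_learner.fit(oof_preds, y)  # the meta-learner only ever sees out-of-fold predictions

for m in base_models:
    m.fit(X, y)  # base models are retrained on the full training data
```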

RE our discussion, some supporting evidence for why we should withhold a separate split which stacked ensembling can use to perform CV:

  • "By using the cross-validated [predictions,] stacking avoids giving unfairly high weight to models with higher complexity." AKA overfitting
  • "The most common approach to preparing the training dataset for the meta-model is via k-fold cross-validation of the base models, where the out-of-fold predictions are used as the basis for the training dataset for the meta-model. The training data for the meta-model may also include the inputs to the base models, e.g. input elements of the training data. This can provide an additional context to the meta-model as to how to best combine the predictions from the meta-model. Once the training dataset is prepared for the meta-model, the meta-model can be trained in isolation on this dataset, and the base-models can be trained on the entire original training dataset." -- blog post
  • "It is important that the meta-learner is trained on a separate dataset to the examples used to train the level 0 models to avoid overfitting." -- another blog post
  • Original paper abstract which discusses how stacked ensembling can be viewed as a generalization of cross-validation
  • I also found this to be a good read.

@rpeck FYI, after some spinning we are following your suggestion 😆

@angela97lin angela97lin changed the title CV split for stacking ensemble in AutoML Create separate CV split for stacking ensemble in AutoML Jan 27, 2021
@dsherry dsherry changed the title Create separate CV split for stacking ensemble in AutoML AutoML: use separate CV split for ensembling Jan 28, 2021
@dsherry dsherry added the performance Issues tracking performance improvements. label Jan 28, 2021
@dsherry dsherry added this to the Sprint 2021 Feb A milestone Jan 28, 2021
bchen1116 commented

@dsherry @rpeck @angela97lin I started looking at this issue, but sklearn's StackingClassifier and StackingRegressor classes already use internal cross-validation when training the model in order to prevent overfitting. That appears to be the same problem we are trying to solve with this issue, so it may already be resolved. I don't think we'll need a separate CV split for training/validating the stacked ensembling methods, but what do you all think?

(screenshot: scikit-learn documentation describing stacking's internal cross-validation)
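For reference, here's a minimal usage sketch of scikit-learn's stacking API and its `cv` argument (again assuming a binary classification `X`, `y`); the final estimator is fit on out-of-fold predictions of the base estimators:

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # internal cross-validation: the final estimator is trained on
           # the base estimators' out-of-fold predictions
)
stack.fit(X, y)
```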

bchen1116 commented

After discussion with @dsherry, here's the idea we want to proceed with:


dsherry commented Feb 25, 2021

Plan discussed with @bchen1116. This issue tracks:

  • Create a separate split for training the metalearner for ensemble pipelines
  • Continue to use the sklearn implementation for stacked ensembling

Separate performance enhancement: better support for small data. Don't create a separate split for ensembling. Instead, use out-of-sample pipeline predictions from the normal CV (from across all the CV folds) to train the ensembler. #1898

Another separate performance enhancement: train the pipelines and the metalearner on different data. #1897
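Putting the plan's two bullets together, a rough sketch (hypothetical names: ensemble_pipelines, X_ensemble, and y_ensemble are placeholders, not evalml's actual API):

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

# ensemble_pipelines: hypothetical list of (name, estimator) pairs produced by
# the AutoML search, to be combined in the ensemble.
stack = StackingClassifier(
    estimators=ensemble_pipelines,
    final_estimator=LogisticRegression(),  # the metalearner
    cv=3,
)

# X_ensemble / y_ensemble: the separate split withheld for ensembling; sklearn's
# internal CV runs on this split when training the metalearner.
stack.fit(X_ensemble, y_ensemble)
```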
