AutoML: use separate CV split for ensembling #1746
Plan: if stacking is enabled, we'll create a separate split which can be passed to stacked ensembling for CV.

It could be neat to look into supporting the use of out-of-sample predictions (the validation splits from the original CV) as the data passed to stacking. However, I suggest we start with the simpler approach of just creating a separate split when stacking is enabled.

RE our discussion, some supporting evidence for why we should withhold a separate split which stacked ensembling can use to perform CV:
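A minimal sketch of the "simpler approach" described above: carve out a dedicated split for the stacked ensemble before the usual AutoML CV runs, so the metalearner's CV never touches rows the base pipelines trained on. The variable names and the 20% ensembling fraction are illustrative assumptions, not evalml's actual API.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the user's training set.
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = rng.randint(0, 2, size=100)

# Hold out a portion of the data exclusively for the ensembler's CV,
# analogous to carving out a split before the main AutoML search.
X_automl, X_ensemble, y_automl, y_ensemble = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_automl.shape, X_ensemble.shape)  # (80, 5) (20, 5)
```

The base pipelines would then be searched and tuned on `X_automl` only, while `X_ensemble` is reserved for cross-validating the stacked ensemble's metalearner.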
@rpeck FYI, after some spinning we are following your suggestion 😆
@dsherry @rpeck @angela97lin I started looking at this issue, but it seems like sklearn's
Plan discussed with @bchen1116: this issue tracks:

- Separate performance enhancement: better support for small data. Don't create a separate split for ensembling. Use out-of-sample pipeline predictions from normal CV (from across all the CV folds) to train the ensembler. #1898
- Another separate performance enhancement: train the pipelines and the metalearner on different data. #1897
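The #1898 idea above can be sketched with scikit-learn: reuse each pipeline's out-of-fold CV predictions, rather than a separate split, as the training data for the metalearner. The two base models and the `LogisticRegression` metalearner here are illustrative stand-ins for AutoML's pipelines, not the actual implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
base_models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
]

# Each column holds one model's out-of-fold predicted probabilities,
# so every row is a prediction made by a model that never saw that row.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=3, method="predict_proba")[:, 1]
    for m in base_models
])

# Train the metalearner on the out-of-sample predictions only.
metalearner = LogisticRegression().fit(meta_features, y)
print(meta_features.shape)  # (200, 2)
```

This avoids withholding extra data from the base pipelines, which is why it's framed as a better fit for small datasets.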
#1732 updates our stacked ensembling pipeline to use the same data splitter that is used in AutoML. However, @rpeck noted that we may not want to do this. We continued with #1732 because we believed it was still an improvement over our current approach (the scikit-learn default). This issue tracks long-term updates we may want to make to our data splitter for stacking in AutoML.
Update: while continuing work on #1732, we ran into a conundrum with the interaction between stacking and AutoML that made us revisit whether it really was a good idea to use the same data split for the stacking ensemble as we use in AutoML. We decided no: as Raymond had pointed out, we probably want a separate CV for our ensemble. (@dsherry also mentioned a good nugget of info that using CV for ensembling keeps the model from putting excessive importance on the more complex models, so a separate CV probably helps with that — please correct me if I've paraphrased incorrectly 😂.)
Rather than continuing that work, then, we should use this issue to discuss updating AutoML for stacking. Specifically, we should create a separate CV split for the stacked ensembling, similar to what we currently have in place for binary-threshold tuning.