
Small data support for ensembles: use out-of-sample preds to train metalearner #1898

Closed
dsherry opened this issue Feb 25, 2021 · 0 comments
Labels
enhancement: An improvement to an existing feature. performance: Issues tracking performance improvements.


dsherry commented Feb 25, 2021

Problem
When automl is given few samples (fewer than 1k total), we should adjust our splitting strategy to make the most of the data we have available.

Background
#1746 tracks creating a separate 20% split for stacked ensemblers.

Proposal
If our data is "small", we should instead do the following:

  • Do not create a separate split for stacked ensemblers
  • Perform normal CV during pipeline training, as we do now
  • When training the stacked ensemble, use the complete set of each pipeline's predictions on the CV validation splits (i.e. its out-of-sample predictions) to train the metalearner.
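The steps above can be sketched as follows. This is an illustrative sketch only, not EvalML's implementation: the candidate pipelines, fold count, and metalearner choice are assumptions, and `cross_val_predict` stands in for collecting each pipeline's validation-split predictions during normal CV.

```python
# Sketch of the proposed "small data" ensembling strategy: rather than
# holding out a separate split for the stacked ensemble, reuse each
# pipeline's out-of-fold CV predictions as training data for the metalearner.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)  # a "small" dataset

# Candidate pipelines (hypothetical stand-ins for AutoML-trained pipelines)
pipelines = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Out-of-fold predictions: every row's prediction comes from a model that
# never saw that row during training, so no data is spent on a holdout.
meta_features = np.column_stack([
    cross_val_predict(p, X, y, cv=3, method="predict_proba")[:, 1]
    for p in pipelines
])

# Train the metalearner on the stacked out-of-sample predictions.
metalearner = LogisticRegression().fit(meta_features, y)
print(meta_features.shape)  # one column of out-of-fold probabilities per pipeline
```

Because every base-pipeline prediction is out-of-sample, the metalearner's training signal is not inflated by base models memorizing their own training rows, which is the usual motivation for this form of stacking.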

We still need to resolve: what data do we use to validate the ensembler itself?

@dsherry dsherry added enhancement An improvement to an existing feature. performance Issues tracking performance improvements. labels Feb 25, 2021
@dsherry dsherry closed this as completed Mar 4, 2021