Separate CV for Stacked Ensembler in AutoMLSearch #1814

Merged: 31 commits merged into main from bc_1746_cv on Mar 3, 2021

Conversation

@bchen1116 (Contributor) commented on Feb 10, 2021:

Fixes #1746

Perf test results are here

As @chukarsten brings up here, should we add an ensembling_split arg to AutoMLSearch so that users can define what % of the data they want for the ensembling split? Should we test other split sizes?

@bchen1116 self-assigned this on Feb 10, 2021
@codecov (bot) commented on Feb 10, 2021:

Codecov Report

Merging #1814 (19ae76e) into main (79e4f75) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #1814     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         267      267             
  Lines       21400    21536    +136     
=========================================
+ Hits        21394    21530    +136     
  Misses          6        6             
Impacted Files (Coverage Δ):
evalml/automl/automl_search.py 100.0% <100.0%> (ø)
evalml/automl/engine/engine_base.py 100.0% <100.0%> (ø)
evalml/automl/engine/sequential_engine.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_sequential_engine.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 79e4f75...19ae76e

@bchen1116 marked this pull request as ready for review on February 26, 2021, 18:54
@chukarsten (Contributor) left a review:

Hey Bryan, putting changes requested just so we can have a better discussion when I'm less tired, haha. This looks good, I'd just like to think together about what the engine should be doing vs. automl.

@@ -318,6 +321,10 @@ def __init__(self,
                                              num_ensemble_batches)
         else:
             self.max_iterations = 1 + len(self.allowed_pipelines) + (self._pipelines_per_batch * (self.max_batches - 1))
+        if run_ensembling:
+            self.X_train, self.X_ensemble, self.y_train, self.y_ensemble = split_data(self.X_train, self.y_train, problem_type=self.problem_type, test_size=0.20)
@chukarsten (Contributor):

Are we good with hardcoding a 1/5th ensembling test split for the first crack? I think we should either file a good first issue to give AutoMLSearch() the size of the ensembling split, or just do it here. Either way, I think it'll need some checks/logging around what happens if you pass in an ensembling_split_size without ensembling being enabled.

@bchen1116 (Author) commented on Mar 1, 2021:

@chukarsten right, I see your point. Through my discussions with @dsherry, we decided that a 0.8/0.2 split seemed fine for separating the ensembling data, but I do see your concern about making it more adjustable/customizable. I don't see a problem with adding it as a (private, for now) arg to AutoMLSearch, in case a user wants to fine-tune it?

What do you think, @dsherry?
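
(For reference, a minimal sketch of what a configurable split could look like, building on the split_data call in the diff above. The make_ensembling_split helper and its ensembling_split_size parameter are hypothetical, purely for illustration; the PR itself hardcodes test_size=0.20, and split_data is assumed to be importable from evalml.preprocessing.)

from evalml.preprocessing import split_data

def make_ensembling_split(X, y, problem_type, ensembling_split_size=0.2):
    # Hold out a fraction of the data exclusively for training/validating the
    # stacked ensembler; the rest stays available for the base pipelines.
    X_train, X_ensemble, y_train, y_ensemble = split_data(
        X, y, problem_type=problem_type, test_size=ensembling_split_size
    )
    return X_train, X_ensemble, y_train, y_ensemble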

Contributor:

If you guys chatted about it, let's just file an issue to handle it later, then!

@bchen1116 (Author):

I'll leave this up for discussion for now, in case other people have inputs.

Contributor:

Ah yes, let's please make this configurable! I think it's fine to default it to 20%.

evalml/automl/engine/engine_base.py (review thread resolved, outdated)
evalml/tests/automl_tests/test_automl.py (review thread resolved, outdated)
@@ -20,7 +20,8 @@ def evaluate_batch(self, pipelines):
         while self._should_continue_callback() and index < len(pipelines):
             pipeline = pipelines[index]
             self._pre_evaluation_callback(pipeline)
-            evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_train, self.y_train)
+            evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_train, self.y_train,
@chukarsten (Contributor):

I'm not convinced that ensembling-related logic falls within the Engine's area of responsibility. I feel like we should keep the engine responsible for training, fitting, and scoring the pipelines fed to it via train_and_score_pipeline() and Freddy's upcoming fit_batch() and evaluate_batch() functionality. I'm not sure; I'm super tired. It just seems weird passing X_ensemble and y_ensemble to train_and_score. @dsherry, what do you think?

@bchen1116 (Author):

@chukarsten, do you mean you'd prefer a check like:

# in sequential_engine.py, inside the while loop over pipelines
if pipeline.model_family == ModelFamily.ENSEMBLE:
    evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_ensemble, self.y_ensemble)
else:
    evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_train, self.y_train)

versus what I currently have? Or do something different altogether?

@chukarsten (Contributor):

Yeah, I think I like that. At the least, keeping it out of train_and_score_pipeline will be helpful.

Contributor:

I like that too, @bchen1116!

@chukarsten (Contributor) left a review comment containing only an image (not preserved).

@freddyaboulton (Contributor) left a review:

@bchen1116 I think this is great! I think the only thing blocking merge is that the concatenation of X_train and X_ensemble should preserve logical types passed in by the user.

evalml/automl/automl_search.py (review thread resolved, outdated)
@@ -20,7 +21,10 @@ def evaluate_batch(self, pipelines):
         while self._should_continue_callback() and index < len(pipelines):
             pipeline = pipelines[index]
             self._pre_evaluation_callback(pipeline)
-            evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_train, self.y_train)
+            if pipeline.model_family == ModelFamily.ENSEMBLE:
@freddyaboulton (Contributor):

Nice! I imagine we'll want to do this for all engines, so it might be worth adding it to EngineBase.evaluate_batch and then determining the dataset to use via super. I don't want to scope creep, though. Maybe it's best for Karsten to consolidate the logic once the modifications to SequentialEngine and EngineBase settle down. What are your thoughts, @chukarsten @bchen1116?

@chukarsten (Contributor):

Sure, that works for me. I'd have to see what that looks like with the call to super(). Do you mean putting this whole if/else branch in super and swapping the X/y?

@freddyaboulton (Contributor):

It's trickier than I thought, since the logic depends on a single pipeline but the function takes in a list of pipelines. It might be better to encapsulate the logic in a separate method rather than via super. Anyway, it may be best to do it later, but we should strive not to replicate this in all engines.

@bchen1116 (Author):

@chukarsten @freddyaboulton what if we did something like

# in engine_base.py, in the base engine class
def handle_train_and_score_pipeline(self, pipeline):
    if pipeline.model_family == ModelFamily.ENSEMBLE:
        evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_ensemble, self.y_ensemble)
    else:
        evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_train, self.y_train)
    return evaluation_result

Then in SequentialEngine, you can simply call super().handle_train_and_score_pipeline(pipeline)?

This way, it'll be in the base Engine class, although we wouldn't be adding it to EngineBase.evaluate_batch, and it might just be an extra method call that we don't need.

@freddyaboulton (Contributor):

@bchen1116 I like that! I think we'll need to re-use the logic for #1729, so it may be best to return the data to use (X_ensemble vs X_train) rather than the evaluation result.

I think it's best for me to handle this in the PR for #1729. I'm convinced this is scope creep now, because the best way to do this depends on requirements not related to this original issue.

Sorry for the misdirection @bchen1116!
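
(A rough sketch of the direction described above, with the helper returning the data to use rather than the evaluation result. The EngineSketch class and _training_data_for method are illustrative names, not the actual engine API.)

from evalml.model_family import ModelFamily

class EngineSketch:
    def __init__(self, X_train, y_train, X_ensemble=None, y_ensemble=None):
        self.X_train, self.y_train = X_train, y_train
        self.X_ensemble, self.y_ensemble = X_ensemble, y_ensemble

    def _training_data_for(self, pipeline):
        # Ensemble pipelines train on the held-out ensembling split;
        # every other pipeline trains on the regular training split.
        if pipeline.model_family == ModelFamily.ENSEMBLE:
            return self.X_ensemble, self.y_ensemble
        return self.X_train, self.y_train

A concrete engine could then pass whatever _training_data_for returns into train_and_score_pipeline, which is the kind of re-use #1729 would need.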

@bchen1116 (Author):

Ah, yep that makes sense! Thanks for pointing that out!

evalml/tests/automl_tests/test_automl.py (review thread resolved)
@angela97lin (Contributor) left a review:

This looks good! I just had a comment for my own understanding, hehe.

if ensembling:
    assert automl.ensembling
    # check that all of X_train is used across the fit and score calls
    assert len(X_train) == (len(mock_fit.call_args_list[-2][0][0]) + len(mock_score.call_args_list[-2][0][0]))
@angela97lin (Contributor):

Trying to follow along, but why are we checking this against the length of the mock_fit and length of the mock_score args? 🤔

@bchen1116 (Author):

This checks that the second-to-last pipeline (which isn't the ensembler) uses the full X_train dataset across its fit and score calls. The expression mock_fit.call_args_list[-2][0][0] gets the X input for that fit call, and the same indexing works for mock_score. I did this to make sure we're using the training data we expect, as opposed to the ensembling data.

Hopefully that clarifies!
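
(For anyone following along, the indexing described here is plain unittest.mock behavior; a tiny self-contained illustration, unrelated to the evalml test itself:)

from unittest.mock import MagicMock

mock_fit = MagicMock()
mock_fit("X_fold_1", "y_fold_1")  # an earlier call
mock_fit("X_fold_2", "y_fold_2")  # the most recent call

# call_args_list[-2] is the second-to-last call; the first [0] selects its
# positional arguments, and the second [0] selects the first of them, i.e. X.
assert mock_fit.call_args_list[-2][0][0] == "X_fold_1"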

@freddyaboulton (Contributor) left a review:

@bchen1116 This looks good to me!!! 🎉 🥳

@dsherry (Contributor) left a review:

@bchen1116 this is close! The only blocking comment I have is about the way we're doing the data splitting. Let's find a way to preserve user-overridden woodwork types. I left an example and a couple of options in the comments, but we can discuss.

evalml/automl/engine/sequential_engine.py (review thread resolved, outdated)
evalml/automl/engine/engine_base.py (review thread resolved, outdated)
evalml/automl/automl_search.py (review thread resolved, outdated)
evalml/automl/automl_search.py (review thread resolved, outdated)
evalml/automl/automl_search.py (review thread resolved)
evalml/automl/automl_search.py (review thread resolved, outdated)
else:
    if self.X_ensemble:
        X_train = infer_feature_types(self.X_train.to_dataframe().append(self.X_ensemble.to_dataframe(), ignore_index=True), self.X_feature_types)
        y_train = infer_feature_types(self.y_train.to_series().append(self.y_ensemble.to_series(), ignore_index=True))
@dsherry (Contributor):

Hmm. Converting to pandas and then calling infer_feature_types again means we lose the custom user-defined types. @angela97lin had to do a bunch of plumbing to avoid this during pipeline evaluation!

An example of this (which would be a great test to add to this PR): user overrides woodwork to set a text feature as categorical. We mock out the ensemble fit method to ensure that the categorical type makes its way to the metalearner instead of the default inferred text type.

It also feels strange to split up the data in the AutoMLSearch constructor, and then merge it again here.

The most compelling option I could think of right now is: in AutoMLSearch at least, pass around indices instead of dataframes/datatables, until the moment we actually need a sub-datatable in order to proceed; then at that point, slice as needed.

We could also keep references to both a) the complete data and b) the split data, and use whichever we need when we need it. The difficulty there is that it effectively doubles our memory consumption.

Let's discuss.
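
(A rough sketch of the failure mode being described, assuming infer_feature_types is importable from evalml.utils and accepts a column-to-logical-type mapping as its second argument, as in the snippet above; the column names below are made up:)

import pandas as pd
from evalml.utils import infer_feature_types

df = pd.DataFrame({"notes": ["some free text"] * 10, "value": range(10)})

# User override: force the text-like column to be treated as categorical.
X = infer_feature_types(df, {"notes": "categorical"})

# Round-tripping through pandas and re-inferring without the override can
# fall back to whatever type gets inferred by default for that column...
X_lost = infer_feature_types(X.to_dataframe())

# ...while passing the original logical types back preserves the override.
X_kept = infer_feature_types(X.to_dataframe(), {"notes": "categorical"})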

@bchen1116 (Author):

@dsherry I see what you mean. For the custom user-defined types, wouldn't passing in self.X_train.logical_types to infer_feature_types solve that problem?

As for the issue of splitting and merging unnecessarily, I think the best idea would be to keep a reference to the indices of the ensembling data rather than the full split-out ensembling data. This would also help with memory consumption and should keep the same functionality as before. What do you think?

@dsherry (Contributor):

@bchen1116 oh maybe so! @angela97lin does that line up with your understanding? I suggest you check with her on this.

Quoting @bchen1116: "I think the best idea would then be to keep a reference to the indices of the ensembling data rather than the full split ensembling data."

If I'm following you correctly here, yep, I'm suggesting we look into doing the splitting on indices first instead of on datatables.

If using indices is not required to fix the type problem, we could file it as a separate issue, but let's be careful about the code we use to do the concatenation here. In fact I'd suggest we use pd.concat instead of append here.
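
(For concreteness, a minimal sketch of the pd.concat alternative that re-applies the stored feature types after concatenation; the helper name and arguments are illustrative:)

import pandas as pd
from evalml.utils import infer_feature_types

def recombine_train_and_ensemble(X_train, X_ensemble, feature_types=None):
    # Concatenate the two splits back into one frame, then re-apply the
    # user-specified feature types so custom overrides are not lost.
    combined = pd.concat([X_train.to_dataframe(), X_ensemble.to_dataframe()], ignore_index=True)
    return infer_feature_types(combined, feature_types)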

@angela97lin (Contributor):

@bchen1116 @dsherry Re: using infer_feature_types: if the only thing that's happening is that we're taking a subset of rows, then passing self.X_train.logical_types to infer_feature_types should suffice, since nothing is happening to the logical_types.

For the transformers, we converted to pandas, and then potentially transformed the data before we converted back to Woodwork, so extra plumbing was necessary (and stored in _retain_custom_types_and_initalize_woodwork).

Of course, I agree that testing would be important here to make sure we're not missing anything :)

I do agree that it feels pretty strange to split up the data in the AutoMLSearch constructor, and then merge it again here though...

@bchen1116 (Author):

Just updated the implementation to track the indices instead, so we no longer need to do any concat/append logic when training the best pipeline.
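
(A rough sketch of the index-tracking idea, shown on plain pandas objects for brevity; the split_indices helper is illustrative, not the code that was merged:)

import numpy as np
import pandas as pd

def split_indices(n_rows, ensembling_indices):
    # Keep only row positions; the full dataset stays in one place and
    # sub-frames are sliced out lazily, exactly when they are needed.
    ensembling_indices = np.asarray(ensembling_indices)
    mask = np.ones(n_rows, dtype=bool)
    mask[ensembling_indices] = False
    return np.flatnonzero(mask), ensembling_indices

X_full = pd.DataFrame({"a": range(10)})
y_full = pd.Series(range(10))
train_idx, ens_idx = split_indices(len(X_full), [8, 9])
X_train, y_train = X_full.iloc[train_idx], y_full.iloc[train_idx]
X_ensemble, y_ensemble = X_full.iloc[ens_idx], y_full.iloc[ens_idx]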

@dsherry (Contributor) left a review:

@bchen1116 great!!! I left one blocking comment: we should use split_data to compute the ensembling split in the AutoMLSearch constructor. Otherwise we won't be using the correct split strategy for classification data, time series data, etc.

evalml/automl/automl_search.py (review thread resolved, outdated)
evalml/automl/automl_search.py (review thread resolved)
evalml/automl/automl_search.py (review thread resolved)
evalml/automl/engine/sequential_engine.py (review thread resolved)
evalml/automl/engine/sequential_engine.py (review thread resolved, outdated)
evalml/tests/automl_tests/test_sequential_engine.py (review thread resolved, outdated)
@bchen1116 requested a review from @dsherry on March 3, 2021, 16:31
@bchen1116 merged commit c499006 into main on Mar 3, 2021
@dsherry mentioned this pull request on Mar 11, 2021
@freddyaboulton deleted the bc_1746_cv branch on May 13, 2022, 15:35
Successfully merging this pull request may close: AutoML: use separate CV split for ensembling (#1746)
5 participants