Separate CV for Stacked Ensembler in AutoMLSearch #1814
Conversation
Codecov Report

```
@@            Coverage Diff            @@
##              main    #1814    +/-  ##
=========================================
+ Coverage    100.0%   100.0%    +0.1%
=========================================
  Files          267      267
  Lines        21400    21536     +136
=========================================
+ Hits         21394    21530     +136
  Misses           6        6
```

Continue to review full report at Codecov.
Hey Bryan, putting changes requested just so we can have a better discussion when I'm less tired, haha. This looks good, I'd just like to think together about what the engine should be doing vs. automl.
evalml/automl/automl_search.py
Outdated
```
@@ -318,6 +321,10 @@ def __init__(self,
                                             num_ensemble_batches)
         else:
             self.max_iterations = 1 + len(self.allowed_pipelines) + (self._pipelines_per_batch * (self.max_batches - 1))
+        if run_ensembling:
+            self.X_train, self.X_ensemble, self.y_train, self.y_ensemble = split_data(self.X_train, self.y_train, problem_type=self.problem_type, test_size=0.20)
```
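For illustration, a minimal plain-pandas sketch of such a holdout split (the helper name is hypothetical; evalml's actual `split_data` also picks the right splitter for classification and time series problems):

```python
import pandas as pd

def split_for_ensembling(X, y, ensembling_split_size=0.20, random_seed=0):
    """Hold out a fraction of the training rows for the stacked ensembler.

    Hypothetical sketch: a plain shuffle, with no stratification.
    """
    holdout = X.sample(frac=ensembling_split_size, random_state=random_seed).index
    train = X.index.difference(holdout)
    return X.loc[train], X.loc[holdout], y.loc[train], y.loc[holdout]

X = pd.DataFrame({"a": range(10)})
y = pd.Series(range(10))
X_train, X_ensemble, y_train, y_ensemble = split_for_ensembling(X, y)
assert len(X_train) == 8 and len(X_ensemble) == 2
```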
Are we good with hardcoding a 1/5 ensembling split for the first crack? I think we should either file a good first issue to let `AutoMLSearch()` take the size of the ensembling split as an argument, or just do it here. Either way, it'll need some checks/logging around what happens if you pass in an `ensembling_split_size` without ensembling enabled.
@chukarsten right, I see your point. Through my discussions with @dsherry, we decided that a 0.8/0.2 split seemed fine for separating the ensembling, but I do see your concern about making it more adjustable/customizable. I don't see a problem with adding it as a (private for now) arg to AutoMLSearch, in case a user wants to fine tune it?
What do you think, @dsherry?
If you guys chatted about it, let's just file an issue to handle it later, then!
I'll leave this up for discussion for now, in case other people have input.
Ah yes, let's please make this configurable! I think it's fine to default it to 20%.
```
@@ -20,7 +20,8 @@ def evaluate_batch(self, pipelines):
         while self._should_continue_callback() and index < len(pipelines):
             pipeline = pipelines[index]
             self._pre_evaluation_callback(pipeline)
-            evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_train, self.y_train)
+            evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_train, self.y_train,
```
I'm not convinced that ensembling-related logic falls within the Engine's area of responsibility. I feel like we should keep the engine responsible for training, fitting, and scoring the pipelines fed to it via `train_and_score_pipeline()` and Freddy's upcoming `fit_batch()` and `evaluate_batch()` functionality. I'm not sure, I'm super tired. It just seems weird passing `X_ensemble` and `y_ensemble` to train_and_score. @dsherry, what do you think?
@chukarsten, do you mean you'd prefer a check like:
```python
# in sequential_engine.py, inside the while loop over pipelines
if pipeline.model_family == ModelFamily.ENSEMBLE:
    evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_ensemble, self.y_ensemble)
else:
    evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_train, self.y_train)
```
versus what I currently have? Or do something different altogether?
Yeah, I think I like that. At the very least, keeping it out of `train_and_score_pipeline` will be helpful.
I like that too @bchen1116 !
@bchen1116 I think this is great! I think the only thing blocking merge is that the concatenation of X_train and X_ensemble should preserve logical types passed in by the user.
```
@@ -20,7 +21,10 @@ def evaluate_batch(self, pipelines):
         while self._should_continue_callback() and index < len(pipelines):
             pipeline = pipelines[index]
             self._pre_evaluation_callback(pipeline)
-            evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_train, self.y_train)
+            if pipeline.model_family == ModelFamily.ENSEMBLE:
```
Nice! I imagine we'll want to do this for all engines, so it might be worth adding it to `EngineBase.evaluate_batch` and then determining the dataset to use with `super`. I don't want to scope-creep, though. Maybe it's best for Karsten to consolidate the logic once the modifications to `SequentialEngine` and `EngineBase` settle down. What are your thoughts @chukarsten @bchen1116?
Sure, that works for me. Would have to see what that looks like with the call to super(). Do you mean putting this whole if/else branch in super and swapping the X/y?
It's trickier than I thought since the logic depends on a single pipeline but the function takes in a list of pipelines. Might be better to encapsulate the logic in a separate method rather than via super. Anyway it may be best to do it later but we should strive to not replicate this in all engines.
@chukarsten @freddyaboulton what if we did something like
```python
# in engine_base.py, in the base Engine class
def handle_train_and_score_pipeline(self, pipeline):
    if pipeline.model_family == ModelFamily.ENSEMBLE:
        evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_ensemble, self.y_ensemble)
    else:
        evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_train, self.y_train)
    return evaluation_result
```
Then in `SequentialEngine`, you can simply call `super().handle_train_and_score_pipeline(pipeline)`? This way, it'll live in the base Engine class, although we wouldn't be adding it to `EngineBase.evaluate_batch`, and it might just be an extra method call that we don't need.
@bchen1116 I like that! I think we'll need to re-use the logic for #1729, so it may be best to return the data to use (`X_ensemble` vs `X_train`) rather than the evaluation result.
I think it's best for me to handle this in the PR for #1729. I'm convinced this is scope creep now, because the best way to do this depends on requirements not related to this original issue.
Sorry for the misdirection @bchen1116!
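A minimal sketch of that suggestion, returning the dataset rather than the evaluation result (the class and method names here are hypothetical stand-ins for evalml's real `EngineBase`/`ModelFamily`):

```python
from enum import Enum

class ModelFamily(Enum):  # stand-in for evalml.model_family.ModelFamily
    ENSEMBLE = "ensemble"
    RANDOM_FOREST = "random_forest"

class Engine:
    """Hypothetical base engine holding both data splits."""
    def __init__(self, X_train, y_train, X_ensemble, y_ensemble):
        self.X_train, self.y_train = X_train, y_train
        self.X_ensemble, self.y_ensemble = X_ensemble, y_ensemble

    def _training_data_for(self, pipeline):
        # Ensembling pipelines train on the held-out split; all others on X_train.
        if pipeline.model_family == ModelFamily.ENSEMBLE:
            return self.X_ensemble, self.y_ensemble
        return self.X_train, self.y_train

class Pipeline:
    def __init__(self, model_family):
        self.model_family = model_family

engine = Engine(X_train=[1, 2, 3], y_train=[0, 1, 0], X_ensemble=[4], y_ensemble=[1])
X, y = engine._training_data_for(Pipeline(ModelFamily.ENSEMBLE))
assert (X, y) == ([4], [1])
```

Each subclass engine could then call the shared helper before `train_and_score_pipeline`, instead of duplicating the if/else.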
Ah, yep that makes sense! Thanks for pointing that out!
This looks good! Just had a comment for my own understanding hehe
```python
if ensembling:
    assert automl.ensembling
    # check that the X_train data is all used for the length
    assert len(X_train) == (len(mock_fit.call_args_list[-2][0][0]) + len(mock_score.call_args_list[-2][0][0]))
```
Trying to follow along, but why are we checking this against the length of the mock_fit and length of the mock_score args? 🤔
This checks that the second-to-last pipeline (which isn't the ensembling pipeline) uses the full X_train dataset between the fit and score calls. `mock_fit.call_args_list[-2][0][0]` gets the X input for the fit call, and the same for `mock_score`. I just did this to make sure we're using the training data we expect, as opposed to the ensembling data.
Hopefully that clarifies!
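For anyone following along, `call_args_list` records the arguments of every call to the mock, so `[-2][0][0]` picks the first positional argument (the X input) of the second-to-last call. A quick illustration with `unittest.mock` (the argument strings are just placeholders):

```python
from unittest.mock import MagicMock

mock_fit = MagicMock()
mock_fit("X_batch_1", "y_batch_1")
mock_fit("X_batch_2", "y_batch_2")
mock_fit("X_ensemble", "y_ensemble")  # the ensembling pipeline fits last

# [-2] -> second-to-last call; [0] -> positional args tuple; [0] -> first arg (X)
assert mock_fit.call_args_list[-2][0][0] == "X_batch_2"
```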
@bchen1116 This looks good to me!!! 🎉 🥳
@bchen1116 this is close! The only blocking comment I have is about the way we're doing the data splitting. Let's find a way to preserve user-overridden woodwork types. I left an example and a couple of options in the comments, but we can discuss.
evalml/automl/automl_search.py
Outdated
```python
else:
    if self.X_ensemble:
        X_train = infer_feature_types(self.X_train.to_dataframe().append(self.X_ensemble.to_dataframe(), ignore_index=True), self.X_feature_types)
        y_train = infer_feature_types(self.y_train.to_series().append(self.y_ensemble.to_series(), ignore_index=True))
```
Hmm. Converting to pandas and then calling `infer_feature_types` again means we lose the custom user-defined types. @angela97lin had to do a bunch of plumbing to avoid this during pipeline evaluation!
An example of this (which would be a great test to add to this PR): the user overrides woodwork to set a text feature as categorical. We mock out the ensemble fit method to ensure that the categorical type makes its way to the metalearner instead of the default inferred text type.
It also feels strange to split up the data in the `AutoMLSearch` constructor and then merge it again here.
The most compelling option I could think of right now is: in `AutoMLSearch` at least, pass around indices instead of dataframes/datatables until the moment we actually need a sub-datatable in order to proceed; then, at that point, slice as needed.
We could also keep references to both a) the complete data and b) the split data, and use whichever we need when we need it. The difficulty there is that it effectively doubles our memory consumption.
Let's discuss.
@dsherry I see what you mean. For the custom user-defined types, wouldn't passing `self.X_train.logical_types` to `infer_feature_types` solve that problem?
As for the issue of splitting and merging unnecessarily, I think the best idea would be to keep a reference to the indices of the ensembling data rather than the full split ensembling data. This would also help with memory consumption, while keeping the same functionality as before. What do you think?
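A rough sketch of the index-tracking idea (plain pandas/numpy, variable names hypothetical): keep only the ensembling row indices and slice the full table lazily, so no concatenation is ever needed to recover the complete data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({"a": range(100)})

# Store only the row indices of the 20% ensembling split...
ensemble_indices = pd.Index(rng.choice(X.index, size=20, replace=False))
train_indices = X.index.difference(ensemble_indices)

# ...and slice a sub-table only at the moment it is actually needed.
X_ensemble = X.loc[ensemble_indices]
X_train = X.loc[train_indices]
assert len(X_train) + len(X_ensemble) == len(X)

# Training the best pipeline on everything needs no concat: just use X.
```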
@bchen1116 oh, maybe so! @angela97lin, does that line up with your understanding? I suggest you check with her on this.

> I think the best idea would then be to keep a reference to the indices of the ensembling data rather than the full split ensembling data.

If I'm following you correctly here, yep, I'm suggesting we look into doing the splitting on indices first instead of on datatables.
If using indices is not required to fix the type problem, we could file it as a separate issue, but let's be careful about the code we use to do the concatenation here. In fact, I'd suggest we use `pd.concat` instead of `append` here.
@bchen1116 @dsherry
RE using `infer_feature_types`: if the only thing that's happening is that we're taking a subset of rows, then passing `self.X_train.logical_types` to `infer_feature_types` should suffice, since nothing is happening to the `logical_types`.
For the transformers, we converted to pandas and then potentially transformed the data before converting back to Woodwork, so extra plumbing was necessary (and stored in `_retain_custom_types_and_initalize_woodwork`).
Of course, I agree that testing would be important here to make sure we're not missing anything :)
I do agree that it feels pretty strange to split up the data in the `AutoMLSearch` constructor and then merge it again here, though...
Just updated the implementation to track the indices instead, so we no longer need any `concat`/`append` logic when training the best pipeline.
@bchen1116 great!!! I left one blocking comment: we should use `split_data` to compute the ensembling split in the `AutoMLSearch` constructor. Otherwise, we won't be using the correct split strategy for classification data, time series data, etc.
Fixes #1746
Perf test results are here
As @chukarsten brings up here, should we add an `ensembling_split` arg to AutoMLSearch so that users can define what percentage of the data they want for the ensembling split? Should we test other split sizes?