Separate CV for Stacked Ensembler in AutoMLSearch #1814

Merged: 31 commits merged into main from bc_1746_cv on Mar 3, 2021

Conversation

@bchen1116 (Contributor) commented on Feb 10, 2021:

Fixes #1746

Perf test results are here

As @chukarsten brings up here, should we add an ensembling_split arg to AutoMLSearch so that users can define what % of the data they want for the ensembling split? Should we test other split sizes?

@bchen1116 self-assigned this on Feb 10, 2021
@codecov (bot) commented on Feb 10, 2021:

Codecov Report

Merging #1814 (19ae76e) into main (79e4f75) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #1814     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         267      267             
  Lines       21400    21536    +136     
=========================================
+ Hits        21394    21530    +136     
  Misses          6        6             
Impacted Files (Coverage Δ):
evalml/automl/automl_search.py 100.0% <100.0%> (ø)
evalml/automl/engine/engine_base.py 100.0% <100.0%> (ø)
evalml/automl/engine/sequential_engine.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_sequential_engine.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 79e4f75...19ae76e

@bchen1116 marked this pull request as ready for review on February 26, 2021, 18:54
@chukarsten (Contributor) left a review:

Hey Bryan, putting changes requested just so we can have a better discussion when I'm less tired, haha. This looks good, I'd just like to think together about what the engine should be doing vs. automl.

@@ -318,6 +321,10 @@ def __init__(self,
                                              num_ensemble_batches)
         else:
             self.max_iterations = 1 + len(self.allowed_pipelines) + (self._pipelines_per_batch * (self.max_batches - 1))
+        if run_ensembling:
+            self.X_train, self.X_ensemble, self.y_train, self.y_ensemble = split_data(self.X_train, self.y_train, problem_type=self.problem_type, test_size=0.20)
@chukarsten (Contributor):

Are we good with hardcoding a 1/5th ensembling test split for the first crack? I think we should either file a good first issue to give AutoMLSearch() the size of the ensembling split, or just do it here. Either way, I think it'll need some checks/logging around what happens if you pass in an ensembling_split_size without ensembling being enabled.

@bchen1116 (Author) commented on Mar 1, 2021:

@chukarsten right, I see your point. Through my discussions with @dsherry, we decided that a 0.8/0.2 split seemed fine for separating the ensembling data, but I do see your concern about making it more adjustable/customizable. I don't see a problem with adding it as a (private, for now) arg to AutoMLSearch, in case a user wants to fine-tune it?

What do you think, @dsherry?
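
(For reference, a minimal sketch of what a configurable split could look like, building on the split_data call in the diff above. The make_ensembling_split helper and its ensembling_split_size parameter are hypothetical, purely for illustration; the PR itself hardcodes test_size=0.20, and split_data is assumed to be importable from evalml.preprocessing.)

from evalml.preprocessing import split_data

def make_ensembling_split(X, y, problem_type, ensembling_split_size=0.2):
    # Hold out a fraction of the data exclusively for training/validating the
    # stacked ensembler; the rest stays available for the base pipelines.
    X_train, X_ensemble, y_train, y_ensemble = split_data(
        X, y, problem_type=problem_type, test_size=ensembling_split_size
    )
    return X_train, X_ensemble, y_train, y_ensemble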

Contributor:

If you guys chatted about it, let's just file an issue to handle it later, then!

@bchen1116 (Author):

I'll leave this up for discussion for now, in case other people have inputs.

Contributor:

Ah yes, let's please make this configurable! I think it's fine to default it to 20%.

evalml/automl/engine/engine_base.py (review thread resolved, outdated)
evalml/tests/automl_tests/test_automl.py (review thread resolved, outdated)
@@ -20,7 +20,8 @@ def evaluate_batch(self, pipelines):
         while self._should_continue_callback() and index < len(pipelines):
             pipeline = pipelines[index]
             self._pre_evaluation_callback(pipeline)
-            evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_train, self.y_train)
+            evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_train, self.y_train,
@chukarsten (Contributor):

I'm not convinced that ensembling-related logic falls within the Engine's area of responsibility. I feel like we should keep the engine responsible for training, fitting, and scoring the pipelines fed to it via train_and_score_pipeline() and Freddy's upcoming fit_batch() and evaluate_batch() functionality. I'm not sure; I'm super tired. It just seems weird passing X_ensemble and y_ensemble to train_and_score. @dsherry, what do you think?

@bchen1116 (Author):

@chukarsten, do you mean you'd prefer a check like:

# in sequential_engine.py, inside the while loop over pipelines
if pipeline.model_family == ModelFamily.ENSEMBLE:
    evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_ensemble, self.y_ensemble)
else:
    evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_train, self.y_train)

versus what I currently have? Or do something different altogether?

@chukarsten (Contributor):

Yeah, I think I like that. At the least, keeping it out of train_and_score_pipeline will be helpful.

Contributor:

I like that too, @bchen1116!

@chukarsten (Contributor) left a review comment containing only an image (not preserved).

@freddyaboulton (Contributor) left a review:

@bchen1116 I think this is great! I think the only thing blocking merge is that the concatenation of X_train and X_ensemble should preserve logical types passed in by the user.

evalml/automl/automl_search.py (review thread resolved, outdated)
@@ -20,7 +21,10 @@ def evaluate_batch(self, pipelines):
         while self._should_continue_callback() and index < len(pipelines):
             pipeline = pipelines[index]
             self._pre_evaluation_callback(pipeline)
-            evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_train, self.y_train)
+            if pipeline.model_family == ModelFamily.ENSEMBLE:
@freddyaboulton (Contributor):

Nice! I imagine we'll want to do this for all engines, so it might be worth adding it to EngineBase.evaluate_batch and then determining the dataset to use via super. I don't want to scope creep, though. Maybe it's best for Karsten to consolidate the logic once the modifications to SequentialEngine and EngineBase settle down. What are your thoughts, @chukarsten @bchen1116?

@chukarsten (Contributor):

Sure, that works for me. I'd have to see what that looks like with the call to super(). Do you mean putting this whole if/else branch in super and swapping the X/y?

@freddyaboulton (Contributor):

It's trickier than I thought, since the logic depends on a single pipeline but the function takes in a list of pipelines. It might be better to encapsulate the logic in a separate method rather than via super. Anyway, it may be best to do it later, but we should strive not to replicate this in all engines.

@bchen1116 (Author):

@chukarsten @freddyaboulton what if we did something like

# in engine_base.py, in the base engine class
def handle_train_and_score_pipeline(self, pipeline):
    if pipeline.model_family == ModelFamily.ENSEMBLE:
        evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_ensemble, self.y_ensemble)
    else:
        evaluation_result = EngineBase.train_and_score_pipeline(pipeline, self.automl, self.X_train, self.y_train)
    return evaluation_result

Then in SequentialEngine, you can simply call super().handle_train_and_score_pipeline(pipeline)?

This way, it'll be in the base Engine class, although we wouldn't be adding it to EngineBase.evaluate_batch, and it might just be an extra method call that we don't need.

@freddyaboulton (Contributor):

@bchen1116 I like that! I think we'll need to re-use the logic for #1729, so it may be best to return the data to use (X_ensemble vs X_train) rather than the evaluation result.

I think it's best for me to handle this in the PR for #1729. I'm convinced this is scope creep now, because the best way to do this depends on requirements not related to this original issue.

Sorry for the misdirection @bchen1116!
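
(A rough sketch of the direction described above, with the helper returning the data to use rather than the evaluation result. The EngineSketch class and _training_data_for method are illustrative names, not the actual engine API.)

from evalml.model_family import ModelFamily

class EngineSketch:
    def __init__(self, X_train, y_train, X_ensemble=None, y_ensemble=None):
        self.X_train, self.y_train = X_train, y_train
        self.X_ensemble, self.y_ensemble = X_ensemble, y_ensemble

    def _training_data_for(self, pipeline):
        # Ensemble pipelines train on the held-out ensembling split;
        # every other pipeline trains on the regular training split.
        if pipeline.model_family == ModelFamily.ENSEMBLE:
            return self.X_ensemble, self.y_ensemble
        return self.X_train, self.y_train

A concrete engine could then pass whatever _training_data_for returns into train_and_score_pipeline, which is the kind of re-use #1729 would need.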

@bchen1116 (Author):

Ah, yep that makes sense! Thanks for pointing that out!

evalml/tests/automl_tests/test_automl.py (review thread resolved)
@angela97lin (Contributor) left a review:

This looks good! I just had a comment for my own understanding, hehe.

if ensembling:
    assert automl.ensembling
    # check that all of X_train is used across the fit and score calls
    assert len(X_train) == (len(mock_fit.call_args_list[-2][0][0]) + len(mock_score.call_args_list[-2][0][0]))
@angela97lin (Contributor):

Trying to follow along, but why are we checking this against the length of the mock_fit and length of the mock_score args? 🤔

@bchen1116 (Author):

This checks that the second-to-last pipeline (which isn't the ensembler) uses the full X_train dataset across its fit and score calls. The expression mock_fit.call_args_list[-2][0][0] gets the X input for that fit call, and the same indexing works for mock_score. I did this to make sure we're using the training data we expect, as opposed to the ensembling data.

Hopefully that clarifies!
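
(For anyone following along, the indexing described here is plain unittest.mock behavior; a tiny self-contained illustration, unrelated to the evalml test itself:)

from unittest.mock import MagicMock

mock_fit = MagicMock()
mock_fit("X_fold_1", "y_fold_1")  # an earlier call
mock_fit("X_fold_2", "y_fold_2")  # the most recent call

# call_args_list[-2] is the second-to-last call; the first [0] selects its
# positional arguments, and the second [0] selects the first of them, i.e. X.
assert mock_fit.call_args_list[-2][0][0] == "X_fold_1"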

@freddyaboulton (Contributor) left a review:

@bchen1116 This looks good to me!!! 🎉 🥳

@dsherry (Contributor) left a review:

@bchen1116 this is close! The only blocking comment I have is about the way we're doing the data splitting. Let's find a way to preserve user-overridden woodwork types. I left an example and a couple of options in the comments, but we can discuss.

evalml/automl/engine/sequential_engine.py (review thread resolved, outdated)
evalml/automl/engine/engine_base.py (review thread resolved, outdated)
evalml/automl/automl_search.py (review thread resolved, outdated)
evalml/automl/automl_search.py (review thread resolved, outdated)
evalml/automl/automl_search.py (review thread resolved)
evalml/automl/automl_search.py (review thread resolved, outdated)
else:
    if self.X_ensemble:
        X_train = infer_feature_types(self.X_train.to_dataframe().append(self.X_ensemble.to_dataframe(), ignore_index=True), self.X_feature_types)
        y_train = infer_feature_types(self.y_train.to_series().append(self.y_ensemble.to_series(), ignore_index=True))
@dsherry (Contributor):

Hmm. Converting to pandas and then calling infer_feature_types again means we lose the custom user-defined types. @angela97lin had to do a bunch of plumbing to avoid this during pipeline evaluation!

An example of this (which would be a great test to add to this PR): user overrides woodwork to set a text feature as categorical. We mock out the ensemble fit method to ensure that the categorical type makes its way to the metalearner instead of the default inferred text type.

It also feels strange to split up the data in the AutoMLSearch constructor, and then merge it again here.

The most compelling option I could think of right now is: in AutoMLSearch at least, pass around indices instead of dataframes/datatables, until the moment we actually need a sub-datatable in order to proceed; then at that point, slice as needed.

We could also keep references to both a) the complete data and b) the split data, and use whichever we need when we need it. The difficulty there is that it effectively doubles our memory consumption.

Let's discuss.
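
(A rough sketch of the failure mode being described, assuming infer_feature_types is importable from evalml.utils and accepts a column-to-logical-type mapping as its second argument, as in the snippet above; the column names below are made up:)

import pandas as pd
from evalml.utils import infer_feature_types

df = pd.DataFrame({"notes": ["some free text"] * 10, "value": range(10)})

# User override: force the text-like column to be treated as categorical.
X = infer_feature_types(df, {"notes": "categorical"})

# Round-tripping through pandas and re-inferring without the override can
# fall back to whatever type gets inferred by default for that column...
X_lost = infer_feature_types(X.to_dataframe())

# ...while passing the original logical types back preserves the override.
X_kept = infer_feature_types(X.to_dataframe(), {"notes": "categorical"})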

@bchen1116 (Author):

@dsherry I see what you mean. For the custom user-defined types, wouldn't passing in self.X_train.logical_types to infer_feature_types solve that problem?

As for the issue of splitting and merging unnecessarily, I think the best idea would be to keep a reference to the indices of the ensembling data rather than the full split-out ensembling data. This would also help with memory consumption and should keep the same functionality as before. What do you think?

@dsherry (Contributor):

@bchen1116 oh maybe so! @angela97lin does that line up with your understanding? I suggest you check with her on this.

Quoting @bchen1116: "I think the best idea would then be to keep a reference to the indices of the ensembling data rather than the full split ensembling data."

If I'm following you correctly here, yep, I'm suggesting we look into doing the splitting on indices first instead of on datatables.

If using indices is not required to fix the type problem, we could file it as a separate issue, but let's be careful about the code we use to do the concatenation here. In fact I'd suggest we use pd.concat instead of append here.
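
(For concreteness, a minimal sketch of the pd.concat alternative that re-applies the stored feature types after concatenation; the helper name and arguments are illustrative:)

import pandas as pd
from evalml.utils import infer_feature_types

def recombine_train_and_ensemble(X_train, X_ensemble, feature_types=None):
    # Concatenate the two splits back into one frame, then re-apply the
    # user-specified feature types so custom overrides are not lost.
    combined = pd.concat([X_train.to_dataframe(), X_ensemble.to_dataframe()], ignore_index=True)
    return infer_feature_types(combined, feature_types)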

@angela97lin (Contributor):

@bchen1116 @dsherry Re: using infer_feature_types: if the only thing that's happening is that we're taking a subset of rows, then passing self.X_train.logical_types to infer_feature_types should suffice, since nothing is happening to the logical_types.

For the transformers, we converted to pandas, and then potentially transformed the data before we converted back to Woodwork, so extra plumbing was necessary (and stored in _retain_custom_types_and_initalize_woodwork).

Of course, I agree that testing would be important here to make sure we're not missing anything :)

I do agree that it feels pretty strange to split up the data in the AutoMLSearch constructor, and then merge it again here though...

@bchen1116 (Author):

Just updated the implementation to track the indices instead, so we no longer need to do any concat/append logic when training the best pipeline.
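
(A rough sketch of the index-tracking idea, shown on plain pandas objects for brevity; the split_indices helper is illustrative, not the code that was merged:)

import numpy as np
import pandas as pd

def split_indices(n_rows, ensembling_indices):
    # Keep only row positions; the full dataset stays in one place and
    # sub-frames are sliced out lazily, exactly when they are needed.
    ensembling_indices = np.asarray(ensembling_indices)
    mask = np.ones(n_rows, dtype=bool)
    mask[ensembling_indices] = False
    return np.flatnonzero(mask), ensembling_indices

X_full = pd.DataFrame({"a": range(10)})
y_full = pd.Series(range(10))
train_idx, ens_idx = split_indices(len(X_full), [8, 9])
X_train, y_train = X_full.iloc[train_idx], y_full.iloc[train_idx]
X_ensemble, y_ensemble = X_full.iloc[ens_idx], y_full.iloc[ens_idx]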

@dsherry (Contributor) left a review:

@bchen1116 great!!! I left one blocking comment: we should use split_data to compute the ensembling split in the AutoMLSearch constructor. Otherwise we won't be using the correct split strategy for classification data, time series data, etc.

evalml/automl/automl_search.py (review thread resolved, outdated)
evalml/automl/automl_search.py (review thread resolved)
evalml/automl/automl_search.py (review thread resolved)
evalml/automl/engine/sequential_engine.py (review thread resolved)
evalml/automl/engine/sequential_engine.py (review thread resolved, outdated)
evalml/tests/automl_tests/test_sequential_engine.py (review thread resolved, outdated)
@bchen1116 requested a review from @dsherry on March 3, 2021, 16:31
@bchen1116 merged commit c499006 into main on Mar 3, 2021
@dsherry mentioned this pull request on Mar 11, 2021
@freddyaboulton deleted the bc_1746_cv branch on May 13, 2022, 15:35
Successfully merging this pull request may close: AutoML: use separate CV split for ensembling (#1746)
5 participants