Ensure pipelines receive an identical set of CV or TV splits #2034
Conversation
@@ -77,6 +76,7 @@ def fit_resample(self, X, y):
         Returns:
             list: Indices to keep for training data
         """
+        random_state = np.random.RandomState(self.random_seed)
The fix itself is just to recreate the random number generator used by the undersampling on each invocation.
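For illustration, here is a minimal sketch (not the evalml source) of why recreating the generator matters: a RandomState stored once is mutated by every draw, so consecutive invocations diverge, while rebuilding it from the seed makes each invocation deterministic.

```python
import numpy as np

seed = 0

# A RandomState held across calls advances its internal state, so the
# second batch of sampled indices (almost surely) differs from the first.
shared = np.random.RandomState(seed)
first = shared.choice(100, size=5, replace=False)
second = shared.choice(100, size=5, replace=False)
assert not np.array_equal(first, second)

# Recreating the generator from the seed on each invocation, as the fix
# does inside fit_resample, makes every call return identical indices.
def sample_indices(seed):
    random_state = np.random.RandomState(seed)
    return random_state.choice(100, size=5, replace=False)

assert np.array_equal(sample_indices(seed), sample_indices(seed))
```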
Will this need to propagate to #2079 ?
I don't think the oversamplers suffer from this problem but it might be good to add coverage
assert joblib_hash(pipeline0_training_X.to_dataframe()) == joblib_hash(pipeline1_training_X.to_dataframe())
assert joblib_hash(pipeline0_training_y.to_series()) == joblib_hash(pipeline1_training_y.to_series())
assert joblib_hash(pipeline0_validation_X.to_dataframe()) == joblib_hash(pipeline1_validation_X.to_dataframe())
assert joblib_hash(pipeline0_validation_y.to_series()) == joblib_hash(pipeline1_validation_y.to_series())
This is a simple version of the great reproducer @freddyaboulton included in #1982 !
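As context for the asserts above — assuming joblib_hash is an alias for joblib.hash (an assumption on my part), it produces a content-based digest, so two structurally equal DataFrames compare equal even when they are distinct objects:

```python
import pandas as pd
from joblib import hash as joblib_hash  # assumed alias for joblib.hash

df_a = pd.DataFrame({"feature": [1.0, 2.0, 3.0]})
df_b = pd.DataFrame({"feature": [1.0, 2.0, 3.0]})

# Distinct objects, identical contents: the content hashes match, which is
# exactly the property the reproducer uses to compare each pipeline's splits.
assert df_a is not df_b
assert joblib_hash(df_a) == joblib_hash(df_b)
```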
@@ -37,15 +37,15 @@ def test_data_splitter_nsplits(splitter):


 @pytest.mark.parametrize("value", [np.nan, "hello"])
-@pytest.mark.parametrize("splitter",
+@pytest.mark.parametrize("splitter_cls",
Refactored name for clarity
Codecov Report
@@             Coverage Diff              @@
##    bc_2099_remove    #2034      +/-   ##
============================================
- Coverage       100.0%    99.8%    -0.1%
============================================
  Files             288      289       +1
  Lines           24382    24487     +105
============================================
+ Hits            24358    24420      +62
- Misses             24       67      +43
Force-pushed from abf7be3 to a3a9f39
Oops, so this change is fine... but I am just realizing #2099 makes this obsolete and that the change actually needs to happen in the undersampler component 🤦
@pytest.mark.parametrize("splitter_cls", [BalancedClassificationDataTVSplit, BalancedClassificationDataCVSplit]) | ||
def test_data_splitter_multirun(splitter_cls, X_y_binary, X_y_multi): |
@dsherry I think we also need to test on imbalanced datasets to make sure your fix works as intended. If the dataset is balanced, no random sampling will happen so your change in logic won't actually be used.
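A hypothetical sketch of that coverage — the import path, the random_seed parameter, the fixture shape, and the assumption that split yields train/test index arrays are all illustrative guesses, not the final test:

```python
import numpy as np
import pytest

# Import path is an assumption about where these splitters live in evalml.
from evalml.preprocessing.data_splitters import (BalancedClassificationDataCVSplit,
                                                 BalancedClassificationDataTVSplit)


@pytest.mark.parametrize("splitter_cls", [BalancedClassificationDataTVSplit, BalancedClassificationDataCVSplit])
def test_data_splitter_imbalanced_multirun(splitter_cls):
    rng = np.random.RandomState(0)
    X = rng.rand(1000, 5)
    # 95/5 imbalance, so the undersampler's random sampling actually fires
    # (on a balanced dataset the fixed code path would never be exercised).
    y = np.array([0] * 950 + [1] * 50)

    splitter = splitter_cls(random_seed=0)
    first = [(train, test) for train, test in splitter.split(X, y)]
    second = [(train, test) for train, test in splitter.split(X, y)]
    for (train_a, test_a), (train_b, test_b) in zip(first, second):
        assert np.array_equal(train_a, train_b)
        assert np.array_equal(test_a, test_b)
```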
I think pending the addition of an imbalanced test run per @freddyaboulton's suggestion, this looks solid.
Agreed with Freddy that a test for imbalanced data should be used!
Closing because we actually need to make this change in the sampler components
Force-pushed from a3a9f39 to 5c2f2f4
@dsherry Thanks for the fix! This looks great but I think we should add coverage for imbalanced datasets in test_classification_balanced_multirun and test_data_splitter_gives_pipelines_same_data. The original issue only happens with imbalanced datasets.
…non-fit transformation/prediction.
Force-pushed from bd9f6d0 to 4515fda
Closing in favor of #2210
Fixes #1982
Note the base branch is bc_2099_remove -- #2193 needs to get merged first, just so that I don't have to add data splitter tests in this PR only for us to delete them 😁 Once that's merged I'll reset base to main.