
Ensure pipelines receive an identical set of CV or TV splits #2034

Closed
wants to merge 11 commits into bc_2099_remove from ds_1982_splitter_run_over_run

Conversation

@dsherry dsherry commented Mar 25, 2021

Fixes #1982

Note the base branch is bc_2099_remove -- #2193 needs to get merged first, just so that I don't have to add data splitter test coverage in this PR only for us to delete it later 😁 Once that's merged, I'll reset the base to main.

@@ -77,6 +76,7 @@ def fit_resample(self, X, y):
         Returns:
             list: Indices to keep for training data
         """
+        random_state = np.random.RandomState(self.random_seed)
dsherry (author):
The fix itself is just to recreate the random number generator used by the undersampling on each invocation.
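To make the failure mode concrete, here is a minimal sketch (not evalml's actual component; the class name and resampling logic are hypothetical) of why re-seeding per invocation matters:

import numpy as np

class UndersamplerSketch:
    """Hypothetical sampler illustrating the bug and the fix.

    If a RandomState were created once in __init__ and reused, every
    fit_resample call would advance the same generator, so the second
    pipeline evaluated would see different sampled indices than the
    first. Re-creating the generator from the stored seed makes every
    invocation return identical indices.
    """

    def __init__(self, random_seed=0):
        self.random_seed = random_seed

    def fit_resample(self, X, y):
        # The fix: a fresh, identically-seeded generator on each call.
        random_state = np.random.RandomState(self.random_seed)
        keep = random_state.choice(len(y), size=len(y) // 2, replace=False)
        return sorted(keep)

With this version, two consecutive fit_resample calls return the same indices, so every pipeline in a search sees the same training rows.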

chukarsten (Contributor):
Will this need to propagate to #2079?

freddyaboulton (Contributor):
I don't think the oversamplers suffer from this problem but it might be good to add coverage

assert joblib_hash(pipeline0_training_X.to_dataframe()) == joblib_hash(pipeline1_training_X.to_dataframe())
assert joblib_hash(pipeline0_training_y.to_series()) == joblib_hash(pipeline1_training_y.to_series())
assert joblib_hash(pipeline0_validation_X.to_dataframe()) == joblib_hash(pipeline1_validation_X.to_dataframe())
assert joblib_hash(pipeline0_validation_y.to_series()) == joblib_hash(pipeline1_validation_y.to_series())
dsherry (author):
This is a simple version of the great reproducer @freddyaboulton included in #1982 !
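For readers outside the codebase: joblib_hash above is presumably joblib.hash, which content-hashes arbitrary Python objects, and to_dataframe()/to_series() convert the Woodwork structures evalml used at the time back to pandas. A tiny illustration of the hashing idea:

from joblib import hash as joblib_hash
import pandas as pd

# Equal content yields equal hashes, so comparing hashes is a cheap way
# to assert two pipelines received byte-for-byte identical splits.
X0 = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})
X1 = X0.copy()
assert joblib_hash(X0) == joblib_hash(X1)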

@@ -37,15 +37,15 @@ def test_data_splitter_nsplits(splitter):


 @pytest.mark.parametrize("value", [np.nan, "hello"])
-@pytest.mark.parametrize("splitter",
+@pytest.mark.parametrize("splitter_cls",
dsherry (author):
Refactored name for clarity

codecov bot commented Mar 25, 2021

Codecov Report

Merging #2034 (bd9f6d0) into bc_2099_remove (e59d795) will decrease coverage by 0.2%.
The diff coverage is 100.0%.

❗ Current head bd9f6d0 differs from pull request most recent head 4515fda. Consider uploading reports for the commit 4515fda to get more accurate results

@@               Coverage Diff                @@
##           bc_2099_remove   #2034     +/-   ##
================================================
- Coverage           100.0%   99.8%   -0.1%     
================================================
  Files                 288     289      +1     
  Lines               24382   24487    +105     
================================================
+ Hits                24358   24420     +62     
- Misses                 24      67     +43     
Impacted Files                                          Coverage Δ
...s/components/transformers/samplers/base_sampler.py  100.0% <100.0%> (ø)
.../data_splitters/balanced_classification_sampler.py  100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py                 99.7% <100.0%> (+0.1%) ⬆️
...sing_tests/test_balanced_classification_sampler.py  100.0% <100.0%> (ø)
...lml/preprocessing/data_splitters/base_splitters.py   25.9% <0.0%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e59d795...4515fda.

CLAassistant commented Mar 26, 2021

CLA assistant check
All committers have signed the CLA.

@dsherry dsherry force-pushed the ds_1982_splitter_run_over_run branch from abf7be3 to a3a9f39 Compare April 6, 2021 19:22
@dsherry dsherry marked this pull request as ready for review April 6, 2021 19:22
dsherry commented Apr 6, 2021

Oops, so this change is fine... but I am just realizing #2099 makes this obsolete and that the change actually needs to happen in the undersampler component 🤦



@pytest.mark.parametrize("splitter_cls", [BalancedClassificationDataTVSplit, BalancedClassificationDataCVSplit])
def test_data_splitter_multirun(splitter_cls, X_y_binary, X_y_multi):
freddyaboulton (Contributor):
@dsherry I think we also need to test on imbalanced datasets to make sure your fix works as intended. If the dataset is balanced, no random sampling will happen so your change in logic won't actually be used.
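A sketch of the suggested coverage, assuming the splitters follow the scikit-learn split(X, y) protocol and take a random_seed constructor argument (both are assumptions; the helper names here are illustrative):

import numpy as np
import pandas as pd

def make_imbalanced(n=1000, minority_frac=0.05, seed=0):
    # Roughly 5% positives: skewed enough to trigger the undersampling path.
    rng = np.random.RandomState(seed)
    X = pd.DataFrame({"feature": rng.normal(size=n)})
    y = pd.Series(rng.binomial(1, minority_frac, size=n))
    return X, y

def check_splitter_deterministic(splitter):
    # Two passes over the same splitter instance must yield identical indices.
    X, y = make_imbalanced()
    first = [(tr.copy(), te.copy()) for tr, te in splitter.split(X, y)]
    second = [(tr.copy(), te.copy()) for tr, te in splitter.split(X, y)]
    for (tr0, te0), (tr1, te1) in zip(first, second):
        np.testing.assert_array_equal(tr0, tr1)
        np.testing.assert_array_equal(te0, te1)

# e.g. check_splitter_deterministic(BalancedClassificationDataCVSplit(random_seed=0))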

@chukarsten chukarsten left a comment

I think pending the addition of an imbalanced test run per @freddyaboulton's suggestion, this looks solid.


@bchen1116 bchen1116 left a comment
Agreed with Freddy that a test for imbalanced data should be added!

dsherry commented Apr 29, 2021

Closing because we actually need to make this change in the sampler components

@dsherry dsherry closed this Apr 29, 2021
@dsherry dsherry reopened this Apr 30, 2021
@dsherry dsherry force-pushed the ds_1982_splitter_run_over_run branch from a3a9f39 to 5c2f2f4 Compare April 30, 2021 04:13
@dsherry dsherry changed the base branch from main to bc_2099_remove April 30, 2021 04:13
@freddyaboulton freddyaboulton left a comment
@dsherry Thanks for the fix! This looks great but I think we should add coverage for imbalanced datasets in test_classification_balanced_multirun and test_data_splitter_gives_pipelines_same_data. The original issue only happens with imbalanced datasets.

Commits added:
Basic unit test coverage
Test updates
Got splitter/sampler tests passing
Add automl test
Changelog
Py3.7 mock handling
Revert changes to balanced classification data splitter
@dsherry dsherry force-pushed the ds_1982_splitter_run_over_run branch from bd9f6d0 to 4515fda Compare May 3, 2021 15:24
dsherry commented May 21, 2021

Closing in favor of #2210

@dsherry dsherry closed this May 21, 2021
@freddyaboulton freddyaboulton deleted the ds_1982_splitter_run_over_run branch May 13, 2022 15:34
Development

Successfully merging this pull request may close these issues:

Ensure all pipelines in AutoMLSearch receive the same data splits

5 participants