Ensure pipelines receive an identical set of CV or TV splits #2034
Conversation
@@ -77,6 +76,7 @@ def fit_resample(self, X, y):
         Returns:
             list: Indices to keep for training data
         """
+        random_state = np.random.RandomState(self.random_seed)
The fix itself is just to recreate the random number generator used by the undersampling on each invocation.
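For illustration, here is a minimal sketch (not the evalml source) of why recreating the generator matters: a RandomState stored once is mutated by every draw, so consecutive invocations diverge, while rebuilding it from the seed makes each invocation deterministic.

```python
import numpy as np

seed = 0

# A RandomState held across calls advances its internal state, so the
# second batch of sampled indices (almost surely) differs from the first.
shared = np.random.RandomState(seed)
first = shared.choice(100, size=5, replace=False)
second = shared.choice(100, size=5, replace=False)
assert not np.array_equal(first, second)

# Recreating the generator from the seed on each invocation, as the fix
# does inside fit_resample, makes every call return identical indices.
def sample_indices(seed):
    random_state = np.random.RandomState(seed)
    return random_state.choice(100, size=5, replace=False)

assert np.array_equal(sample_indices(seed), sample_indices(seed))
```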
Will this need to propagate to #2079 ?
I don't think the oversamplers suffer from this problem but it might be good to add coverage
assert joblib_hash(pipeline0_training_X.to_dataframe()) == joblib_hash(pipeline1_training_X.to_dataframe())
assert joblib_hash(pipeline0_training_y.to_series()) == joblib_hash(pipeline1_training_y.to_series())
assert joblib_hash(pipeline0_validation_X.to_dataframe()) == joblib_hash(pipeline1_validation_X.to_dataframe())
assert joblib_hash(pipeline0_validation_y.to_series()) == joblib_hash(pipeline1_validation_y.to_series())
This is a simple version of the great reproducer @freddyaboulton included in #1982 !
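As context for the asserts above — assuming joblib_hash is an alias for joblib.hash (an assumption on my part), it produces a content-based digest, so two structurally equal DataFrames compare equal even when they are distinct objects:

```python
import pandas as pd
from joblib import hash as joblib_hash  # assumed alias for joblib.hash

df_a = pd.DataFrame({"feature": [1.0, 2.0, 3.0]})
df_b = pd.DataFrame({"feature": [1.0, 2.0, 3.0]})

# Distinct objects, identical contents: the content hashes match, which is
# exactly the property the reproducer uses to compare each pipeline's splits.
assert df_a is not df_b
assert joblib_hash(df_a) == joblib_hash(df_b)
```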
@@ -37,15 +37,15 @@ def test_data_splitter_nsplits(splitter):


 @pytest.mark.parametrize("value", [np.nan, "hello"])
-@pytest.mark.parametrize("splitter",
+@pytest.mark.parametrize("splitter_cls",
Refactored name for clarity
Codecov Report
@@             Coverage Diff              @@
##    bc_2099_remove    #2034      +/-   ##
============================================
- Coverage       100.0%    99.8%    -0.1%
============================================
  Files             288      289       +1
  Lines           24382    24487     +105
============================================
+ Hits            24358    24420      +62
- Misses             24       67      +43
Force-pushed from abf7be3 to a3a9f39
Oops, so this change is fine... but I am just realizing #2099 makes this obsolete and that the change actually needs to happen in the undersampler component 🤦
@pytest.mark.parametrize("splitter_cls", [BalancedClassificationDataTVSplit, BalancedClassificationDataCVSplit]) | ||
def test_data_splitter_multirun(splitter_cls, X_y_binary, X_y_multi): |
@dsherry I think we also need to test on imbalanced datasets to make sure your fix works as intended. If the dataset is balanced, no random sampling will happen so your change in logic won't actually be used.
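A hypothetical sketch of that coverage — the import path, the random_seed parameter, the fixture shape, and the assumption that split yields train/test index arrays are all illustrative guesses, not the final test:

```python
import numpy as np
import pytest

# Import path is an assumption about where these splitters live in evalml.
from evalml.preprocessing.data_splitters import (BalancedClassificationDataCVSplit,
                                                 BalancedClassificationDataTVSplit)


@pytest.mark.parametrize("splitter_cls", [BalancedClassificationDataTVSplit, BalancedClassificationDataCVSplit])
def test_data_splitter_imbalanced_multirun(splitter_cls):
    rng = np.random.RandomState(0)
    X = rng.rand(1000, 5)
    # 95/5 imbalance, so the undersampler's random sampling actually fires
    # (on a balanced dataset the fixed code path would never be exercised).
    y = np.array([0] * 950 + [1] * 50)

    splitter = splitter_cls(random_seed=0)
    first = [(train, test) for train, test in splitter.split(X, y)]
    second = [(train, test) for train, test in splitter.split(X, y)]
    for (train_a, test_a), (train_b, test_b) in zip(first, second):
        assert np.array_equal(train_a, train_b)
        assert np.array_equal(test_a, test_b)
```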
I think pending the addition of an imbalanced test run per @freddyaboulton's suggestion, this looks solid.
Agreed with Freddy that a test for imbalanced data should be used!
Closing because we actually need to make this change in the sampler components
Force-pushed from a3a9f39 to 5c2f2f4
@dsherry Thanks for the fix! This looks great but I think we should add coverage for imbalanced datasets in test_classification_balanced_multirun and test_data_splitter_gives_pipelines_same_data. The original issue only happens with imbalanced datasets.
…non-fit transformation/prediction.
Force-pushed from bd9f6d0 to 4515fda
Closing in favor of #2210
Fixes #1982
Note the base branch is bc_2099_remove -- #2193 needs to get merged first, just so that I don't have to add data splitter tests in this PR only for us to delete them 😁 Once that's merged I'll reset base to main.