Ensure all pipelines in AutoMLSearch receive the same data splits #1982

Closed
freddyaboulton opened this issue Mar 16, 2021 · 4 comments · Fixed by #2513
@freddyaboulton
Contributor

freddyaboulton commented Mar 16, 2021

Repro

import joblib
from evalml.demos import load_fraud
from evalml.preprocessing.data_splitters import BalancedClassificationDataCVSplit

splitter = BalancedClassificationDataCVSplit(n_splits=3, random_seed=0, shuffle=True)

X, y = load_fraud(5000)
X = X.to_dataframe()
y = y.to_series().astype("int")

for train, test in splitter.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('75f1b95d7ce307ac6c793055330969aa', '8c89fe1a592c50a700b6d5cbb02dba8b')
('f8c849bbfbed37c13f66c5c742e237cb', '9c4879fb550fded8be9ac03e95a1bf95')
('cdc21f0d6bbf45459c9695258f7f04dc', '5b575765bbe176e732b8eb4dc1bf2822')

for train, test in splitter.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('bf462b82af243c552ac48acad2dfd748', '8c89fe1a592c50a700b6d5cbb02dba8b')
('b8341b536c63c7957c099b05e315f49c', '9c4879fb550fded8be9ac03e95a1bf95')
('780e74673b601790037fc0b17dde56fe', '5b575765bbe176e732b8eb4dc1bf2822')

for train, test in splitter.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('385f6c538568ad3a33cf84f61d94144c', '8c89fe1a592c50a700b6d5cbb02dba8b')
('8db65d0a3bdf87ae0f135b9766a260dd', '9c4879fb550fded8be9ac03e95a1bf95')
('2a7293fc1308b8a572091d7c76d20205', '5b575765bbe176e732b8eb4dc1bf2822')

This is different from the behavior of the sklearn splitter:

from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=3, random_state=0, shuffle=True)

for train, test in kfold.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('6c30ee6a11803927024354405389506a', '8c89fe1a592c50a700b6d5cbb02dba8b')
('df0a70e2e6ca783f12461e8c82a26ad4', '9c4879fb550fded8be9ac03e95a1bf95')
('2898e4b3d3621b436641016499f4aafb', '5b575765bbe176e732b8eb4dc1bf2822')

for train, test in kfold.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('6c30ee6a11803927024354405389506a', '8c89fe1a592c50a700b6d5cbb02dba8b')
('df0a70e2e6ca783f12461e8c82a26ad4', '9c4879fb550fded8be9ac03e95a1bf95')
('2898e4b3d3621b436641016499f4aafb', '5b575765bbe176e732b8eb4dc1bf2822')

I think this is problematic for two reasons:

  1. Since BalancedClassificationDataCVSplit is the default splitter in automl, our pipelines are evaluated on different splits.
  2. Since split modifies the state of the data splitter, we'll have different results between the sequential and parallel engines.
@freddyaboulton freddyaboulton added the bug Issues tracking problems with existing features. label Mar 16, 2021
@dsherry
Contributor

dsherry commented Mar 18, 2021

Thanks for pointing this out.

Personally, this behavior doesn't bother me. As long as every time we initialize with a certain seed, we get the same sequence of output after that point, we're good. I'd be concerned if we were not respecting the random seed; but that's not what this issue tracks.
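
To spell out what I mean (a toy sketch, not EvalML code): re-seeding reproduces the exact same sequence of outputs, even though consecutive calls within one run differ.

import numpy as np

rng = np.random.default_rng(0)
first = [rng.permutation(10) for _ in range(3)]    # three different orderings within one "run"

rng = np.random.default_rng(0)                     # re-initialize with the same seed: a new "run"
second = [rng.permutation(10) for _ in range(3)]   # the exact same three orderings again

assert all((a == b).all() for a, b in zip(first, second))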

My recommendation: do nothing. As such, closing.

@freddyaboulton if you disagree about this behavior, let's duke it out, I mean talk 😅

@dsherry dsherry closed this as completed Mar 18, 2021
@freddyaboulton
Contributor Author

freddyaboulton commented Mar 18, 2021

@dsherry I think this is worth changing for two reasons:

  1. It introduces variance into automl search because different pipelines are evaluated on different data. This makes the rankings table slightly misleading because the scores are not computed on the same data.
  2. It's bad for parallel automl search.

Let me elaborate on 2. With the current behavior, the sequential engine is expected to modify the state of the data splitter throughout search. In parallel evalml, we pickle the data splitter and send it to workers to compute the split. Since the workers get a copy of the splitter, they don't modify the state of the original data splitter.

This introduces a difference in behavior between the sequential and parallel engines because the splits would not match, depending on the order in which the pipelines are evaluated! This means that the same pipeline/parameter combo would get different results in the sequential engine and the parallel engine, and I think that's undesirable.

In my opinion, point 1 is reason enough to fix this because all of our pipelines should be evaluated on the same data if we want to be able to compare them meaningfully. But as we move towards parallel evalml, I think it's important we make sure that modifying global state is not part of our expected behavior.
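
Here's a toy sketch of that divergence (illustration only, not the actual engine code), using copy.deepcopy to stand in for pickling the splitter out to a worker:

import copy

import numpy as np


class ToyStatefulSplitter:
    def __init__(self, random_seed=0):
        self._rng = np.random.default_rng(random_seed)

    def split(self, n_samples):
        # Mutates self._rng, so the result depends on how many times split()
        # has already been called on this particular object.
        return self._rng.permutation(n_samples)


# Sequential engine: one shared splitter whose state advances between pipelines.
sequential = ToyStatefulSplitter(random_seed=0)
seq_splits = [sequential.split(10) for _ in range(2)]

# Parallel engine: each worker gets a copy (standing in for pickling), so every
# pipeline sees the splitter in its freshly-initialized state.
parallel = ToyStatefulSplitter(random_seed=0)
par_splits = [copy.deepcopy(parallel).split(10) for _ in range(2)]

print((seq_splits[0] == par_splits[0]).all())  # True: the first pipeline matches
print((seq_splits[1] == par_splits[1]).all())  # (almost surely) False: the second
# pipeline sees different folds depending on which engine evaluated it.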

@dsherry dsherry reopened this Mar 19, 2021
@freddyaboulton
Contributor Author

freddyaboulton commented Mar 19, 2021

The plan moving forward:

  1. Fix this issue by modifying the BalancedClassificationDataCVSplit (a rough sketch of the general idea is below).
  2. Longer term we'd like to write tests that verify we don't feed different splits to different pipelines in automl search.
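
For reference, the general shape of the fix for item 1 (a hedged sketch, not the actual BalancedClassificationDataCVSplit change): derive a fresh RNG from the stored seed inside split instead of consuming shared state between calls.

import numpy as np


class ToySplitter:
    def __init__(self, random_seed=0):
        self.random_seed = random_seed  # store the seed, not a live RNG

    def split(self, n_samples):
        # A fresh RNG per call, so every call to split() returns identical folds.
        rng = np.random.default_rng(self.random_seed)
        return rng.permutation(n_samples)


splitter = ToySplitter(random_seed=0)
assert (splitter.split(10) == splitter.split(10)).all()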

Thanks for the discussion everyone!

@freddyaboulton
Contributor Author

freddyaboulton commented Jul 13, 2021

The immediate issue was solved when we refactored the samplers to be pipeline components. This issue now tracks adding test coverage that all pipelines in AutoMLSearch get the same data in every split!

There are some tests in #2210 that we may be able to leverage for that.
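
Roughly the kind of coverage I have in mind (a sketch only, not the actual tests from #2210, and using sklearn's StratifiedKFold as a stand-in for whichever splitter automl uses):

import joblib
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold


def test_splitter_returns_identical_folds_on_every_call():
    X, y = make_classification(n_samples=500, random_state=0)
    splitter = StratifiedKFold(n_splits=3, random_state=0, shuffle=True)

    # Hash the index arrays from two passes over the splitter; if they match,
    # every pipeline evaluated with this splitter sees the same folds.
    first = [(joblib.hash(tr), joblib.hash(te)) for tr, te in splitter.split(X, y)]
    second = [(joblib.hash(tr), joblib.hash(te)) for tr, te in splitter.split(X, y)]

    assert first == second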

@freddyaboulton freddyaboulton changed the title BalancedClassificationDataCVSplit produces different splits each time it's called Ensure all pipelines in AutoMLSearch receive the same data splits Jul 13, 2021
@freddyaboulton freddyaboulton added testing Issues related to testing. and removed bug Issues tracking problems with existing features. labels Jul 13, 2021
@freddyaboulton freddyaboulton self-assigned this Jul 14, 2021