Ensure all pipelines in AutoMLSearch receive the same data splits #1982

Closed
freddyaboulton opened this issue Mar 16, 2021 · 4 comments · Fixed by #2513
@freddyaboulton
Contributor

freddyaboulton commented Mar 16, 2021

Repro

import joblib
from evalml.demos import load_fraud
from evalml.preprocessing.data_splitters import BalancedClassificationDataCVSplit

splitter = BalancedClassificationDataCVSplit(n_splits=3, random_seed=0, shuffle=True)

X, y = load_fraud(5000)
X = X.to_dataframe()
y = y.to_series().astype("int")

for train, test in splitter.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('75f1b95d7ce307ac6c793055330969aa', '8c89fe1a592c50a700b6d5cbb02dba8b')
('f8c849bbfbed37c13f66c5c742e237cb', '9c4879fb550fded8be9ac03e95a1bf95')
('cdc21f0d6bbf45459c9695258f7f04dc', '5b575765bbe176e732b8eb4dc1bf2822')

for train, test in splitter.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('bf462b82af243c552ac48acad2dfd748', '8c89fe1a592c50a700b6d5cbb02dba8b')
('b8341b536c63c7957c099b05e315f49c', '9c4879fb550fded8be9ac03e95a1bf95')
('780e74673b601790037fc0b17dde56fe', '5b575765bbe176e732b8eb4dc1bf2822')

for train, test in splitter.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('385f6c538568ad3a33cf84f61d94144c', '8c89fe1a592c50a700b6d5cbb02dba8b')
('8db65d0a3bdf87ae0f135b9766a260dd', '9c4879fb550fded8be9ac03e95a1bf95')
('2a7293fc1308b8a572091d7c76d20205', '5b575765bbe176e732b8eb4dc1bf2822')

This is different from the behavior of the sklearn splitter:

from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=3, random_state=0, shuffle=True)

for train, test in kfold.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('6c30ee6a11803927024354405389506a', '8c89fe1a592c50a700b6d5cbb02dba8b')
('df0a70e2e6ca783f12461e8c82a26ad4', '9c4879fb550fded8be9ac03e95a1bf95')
('2898e4b3d3621b436641016499f4aafb', '5b575765bbe176e732b8eb4dc1bf2822')

for train, test in kfold.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('6c30ee6a11803927024354405389506a', '8c89fe1a592c50a700b6d5cbb02dba8b')
('df0a70e2e6ca783f12461e8c82a26ad4', '9c4879fb550fded8be9ac03e95a1bf95')
('2898e4b3d3621b436641016499f4aafb', '5b575765bbe176e732b8eb4dc1bf2822')

I think this is problematic for two reasons:

  1. Since BalancedClassificationDataCVSplit is the default splitter in automl, our pipelines are evaluated on different splits.
  2. Since split modifies the state of the data splitter, we'll have different results between the sequential and parallel engines.
@freddyaboulton freddyaboulton added the bug Issues tracking problems with existing features. label Mar 16, 2021
@dsherry
Contributor

dsherry commented Mar 18, 2021

Thanks for pointing this out.

Personally, this behavior doesn't bother me. As long as every time we initialize with a certain seed, we get the same sequence of output after that point, we're good. I'd be concerned if we were not respecting the random seed; but that's not what this issue tracks.
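
To spell out what I mean (a toy sketch, not EvalML code): re-seeding reproduces the exact same sequence of outputs, even though consecutive calls within one run differ.

import numpy as np

rng = np.random.default_rng(0)
first = [rng.permutation(10) for _ in range(3)]    # three different orderings within one "run"

rng = np.random.default_rng(0)                     # re-initialize with the same seed: a new "run"
second = [rng.permutation(10) for _ in range(3)]   # the exact same three orderings again

assert all((a == b).all() for a, b in zip(first, second))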

My recommendation: do nothing. As such, closing.

@freddyaboulton if you disagree about this behavior, let's duke it out, I mean talk 😅

@dsherry dsherry closed this as completed Mar 18, 2021
@freddyaboulton
Contributor Author

freddyaboulton commented Mar 18, 2021

@dsherry I think this is worth changing for two reasons:

  1. It introduces variance into automl search because different pipelines are evaluated on different data. This makes the rankings table slightly misleading because the scores are not computed on the same data.
  2. It's bad for parallel automl search.

Let me elaborate on 2. With the current behavior, the sequential engine is expected to modify the state of the data splitter throughout search. In parallel evalml, we pickle the data splitter and send it to workers to compute the split. Since the workers get a copy of the splitter, they don't modify the state of the original data splitter.

This introduces a difference in behavior between the sequential and parallel engines because the splits would not match, depending on the order in which the pipelines are evaluated! This means that the same pipeline/parameter combo would get different results in the sequential engine and the parallel engine, and I think that's undesirable.

In my opinion, point 1 is reason enough to fix this because all of our pipelines should be evaluated on the same data if we want to be able to compare them meaningfully. But as we move towards parallel evalml, I think it's important we make sure that modifying global state is not part of our expected behavior.
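
Here's a toy sketch of that divergence (illustration only, not the actual engine code), using copy.deepcopy to stand in for pickling the splitter out to a worker:

import copy

import numpy as np


class ToyStatefulSplitter:
    def __init__(self, random_seed=0):
        self._rng = np.random.default_rng(random_seed)

    def split(self, n_samples):
        # Mutates self._rng, so the result depends on how many times split()
        # has already been called on this particular object.
        return self._rng.permutation(n_samples)


# Sequential engine: one shared splitter whose state advances between pipelines.
sequential = ToyStatefulSplitter(random_seed=0)
seq_splits = [sequential.split(10) for _ in range(2)]

# Parallel engine: each worker gets a copy (standing in for pickling), so every
# pipeline sees the splitter in its freshly-initialized state.
parallel = ToyStatefulSplitter(random_seed=0)
par_splits = [copy.deepcopy(parallel).split(10) for _ in range(2)]

print((seq_splits[0] == par_splits[0]).all())  # True: the first pipeline matches
print((seq_splits[1] == par_splits[1]).all())  # (almost surely) False: the second
# pipeline sees different folds depending on which engine evaluated it.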

@dsherry dsherry reopened this Mar 19, 2021
@freddyaboulton
Contributor Author

freddyaboulton commented Mar 19, 2021

The plan moving forward:

  1. Fix this issue by modifying the BalancedClassificationDataCVSplit (a rough sketch of the general idea is below).
  2. Longer term we'd like to write tests that verify we don't feed different splits to different pipelines in automl search.
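
For reference, the general shape of the fix for item 1 (a hedged sketch, not the actual BalancedClassificationDataCVSplit change): derive a fresh RNG from the stored seed inside split instead of consuming shared state between calls.

import numpy as np


class ToySplitter:
    def __init__(self, random_seed=0):
        self.random_seed = random_seed  # store the seed, not a live RNG

    def split(self, n_samples):
        # A fresh RNG per call, so every call to split() returns identical folds.
        rng = np.random.default_rng(self.random_seed)
        return rng.permutation(n_samples)


splitter = ToySplitter(random_seed=0)
assert (splitter.split(10) == splitter.split(10)).all()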

Thanks for the discussion everyone!

@freddyaboulton
Contributor Author

freddyaboulton commented Jul 13, 2021

The immediate issue was solved when we refactored the samplers to be pipeline components. This issue now tracks adding test coverage that all pipelines in AutoMLSearch get the same data in every split!

There are some tests in #2210 that we may be able to leverage for that.
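
Roughly the kind of coverage I have in mind (a sketch only, not the actual tests from #2210, and using sklearn's StratifiedKFold as a stand-in for whichever splitter automl uses):

import joblib
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold


def test_splitter_returns_identical_folds_on_every_call():
    X, y = make_classification(n_samples=500, random_state=0)
    splitter = StratifiedKFold(n_splits=3, random_state=0, shuffle=True)

    # Hash the index arrays from two passes over the splitter; if they match,
    # every pipeline evaluated with this splitter sees the same folds.
    first = [(joblib.hash(tr), joblib.hash(te)) for tr, te in splitter.split(X, y)]
    second = [(joblib.hash(tr), joblib.hash(te)) for tr, te in splitter.split(X, y)]

    assert first == second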

@freddyaboulton freddyaboulton changed the title BalancedClassificationDataCVSplit produces different splits each time it's called Ensure all pipelines in AutoMLSearch receive the same data splits Jul 13, 2021
@freddyaboulton freddyaboulton added testing Issues related to testing. and removed bug Issues tracking problems with existing features. labels Jul 13, 2021
@freddyaboulton freddyaboulton self-assigned this Jul 14, 2021