Ensure all pipelines in AutoMLSearch receive the same data splits #1982
Comments
Thanks for pointing this out. Personally, this behavior doesn't bother me. As long as every time we initialize with a certain seed, we get the same sequence of output after that point, we're good. I'd be concerned if we were not respecting the random seed, but that's not what this issue tracks. My recommendation: do nothing. As such, closing. @freddyaboulton if you disagree about this behavior, let's duke it out, I mean talk 😅
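To make the determinism point concrete, here's a minimal sketch. The `StatefulSplitter` class is hypothetical (not evalml's actual splitter): it advances its RNG on every `split` call, mimicking the behavior under discussion. Two splitters initialized with the same seed still produce the same *sequence* of splits, which is the property described above.

```python
import numpy as np

class StatefulSplitter:
    """Hypothetical splitter that consumes RNG state on each split() call."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def split(self, n):
        # Each call advances self.rng, so consecutive calls on the
        # same splitter return different splits.
        perm = self.rng.permutation(n)
        return perm[: n // 2], perm[n // 2 :]

a = StatefulSplitter(seed=42)
b = StatefulSplitter(seed=42)
# Same seed -> the sequence of splits matches across runs,
# even though consecutive calls on one splitter differ from each other.
seq_a = [a.split(10) for _ in range(3)]
seq_b = [b.split(10) for _ in range(3)]
assert all(np.array_equal(x[0], y[0]) for x, y in zip(seq_a, seq_b))
```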
@dsherry I think this is worth changing for two reasons:

1. All of our pipelines should be evaluated on the same data splits if we want to compare them meaningfully.
2. Because `split` currently modifies the state of the data splitter, the sequential and parallel engines can produce different results for the same pipeline.
Let me elaborate on 2. With the current behavior, the sequential engine is expected to modify the state of the data splitter throughout search. In parallel evalml, we pickle the data splitter and send it to workers to compute the split. Since the workers get a copy of the splitter, they don't modify the state of the original data splitter. This introduces a difference in behavior between the sequential and parallel engines, because the splits a pipeline receives would depend on the order in which pipelines are evaluated. This means that the same pipeline/parameter combo would get different results in the sequential engine and parallel engine, and I think that's undesirable. In my opinion, point 1 is reason enough to fix this, because all of our pipelines should be evaluated on the same data if we want to be able to compare them meaningfully. But as we move towards parallel evalml, I think it's important we make sure that modifying global state is not part of our expected behavior.
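Here's a small sketch of the divergence described above, using a hypothetical `StatefulSplitter` (not evalml's actual class) whose `split` call consumes RNG state. In the sequential engine, the second pipeline sees an advanced RNG; in the parallel engine, each worker unpickles a copy of the splitter in its initial state, so both pipelines see the same split.

```python
import pickle
import numpy as np

class StatefulSplitter:
    """Hypothetical splitter whose split() call mutates internal RNG state."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def split(self, n):
        perm = self.rng.permutation(n)
        return perm[: n // 2], perm[n // 2 :]

# Sequential engine: one shared splitter object; its state advances
# with every pipeline evaluation.
seq = StatefulSplitter(seed=0)
seq_split_1 = seq.split(10)
seq_split_2 = seq.split(10)  # RNG state has advanced; differs from split 1

# Parallel engine: each worker receives a pickled copy of the splitter,
# so every pipeline computes its split from the same initial state.
par = StatefulSplitter(seed=0)
worker_1 = pickle.loads(pickle.dumps(par))
worker_2 = pickle.loads(pickle.dumps(par))
par_split_1 = worker_1.split(10)
par_split_2 = worker_2.split(10)

# Pipeline 2 gets different data depending on the engine:
assert np.array_equal(par_split_1[0], par_split_2[0])      # parallel: identical
assert not np.array_equal(seq_split_1[0], seq_split_2[0])  # sequential: different
```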
The plan moving forward:
Thanks for the discussion everyone!
The immediate issue was solved when we refactored the samplers to be pipeline components. This issue now tracks adding test coverage that all pipelines in AutoMLSearch get the same data in every split! There are some tests in #2210 that we may be able to leverage for that.
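One way the test coverage mentioned above could look (a sketch, not the actual tests from the linked PR; sklearn's `KFold` stands in for the splitter AutoMLSearch uses): call the splitter once per simulated pipeline and assert every pipeline receives identical folds.

```python
import numpy as np
from sklearn.model_selection import KFold

def get_splits(splitter, X, y):
    """Materialize all (train, test) index arrays from a splitter."""
    return [(train.copy(), test.copy()) for train, test in splitter.split(X, y)]

def test_all_pipelines_get_same_splits():
    X = np.arange(40).reshape(20, 2)
    y = np.arange(20) % 2
    splitter = KFold(n_splits=3, shuffle=True, random_state=0)
    # Simulate evaluating several pipelines in sequence with one splitter:
    per_pipeline_splits = [get_splits(splitter, X, y) for _ in range(3)]
    first = per_pipeline_splits[0]
    for other in per_pipeline_splits[1:]:
        for (tr1, te1), (tr2, te2) in zip(first, other):
            assert np.array_equal(tr1, tr2) and np.array_equal(te1, te2)

test_all_pipelines_get_same_splits()
```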
Repro
This is different from the behavior of the sklearn splitter:
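For reference, sklearn splitters like `KFold` are stateless across calls: `split` re-derives the shuffle from `random_state` every time, so repeated calls on the same object return identical folds.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)
kf = KFold(n_splits=2, shuffle=True, random_state=13)

first = [test for _, test in kf.split(X)]
second = [test for _, test in kf.split(X)]
# KFold does not mutate internal state between calls, so the
# folds from both calls are identical:
assert all(np.array_equal(a, b) for a, b in zip(first, second))
```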
I think this is problematic for two reasons:

1. Pipelines evaluated on different splits can't be compared meaningfully.
2. Because `split` modifies the state of the data splitter, we'll have different results between the sequential and parallel engines.