Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEA Add writeable parameter to check_array #29018

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

jeremiedbb
Copy link
Member

@jeremiedbb jeremiedbb commented May 14, 2024

closes #28824
closes #14481
Fixes #28899

This PR proposes to add a writeable parameter to check_array. It acts as a toggle: it can be True meaning it's set or None (unset). If True, check_array will make sure that the returned array is writeable (which may require a copy). If None, the writeability of the array is left untouched.

It has the same status as the dtype or order parameters. They define desired properties of the output array. Sometimes they can only be applied by making a copy, even if copy=False.

Writeable arrays are required in estimators that can perform inplace operations. These estimators have a copy or copy_X parameter and currently they raise an error if copy=False and X is read-only. This behavior seems expected of rthe basic use case where the user has full control over the input array of the estimator. But in a complex pipeline, in can happen that an array is created in read-only mode (e.g. from joblib's auto memmapping) at an intermediate step which triggers an (unexpected to me) error, the last one being #28781.

This PR also presents an alternative to #28348, which isn't safe because changing the writeable flag of an array is not possible if the array doesn't own its data. And it happens even if the array is aleardy writeable, just trying to set the flag is not allowed. That's what happens in #28899.

I added a common test, which for now fails as expected for most estimators because I haven't applied the writeable param to all inplace estimators, so it shows the current behavior. A few of them still pass:

  • AffinityPropagation: already applied writeable in this PR
  • FactorAnalysis: already applied writeable in this PR
  • SimpleImputer: already applied writeable in this PR
  • Birch: doesn't perform inplace operations so the param copy should be deprecated (Deprecate copy in Birch #29092)
  • TheilSenRegressor: doesn't perform inplace operations so the copy param should be deprecated (Deprecate copy_X in TheilSenRegressor #29098)
  • KernelPCA: seems to always make a copy even if copy=False so needs investigation (Fix Make centering inplace in KernelPCA #29100)
  • OrthogonalMatchingPursuitCV: seems to always make a copy even if copy=False so needs investigation
    (I found that the only way to modify the input data is to pass a custom splitter that returns slices instead of arrays of indices which goes against our splitter doc, see doc, so we could deprecate the copy param or leave it as is because we don't want to put effort on this estimator and don't want to break user code)

cc/ @thomasjpfan Here's a proposed implementation of what we're discussing in #28824 (comment)

Copy link

github-actions bot commented May 14, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 08cbaee. Link to the linter CI: here

@glemaitre glemaitre self-assigned this May 16, 2024
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although we are adding a new parameter, this feels like a big enough bug fix to be in 1.5.1.

sklearn/tests/test_common.py Outdated Show resolved Hide resolved
_set_checking_parameters(estimator)

# The following estimators can work inplace only with certain settings
if estimator.__class__.__name__ == "HDBSCAN":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we are in our testing code sklearn/tests/test_common.py can we use isinstance here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the class name because it's what we use in _set_checking_parameters(estimator). Otherwise we have to import all all estimators for which we want to set a specific param which feels a bit cumbersome.

@glemaitre glemaitre removed their assignment May 20, 2024
@glemaitre glemaitre self-requested a review May 20, 2024 13:35
@jeremiedbb
Copy link
Member Author

I modified the behavior a bit. Previously I tried to change the writeability without copy and only copy if it failed. I felt that it was not a healthy behavior for the user. If a user calls StandardScaler(copy=False).fit_transform(X) where X is read-only, it's an ambiguous situation and it should not be scikit-learn's role to decide which path to chose. So now we always make a copy in that situation. There's one exception that comes from #28348, where it's a intermediate array that is created in read-only mode.

Arguably, we could raise an error in the ambiguous copy=False + read-only situation, but I think it's better to not raise and make estimators work in that case because intermediate arrays in complex pipelines can be read-only due to joblib's auto-memmapping (last time it happened was #28781). There are always workarounds but with this PR, we should not have to worry about this issue in the future.

@jeremiedbb
Copy link
Member Author

jeremiedbb commented May 23, 2024

I also found a use case that is currently failing unexpectedly in main (due to #28348) and works with this PR:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

X, y = make_regression()
X.flags.writeable = False
df = pd.DataFrame(X, copy=False)  # dataframe backed by a read-only array

LinearRegression().fit(df, y)
# fails, although LinearRegression doesn't even want to do any inplace operation

@jeremiedbb
Copy link
Member Author

When we're happy with the behavior, I'll set the new writeable param to all estimators that want to perform inplace operations.

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with the proposed solution. This feels like a big enough bug fix to be in 1.5.1, but it adds a new parameter, where we usually wait for 1.6.

@jeremiedbb
Copy link
Member Author

jeremiedbb commented May 24, 2024

Alright, I'm going to add the new kwarg in estimators. In the mean time, I opened issues and PRs for the estimators with an unexpected behavior regarding the copy param and update the PR description with corresponding links.

This feels like a big enough bug fix to be in 1.5.1, but it adds a new parameter, where we usually wait for 1.6.

I can propose an easy quick fix for #28899 in a separate PR that we'd include in 1.5.1, and then the rest of this PR would fit in 1.6.

Edited: quick fix for #28899 in #29103.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants