New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC Trigger a copy when copy=False and X is read-only #28824
Comments
|
I'm okay with Aside: I alway disliked the semantics of how |
I see a preference toward I see writeable as another property of the resulting array that we want to enforce, exactly like If I managed to convince you, great, otherwise I'm fine with |
I'm not sure what I think regarding third value vs new argument. It feels like
It seems useful to not raise an error in these cases. After thinking about this a bit more: what is the problem we are trying to solve? #13986 makes me think we are trying to avoid bugs ( |
This is exactly the main problem I want to solve: Stop having to worry that a read-only array being created at some point can cause an error. To me it should be covered by the fact that One way of ending up with a read-only array is with joblib's memmapping. It caused many issues, the last one being #28781. Thinking more about it, maybe there are 2 things to consider here:
In this regard, I think a new parameter is more appropriate. I would propose
|
Thanks for the question @betatim ! At first I was frustrated, feeling that I had already explained it precisely. But in fact it made me think deeper and find more precisely the underlying details of this issue. |
For my understanding: If we introduce |
Essentially in the class InplaceEstimator: # e.g StandardScaler (mostly transformers but not only)
def __init__(self, copy=True):
self.copy = copy
def fit(self, X, y):
# This estimator performs inplace operations, we need a writeable array
self._validate_data(X, y, copy=self.copy, writeable=True)
class NotInplaceEstimator: # e.g. LogisticRegression
def __init__(self):
pass
def fit(self, X, y):
# This estimator only reads the input data, we don't have to modify the
# permissions (so nothing to change w.r.t current code)
self._validate_data(X, y) |
Thank you for the examples. This makes more sense to me now. How does the default |
Highly related to #14481 and maybe a little bit to #13986.
My understanding of the
copy=False
parameter of estimators is "allow inplace modifications of X".When avoiding a copy is not possible (X doesn't have the right dtype or memory layout for instance), a copy is still triggered. I believe that X being read-only is a valid reason for still triggering a copy.
My main argument is that the user isn't always in control of the permissions of an input array within the whole pipeline. Especially when joblib parallelism is enabled, which may create read-only memmaps. We've have a bunch of issues because of that, the latest being #28781. And it's poorly tested because it requires big arrays which we try to avoid in the tests (although joblib 1.13 makes it easy to trigger with small arrays).
I wouldn't make
check_array(copy=False)
always trigger a copy when X is read-only because the semantic of thecopy
param ofcheck_array
is not the same as the one of estimators. We could introduce a new param in check_array, likecopy_if_readonly
?check_array(copy=False, copy_if_readonly=False)
check_array(copy=self.copy, copy_if_readonly=True)
It could also be a third option for copy in check_array: True, False, "if_readonly":
check_array(copy=False)
check_array(copy=self.copy or "if_readonly")
The text was updated successfully, but these errors were encountered: