We have had a warning quietly being raised in our CI for a while now, like here, that says:
/home/runner/micromamba/envs/temp/lib/python3.11/site-packages/sklearn/base.py:376: InconsistentVersionWarning: Trying to unpickle estimator StandardScaler from version 1.4.0 when using version 1.4.1.post1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(
The way we save and load the StandardScaler object is explicitly 'unsupported and inadvisable' per the scikit-learn docs.
Path Forward
It's not immediately obvious what to do, though some options we have discussed online include:
ignoring this (tempting) and hiding it from the user, which could lead to issues with correctness or with running at all
allowing it to continue being raised, letting users decide whether they want to take the risk
elevating this to an error and preventing people from doing this
trying to fix it (options include manually saving the means and variances and redoing the scaling ourselves, retraining the scalers at inference time, etc.)
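For illustration, here is a minimal sketch of the last option, under the assumption that all we need from the scaler is standard (z-score) scaling; the array names and shapes here are made up:

```python
import numpy as np

# Sketch of the "fix it ourselves" option: persist only the raw statistics
# (plain, version-independent arrays) instead of pickling the scaler.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=3.0, scale=2.0, size=(100, 4))  # made-up train data

# At save time: store plain arrays, not the StandardScaler object.
stats = {"mean": np.nanmean(X_train, axis=0), "scale": np.nanstd(X_train, axis=0)}

# At load time: redo the scaling ourselves with those statistics.
def rescale(X, stats):
    return (X - stats["mean"]) / stats["scale"]

X_scaled = rescale(X_train, stats)
assert np.allclose(X_scaled.mean(axis=0), 0.0, atol=1e-8)
assert np.allclose(X_scaled.std(axis=0), 1.0, atol=1e-8)
```

The statistics could be stored in any format (npz, JSON, part of the checkpoint), since they are just two arrays per scaler.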
Please chime in with your thoughts on this, v2 developers and users.
the last option isn't too bad IMO. As-is, we "define" (air-quotes) the following scalers:
for each input: V, E, V_f, X_f
essentially a list of shape $(m, 4, \ast, 2)$, where $m$ is the number of inputs, $4$ is the number of input scalers, $\ast$ is the dimension of the respective input, and $2$ is for the means and variances
one for the output: a list of shape $(t, 2)$, where $t$ is the number of output tasks
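As a purely illustrative sketch of that bookkeeping (the per-input dimensions below are invented), the input statistics could be a nested list matching the $(m, 4, \ast, 2)$ shape:

```python
import numpy as np

# Hypothetical layout: for each of m inputs, 4 scalers (V, E, V_f, X_f),
# each holding a (dim, 2) array of per-dimension (mean, variance) pairs.
rng = np.random.default_rng(0)
m = 3  # number of inputs (made up)
dims = {"V": 5, "E": 7, "V_f": 5, "X_f": 2}  # per-scaler dimensions (made up)

input_stats = [
    [
        np.stack([rng.normal(size=d), rng.random(size=d)], axis=-1)  # (dim, 2)
        for d in dims.values()
    ]
    for _ in range(m)
]

assert len(input_stats) == m          # m inputs
assert len(input_stats[0]) == 4       # 4 scalers per input
assert input_stats[0][1].shape == (7, 2)  # (dim of E, mean/variance)
```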
AFAICT, the only reason we use the StandardScaler is the nice interface (it ignores nans and is object-oriented). We can easily just write our own version (like in v1) or hack it to do what we want:
import numpy as np
from sklearn.preprocessing import StandardScaler

n, p = 10, 4
X = np.random.rand(n, p)
scaler = StandardScaler()
scaler.mean_ = X.mean(0)
scaler.scale_ = X.std(0)
so all that would mean in practice is saving the train statistics and just setting the attributes of a new StandardScaler at test-time. Per #726, we can get output scaler serialization for free by making it "part" of the model, but figuring out where to place the input scaling info is a more open question.
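A hedged sketch of what saving and restoring those statistics could look like, assuming plain JSON as the storage format (the payload layout here is made up, not the project's actual scheme):

```python
import json

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on made-up training data.
rng = np.random.default_rng(42)
X_train = rng.normal(loc=5.0, scale=3.0, size=(50, 4))
fitted = StandardScaler().fit(X_train)

# Save time: dump plain lists; no pickle, so no version coupling.
payload = json.dumps({"mean": fitted.mean_.tolist(), "scale": fitted.scale_.tolist()})

# Load time: rebuild a fresh StandardScaler by setting its fitted attributes.
stats = json.loads(payload)
scaler = StandardScaler()
scaler.mean_ = np.asarray(stats["mean"])
scaler.scale_ = np.asarray(stats["scale"])
scaler.var_ = scaler.scale_ ** 2

# The reconstructed scaler transforms identically to the original.
X_new = rng.normal(loc=5.0, scale=3.0, size=(10, 4))
assert np.allclose(scaler.transform(X_new), fitted.transform(X_new))
```

This keeps the nice StandardScaler interface while sidestepping pickling the object across scikit-learn versions entirely.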
For now, I think it is fine to bump this to v2.1. It currently raises a warning, so at least users can decide for themselves in the meantime. Additionally, we see this in the CI because developers have different versions locally than the GitHub Actions runners use with a fresh install. I think it would be good to create the checkpoints with a fresh install once v2.0 is ready and the checkpoint files won't change anymore.