We have had a warning quietly being raised in our CI for a while now, like here, that says:
/home/runner/micromamba/envs/temp/lib/python3.11/site-packages/sklearn/base.py:376: InconsistentVersionWarning: Trying to unpickle estimator StandardScaler from version 1.4.0 when using version 1.4.1.post1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(
The way we save and load the StandardScaler object is explicitly 'unsupported and inadvisable' per the scikit-learn docs.
Path Forward
It's not immediately obvious what to do, though some options we have discussed online include:
ignoring this (tempting) and hiding it from the user, which could lead to issues with correctness or with running at all
allowing it to continue being raised, letting users decide whether they want to take the risk
elevating this to an error and preventing people from doing this
trying to fix it (options include manually saving the means and variances and redoing the scaling ourselves, retraining the scalers at inference time, etc.)
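For illustration, here is a minimal sketch of the last option, under the assumption that all we need from the scaler is standard (z-score) scaling; the array names and shapes here are made up:

```python
import numpy as np

# Sketch of the "fix it ourselves" option: persist only the raw statistics
# (plain, version-independent arrays) instead of pickling the scaler.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=3.0, scale=2.0, size=(100, 4))  # made-up train data

# At save time: store plain arrays, not the StandardScaler object.
stats = {"mean": np.nanmean(X_train, axis=0), "scale": np.nanstd(X_train, axis=0)}

# At load time: redo the scaling ourselves with those statistics.
def rescale(X, stats):
    return (X - stats["mean"]) / stats["scale"]

X_scaled = rescale(X_train, stats)
assert np.allclose(X_scaled.mean(axis=0), 0.0, atol=1e-8)
assert np.allclose(X_scaled.std(axis=0), 1.0, atol=1e-8)
```

The statistics could be stored in any format (npz, JSON, part of the checkpoint), since they are just two arrays per scaler.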
Please chime in with your thoughts on this, v2 developers and users.
the last option isn't too bad IMO. As-is, we "define" (air-quotes) the following scalers:
for each input: V, E, V_f, X_f
essentially a list of shape $(m, 4, \ast, 2)$, where $m$ is the number of inputs, $4$ is the number of input scalers, $\ast$ is the dimension of the respective input, and $2$ is for the means and variances
one for the output: a list of shape $(t, 2)$, where $t$ is the number of output tasks
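As a purely illustrative sketch of that bookkeeping (the per-input dimensions below are invented), the input statistics could be a nested list matching the $(m, 4, \ast, 2)$ shape:

```python
import numpy as np

# Hypothetical layout: for each of m inputs, 4 scalers (V, E, V_f, X_f),
# each holding a (dim, 2) array of per-dimension (mean, variance) pairs.
rng = np.random.default_rng(0)
m = 3  # number of inputs (made up)
dims = {"V": 5, "E": 7, "V_f": 5, "X_f": 2}  # per-scaler dimensions (made up)

input_stats = [
    [
        np.stack([rng.normal(size=d), rng.random(size=d)], axis=-1)  # (dim, 2)
        for d in dims.values()
    ]
    for _ in range(m)
]

assert len(input_stats) == m          # m inputs
assert len(input_stats[0]) == 4       # 4 scalers per input
assert input_stats[0][1].shape == (7, 2)  # (dim of E, mean/variance)
```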
AFAICT, the only reason we use the StandardScaler is the nice interface (it ignores nans and is object-oriented). We can easily just write our own version (like in v1) or hack it to do what we want:
import numpy as np
from sklearn.preprocessing import StandardScaler

n, p = 10, 4
X = np.random.rand(n, p)
scaler = StandardScaler()
scaler.mean_ = X.mean(0)
scaler.scale_ = X.std(0)
so all that would mean in practice is saving the train statistics and just setting the attributes of a new StandardScaler at test-time. Per #726, we can get output scaler serialization for free by making it "part" of the model, but figuring out where to place the input scaling info is a more open question.
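A hedged sketch of what saving and restoring those statistics could look like, assuming plain JSON as the storage format (the payload layout here is made up, not the project's actual scheme):

```python
import json

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on made-up training data.
rng = np.random.default_rng(42)
X_train = rng.normal(loc=5.0, scale=3.0, size=(50, 4))
fitted = StandardScaler().fit(X_train)

# Save time: dump plain lists; no pickle, so no version coupling.
payload = json.dumps({"mean": fitted.mean_.tolist(), "scale": fitted.scale_.tolist()})

# Load time: rebuild a fresh StandardScaler by setting its fitted attributes.
stats = json.loads(payload)
scaler = StandardScaler()
scaler.mean_ = np.asarray(stats["mean"])
scaler.scale_ = np.asarray(stats["scale"])
scaler.var_ = scaler.scale_ ** 2

# The reconstructed scaler transforms identically to the original.
X_new = rng.normal(loc=5.0, scale=3.0, size=(10, 4))
assert np.allclose(scaler.transform(X_new), fitted.transform(X_new))
```

This keeps the nice StandardScaler interface while sidestepping pickling the object across scikit-learn versions entirely.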
For now, I think it is fine to bump this to v2.1. It currently raises a warning, so at least users can decide for themselves in the meantime. Additionally, we see this in the CI because developers have different versions locally than the GitHub Actions runners use with a fresh install. I think it would be good to create the checkpoints with a fresh install once v2.0 is ready and the checkpoint files won't change anymore.