MSE is negative when returned by cross_val_score #2439

Closed
tdomhan opened this issue Sep 12, 2013 · 58 comments · Fixed by #7261

Comments

@tdomhan

tdomhan commented Sep 12, 2013

The Mean Squared Error returned by sklearn.cross_validation.cross_val_score is always negative. While this is a deliberate design decision so that the output of this function can be used for maximization given some hyperparameters, it's extremely confusing when using cross_val_score directly. At least I asked myself how the mean of a square can possibly be negative and thought that cross_val_score was not working correctly or did not use the supplied metric. Only after digging into the sklearn source code did I realize that the sign was flipped.

This behavior is mentioned for make_scorer in scorer.py; however, it's not mentioned in cross_val_score, and I think it should be, because otherwise it makes people think that cross_val_score is not working correctly.

@jaquesgrobler
Member

You're referring to

greater_is_better : boolean, default=True

Whether score_func is a score function (default), meaning high is good, 
or a loss function, meaning low is good. In the latter case, the scorer 
object will sign-flip the outcome of the score_func.

in http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html
? (just for reference's sake)

I agree that it could be made clearer in the cross_val_score docs

Thanks for reporting
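
For reference, here is a minimal sketch of what that flag does in practice (assuming the Boston housing data used elsewhere in this thread; the scorer is called on a fitted estimator):

from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_squared_error

boston = load_boston()
est = Ridge().fit(boston.data, boston.target)

# greater_is_better=False marks MSE as a loss, so the scorer negates it
neg_mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
print(mean_squared_error(boston.target, est.predict(boston.data)))  # positive MSE
print(neg_mse_scorer(est, boston.data, boston.target))              # same value, sign flipped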

@ogrisel
Member

ogrisel commented Sep 12, 2013

Indeed we overlooked that issue when doing the Scorer refactoring. The following is very counter-intuitive:

>>> import numpy as np
>>> from sklearn.datasets import load_boston
>>> from sklearn.linear_model import RidgeCV
>>> from sklearn.cross_validation import cross_val_score

>>> boston = load_boston()
>>> np.mean(cross_val_score(RidgeCV(), boston.data, boston.target, scoring='mean_squared_error'))
-154.53681864311497

/cc @larsmans

@ogrisel
Member

ogrisel commented Sep 12, 2013

BTW I don't agree that it's a documentation issue. cross_val_score should return the value with the sign that matches the scoring name. Ideally GridSearchCV(*params).fit(X, y).best_score_ should be consistent too. Otherwise the API is very confusing.

@tdomhan
Author

tdomhan commented Sep 12, 2013

I also agree that changing it to return the actual MSE, without the sign flipped, would be the better option.

The scorer object could just store the greater_is_better flag and whenever the scorer is used the sign could be flipped in case it's needed, e.g. in GridSearchCV.
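
Something along these lines (a hypothetical sketch, not the actual scikit-learn internals):

def is_better(candidate, current_best, greater_is_better=True):
    # The search loop consults the scorer's greater_is_better flag instead of
    # relying on the scorer pre-flipping the sign of the metric.
    if current_best is None:
        return True
    return candidate > current_best if greater_is_better else candidate < current_best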

@larsmans
Member

I agree that we have a usability issue here, but I don't fully agree with @ogrisel's solution that we should

return the value with the sign that matches the scoring name

because that's an unreliable hack in the long run. What if someone defines a custom scorer with a name such as mse? What if they do follow the naming pattern but wrap the scorer in a decorator that changes the name?

The scorer object could just store the greater_is_better flag and whenever the scorer is used the sign could be flipped in case it's needed, e.g. in GridSearchCV.

This is what scorers originally did, during development between the 0.13 and 0.14 releases, and it made their definition a lot harder. It also made the code hard to follow because the greater_is_better attribute seemed to disappear in the scorer code, only to reappear in the middle of the grid search code. A special Scorer class was needed to do something that, ideally, a simple function would do.

I believe that if we want to optimize scores, then they should be maximized. For the sake of user-friendliness, I think we might introduce a parameter score_is_loss with values ["auto", True, False] that only changes the display of scores and can use a heuristic based on the built-in names.
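
A rough sketch of how that could behave (purely hypothetical names, not scikit-learn code); the internals would still maximize, and the flag would only change what gets reported:

LOSS_SCORINGS = {'mean_squared_error', 'mean_absolute_error', 'log_loss'}  # illustrative only

def reported_scores(maximized_scores, scoring, score_is_loss="auto"):
    # maximized_scores are what the internals optimize (losses already negated)
    if score_is_loss == "auto":
        score_is_loss = scoring in LOSS_SCORINGS
    # for a loss, flip the sign back so the user sees the familiar positive values
    return [-s for s in maximized_scores] if score_is_loss else list(maximized_scores)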

@larsmans
Member

That was a hurried response because I had to get off the train. What I meant by "display" is really the return value from cross_val_score. I think scorers should be simple and uniform and the algorithms should always maximize.

This does introduce an asymmetry between built-in and custom scorers.

Ping @GaelVaroquaux.

@jaquesgrobler
Member

I like the score_is_loss solution, or something to that effect. The sign change to match the scoring name seems hard to maintain and could cause problems, as @larsmans mentioned.

@tdomhan
Author

tdomhan commented Sep 28, 2013

What's the conclusion? Which solution should we go for? :)

@amelio-vazquez-reina

@tdomhan @jaquesgrobler @larsmans Do you know if this applies to r2 as well? I am noticing that the r2 scores returned by GridSearchCV are also mostly negative for ElasticNet, Lasso and Ridge.

@larsmans
Member

R² can be either positive or negative, and negative simply means your model is performing very poorly.

@jnothman
Member

IIRC, @GaelVaroquaux was a proponent of returning a negative number when greater_is_better=False.

@larsmans
Member

r2 is a score function (greater is better), so that should be positive if your model is any good -- but it's one of the few performance metrics that can actually be negative, meaning worse than 0.
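
For example (a tiny illustration, not from scikit-learn itself):

from sklearn.metrics import r2_score

# R² is negative as soon as the model does worse than always predicting the mean
print(r2_score([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))  # -3.0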

@mblondel
Member

mblondel commented Feb 4, 2014

What is the consensus on this issue? In my opinion, cross_val_score is an evaluation tool, not a model selection one. It should thus return the original values.

I can fix it in my PR #2759, since the changes I made make it really easy to fix. The trick is to not flip the sign upfront but, instead, to access the greater_is_better attribute on the scorer when doing grid search.

@GaelVaroquaux
Member

What is the consensus on this issue? In my opinion, cross_val_score is
an evaluation tool, not a model selection one. It should thus return
the original values.

Special cases and varying behaviors are a source of problems in software.

I simply think that we should rename "mse" to "negated_mse" in the list
of acceptable scoring strings.

@mblondel
Member

mblondel commented Feb 4, 2014

What if someone defines a custom scorer with a name such as mse? What if they do follow the naming pattern but wrap the scorer in a decorator that changes the name?

I don't think that @ogrisel was suggesting to use name matching, just to be consistent with the original metric. Correct me if I'm wrong @ogrisel.

@mblondel
Member

mblondel commented Feb 4, 2014

I simply think that we should rename "mse" to "negated_mse" in the list of acceptable scoring strings.

That's completely unintuitive if you don't know the internals of scikit-learn. If you have to bend the system like that, I think it's a sign that there's a design problem.

@GaelVaroquaux
Member

That's completely unintuitive if you don't know the internals of scikit-learn.
If you have to bend the system like that, I think it's a sign that there's a
design problem.

I disagree. Humans understand things with a lot of prior knowledge and context. They are anything but systematic. Trying to embed this in software gives a shopping-list-like set of special cases. Not only does it make the software hard to maintain, it also means that people who do not have those exceptions in mind run into surprising behaviors and write buggy code using the library.

@mblondel
Member

mblondel commented Feb 4, 2014

What special case do you have in mind?

To be clear, I think that the cross-validation scores stored in the GridSearchCV object should also be the original values (not with sign flipped).

AFAIK, flipping the sign was introduced so as to make the grid search implementation a little simpler but was not supposed to affect usability.

@GaelVaroquaux
Member

What special case do you have in mind?

Well, the fact that for some metrics bigger is better, whereas for others
it is the opposite.

AFAIK, flipping the sign was introduced so as to make the grid search
implementation a little simpler but was not supposed to affect
usability.

It's not about grid search, it's about separation of concerns: scores need to be usable without knowing anything about them, or else code to deal with their specificities will spread across the whole codebase. There is already a lot of scoring code.

@mblondel
Member

mblondel commented Feb 4, 2014

But that's somewhat postponing the problem to user code. Nobody wants to plot "negated MSE" so users will have to flip signs back in their code. This is inconvenient, especially for multiple-metric cross-validation reports (PR #2759), as you need to handle each metric individually. I wonder if we can have the best of both worlds: generic code and intuitive results.

@GaelVaroquaux
Member

But that's somewhat postponing the problem to user code. Nobody wants
to plot "negated MSE" so users will have to flip signs back in their
code.

Certainly not the end of the world. Note that when reading papers or looking at presentations I have the same problem: when the graph is not well done, I lose a little bit of time and mental bandwidth trying to figure out whether bigger is better or not.

This is inconvenient, especially for multiple-metric cross-validation
reports (PR #2759), as you need to handle each metric individually.

Why? If you just accept that it's always "bigger is better", it makes everything easier, including the interpretation of results.

I wonder if we can have the best of both worlds: generic code and
intuitive results.

The risk is to have very complex code that slows us down for maintenance and development. Scikit-learn is picking up weight.

@mblondel
Member

mblondel commented Feb 4, 2014

If you just accept that it's always "bigger is better"

That's what she said :)

More seriously, I think one reason this is confusing people is because the output of cross_val_score is not consistent with the metrics. If we follow your logic, all metrics in sklearn.metrics should follow "bigger is better".

@GaelVaroquaux
Member

That's what she said :)

Nice one!

More seriously, I think one reason this is confusing people is because
the output of cross_val_score is not consistent with the metrics. If we
follow your logic, all metrics in sklearn.metrics should follow "bigger
is better".

Agreed. That's why I like the idea of changing the name: it would pop up
to people's eyes.

@jnothman
Member

jnothman commented Feb 4, 2014

More seriously, I think one reason this is confusing people is because the output of cross_val_score is not consistent with the metrics.

And this in turn makes scoring seem more mysterious than it is.

@Huitzilo

Got bitten by this today in 0.16.1 when trying to do linear regression. While the sign of the score is apparently no longer flipped for classifiers, it still seems to be flipped for linear regression. To add to the confusion, LinearRegression.score() returns a non-flipped version of the score.

I'd suggest making this consistent and returning the non-sign-flipped score for linear models as well.

Example:

from sklearn import linear_model
from sklearn.naive_bayes import GaussianNB
from sklearn import cross_validation
from sklearn import datasets
iris = datasets.load_iris()
nb = GaussianNB()
scores = cross_validation.cross_val_score(nb, iris.data, iris.target)
print("NB score:\t  %0.3f" % scores.mean() )

iris_reg_data = iris.data[:,:3]
iris_reg_target = iris.data[:,3]
lr = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(lr, iris_reg_data, iris_reg_target)
print("LR score:\t %0.3f" % scores.mean() )

lrf = lr.fit(iris_reg_data, iris_reg_target)
score = lrf.score(iris_reg_data, iris_reg_target)
print("LR.score():\t  %0.3f" % score )

This gives:

NB score:     0.934    # sign is not flipped
LR score:    -0.755    # sign is flipped
LR.score():   0.938    # sign is not flipped
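
One way to check whether the sign is really being flipped here (a sketch using the legacy cross_validation API from that release): compute R² per fold by hand with the same unshuffled 3-fold split that cross_val_score uses for regressors by default. If those per-fold values are negative too, the metric itself is negative rather than sign-flipped.

from sklearn import cross_validation, datasets, linear_model
from sklearn.metrics import r2_score

iris = datasets.load_iris()
X, y = iris.data[:, :3], iris.data[:, 3]
lr = linear_model.LinearRegression()

# same default split as cross_val_score for a regressor: 3 folds, no shuffling
for train, test in cross_validation.KFold(len(y), n_folds=3):
    lr.fit(X[train], y[train])
    print(r2_score(y[test], lr.predict(X[test])))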

@amueller
Member

Cross-validation flips the sign for all metrics where greater_is_better is False. I still disagree with this decision. I think the main proponents of it were @GaelVaroquaux and maybe @mblondel [I remember you refactoring the scorer code].

@amueller
Member

Oh never mind, all the discussion is above.
I feel flipping the sign by default in mse and r2 is even less intuitive :-/

@mblondel
Member

mblondel commented Jun 4, 2015

And hinge_loss I guess?

@mblondel
Member

mblondel commented Jun 4, 2015

Adding the neg_ prefix to all those losses feels awkward.

An idea would be to return the original scores (without sign flip) but instead of returning an ndarray, we return a class which extends ndarray with methods like best(), arg_best(), best_sorted(). This way the results are unsurprising and we have convenience methods for retrieving the best results.
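
A toy sketch of that idea (hypothetical Scores class, not part of scikit-learn):

import numpy as np

class Scores(np.ndarray):
    """Fold scores that keep their original sign but know whether the metric is a loss."""
    def __new__(cls, values, greater_is_better=True):
        obj = np.asarray(values, dtype=float).view(cls)
        obj.greater_is_better = greater_is_better
        return obj

    def __array_finalize__(self, obj):
        self.greater_is_better = getattr(obj, "greater_is_better", True)

    def arg_best(self):
        return self.argmax() if self.greater_is_better else self.argmin()

    def best(self):
        return self[self.arg_best()]

scores = Scores([2.5, 1.1, 3.0], greater_is_better=False)  # e.g. raw MSE values
print(scores.best(), scores.arg_best())  # 1.1 1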

@larsmans
Member

larsmans commented Jun 4, 2015

There's no scorer for hinge loss (and I've never seen it being used for evaluation).

@amueller
Member

amueller commented Jun 4, 2015

The scorer doesn't return a numpy array, it returns a float, right?
We could return a score object that has a custom ">" but looks like a float.
That feels more contrived to me than the previous solution, which was tagging the scorer with a bool "lower_is_better" that was then used in GridSearchCV.
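
For what it's worth, a toy sketch of such a float-like score object (hypothetical, only to illustrate the trade-off):

class LossValue(float):
    # Keeps the raw, positive loss, but orders itself so that a smaller loss
    # compares as "greater", which is what a maximizing search loop expects.
    def __gt__(self, other):
        return float(self) < float(other)
    def __lt__(self, other):
        return float(self) > float(other)

print(LossValue(1.5) > LossValue(2.0))        # True: the smaller loss wins
print(max([LossValue(1.5), LossValue(2.0)]))  # 1.5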

@mblondel
Member

mblondel commented Jun 4, 2015

cross_val_score returns an array.

@mblondel
Member

mblondel commented Jun 5, 2015

Actually the scores returned by cross_val_score usually don't need to be sorted, just averaged.

Another idea is to add a sorted method to _BaseScorer.

my_scorer = make_scorer(my_metric, greater_is_better=False)
scores = my_scorer.sorted(scores)  # takes into account my_scorer._sign
best = scores[0]
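
A possible implementation sketch of that method (it leans on the private _sign attribute that _BaseScorer already stores, so treat it as illustrative only):

import numpy as np

def sorted_scores(scorer, scores):
    # scorer._sign is +1 for "greater is better" metrics and -1 for losses,
    # so sorting _sign * scores in descending order always puts the best first
    scores = np.asarray(scores)
    order = np.argsort(scorer._sign * scores)[::-1]
    return scores[order]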

@amueller
Member

amueller commented Jun 5, 2015

cross_val_score returns an array, but the scorers return a float. I feel it would be odd to have specific logic in cross_val_score because you'd like to have the same behavior in GridSearchCV and in all other CV objects.

You'd also need an argsort method, because in GridSearchCV you want the best score and the best index.

@jenifferYingyiWu

How can I implement "estimate the means and variances of the workers' errors from the control questions, then compute the weighted average after removing the estimated bias for the predictions" with scikit-learn?

@amueller
Member

amueller commented Aug 2, 2016

IIRC we discussed this in the sprint (last summer?!) and decided to go with neg_mse (or was it neg-mse) and deprecate all scorers / strings where we have a negative sign now.
Is this still the consensus? We should do that before 0.18 then.
Ping @GaelVaroquaux @agramfort @jnothman @ogrisel @raghavrv

@agramfort
Member

agramfort commented Aug 2, 2016 via email

@raghavrv
Member

raghavrv commented Aug 2, 2016

It was neg_mse

@ogrisel
Member

ogrisel commented Aug 27, 2016

We also need:

  • neg_log_loss
  • neg_mean_absolute_error
  • neg_median_absolute_error
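
Once those are in, usage would look something like this (a sketch assuming a release that ships the neg_* scorer names and the model_selection module):

from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

boston = load_boston()
scores = cross_val_score(Ridge(), boston.data, boston.target,
                         scoring='neg_mean_squared_error', cv=5)
print(-scores.mean())  # negate once more to report a plain, positive MSE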

@shreyassks

from keras.models import Sequential
from keras.layers import Dense, LeakyReLU
from keras.optimizers import RMSprop
from keras import initializers, regularizers, losses

model = Sequential()
# activation layers have to be added with model.add(); bare layer calls are no-ops
model.add(Dense(11, input_dim=3,
                kernel_initializer=initializers.he_normal(seed=2),
                kernel_regularizer=regularizers.l2(2)))
model.add(LeakyReLU(alpha=0.1))
model.add(Dense(8, kernel_initializer=initializers.he_normal(seed=2)))
model.add(LeakyReLU(alpha=0.1))
model.add(Dense(4, kernel_initializer=initializers.he_normal(seed=2)))
model.add(LeakyReLU(alpha=0.1))
model.add(Dense(1, kernel_initializer=initializers.he_normal(seed=2)))
model.add(LeakyReLU(alpha=0.2))

adag = RMSprop(lr=0.0002)
model.compile(loss=losses.mean_squared_error, optimizer=adag)
history = model.fit(X_train, Y_train, epochs=2000, batch_size=20, shuffle=True)

How do I cross-validate the above code? I want to use leave-one-out cross-validation here.

@jolespin

@shreyassks this isn't the correct place for your question, but I would check this out: https://keras.io/scikit-learn-api . Wrap your network in a scikit-learn estimator, then use it with model_selection.cross_val_score.
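
Roughly like this (a sketch assuming the keras.wrappers.scikit_learn.KerasRegressor wrapper and the X_train / Y_train arrays from the earlier comment):

from keras.models import Sequential
from keras.layers import Dense, LeakyReLU
from keras.optimizers import RMSprop
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score, LeaveOneOut

def build_model():
    # a smaller version of the network above, compiled and returned for the wrapper
    model = Sequential()
    model.add(Dense(11, input_dim=3))
    model.add(LeakyReLU(alpha=0.1))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer=RMSprop(lr=0.0002))
    return model

estimator = KerasRegressor(build_fn=build_model, epochs=2000, batch_size=20, verbose=0)
scores = cross_val_score(estimator, X_train, Y_train,
                         scoring='neg_mean_squared_error', cv=LeaveOneOut())
print(-scores.mean())  # average MSE over the leave-one-out folds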

@TomMeowMeow

Yes, I totally agree! The same thing happens with brier_score_loss: it works perfectly fine on its own, but it gets confusing when GridSearchCV returns a negative brier_score_loss. At the very least, it would be better to output something like: "because brier_score_loss is a loss (lower is better), the scoring function flips the sign to make it negative."

@ghost

ghost commented Oct 6, 2019

The idea is that with cross_val_score you should focus on the absolute value of the result. As far as I know, the negative sign (-) obtained for MSE (mean squared error) in cross_val_score has no meaning in itself. Let's wait for an updated version of sklearn where this issue is taken care of.

@pritishban

For a regression use case:
model_score = cross_val_score(model, df_input, df_target, scoring='neg_mean_squared_error', cv=3)
I am getting these values:

SVR:
[-6.20938025 -1.397376 -1.94519 ]
-3.183982080147279

Linear Regression:
[-5.94898085 -9.30931808 -1.15760676]
-5.4719685646934275

Lasso:
[ -7.22363814 -10.47734135 -2.20807684]
-6.6363521107522345

Ridge:
[-5.95990385 -4.17946756 -1.36885809]
-3.8360764993832004

So which one is best?
SVR?

@pritishban

For a regression use case:
I am getting different results when I use
(1) "cross_val_score" with scoring='neg_mean_squared_error'
and
(2) "GridSearchCV" with the same inputs, checking the 'best_score_'

For regression models, which one is better?

  • "cross_val_score" with scoring='neg_mean_squared_error'
    (OR)
  • use "GridSearchCV" and check the 'best_score_'

@amueller
Member

@pritishban
You're asking a usage question. The issue tracker is mainly for bugs and new features. For usage questions, it is recommended to try Stack Overflow or the Mailing List.

scikit-learn locked this issue as resolved and limited conversation to collaborators on Dec 17, 2019