
For some objectives where baseline was 0, "pct better than baseline" is nan #1449

Closed
rpeck opened this issue Nov 20, 2020 · 9 comments · Fixed by #1809

rpeck commented Nov 20, 2020

Here are the percent_better_than_baseline_all_objectives values I'm seeing for a non-baseline pipeline:

{'F1': nan,
 'MCC Binary': nan,
 'Log Loss Binary': 93.29789549298991,
 'AUC': 58.36492736629537,
 'Precision': nan,
 'Balanced Accuracy Binary': 63.46659876071641,
 'Accuracy Binary': 12.876088314169193}

I've created a Jupyter notebook that reproduces this problem in evalml, and attached it and the associated datafile to a thread in Slack.

@rpeck rpeck added the "blocker" label (An issue blocking a release.) Nov 20, 2020
@rpeck rpeck pinned this issue Nov 20, 2020
@rpeck rpeck unpinned this issue Nov 20, 2020
@dsherry dsherry added the "bug" label (Issues tracking problems with existing features.) and removed the "blocker" label (An issue blocking a release.) Nov 20, 2020
@dsherry dsherry changed the title from "we're sometimes getting nans for metrics in percent_better_than_baseline_all_objectives" to "Nans in percent_better_than_baseline_all_objectives" Nov 20, 2020

dsherry commented Nov 20, 2020

Reproducer

import evalml
import pandas as pd
X = pd.read_csv('~/Downloads/fraud_500_data.csv').drop(['id', 'expiration_date'], axis=1)
y = X.pop('fraud')
automl = evalml.automl.AutoMLSearch(problem_type="binary", objective="f1")
automl.search(X, y)
# note that all percent_better_than_baseline values are nan in the rankings table
print(automl.rankings)
# can also check the scores of any pipeline other than the baseline pipeline, which should have id 0
print(automl.results['pipeline_results'][1]['percent_better_than_baseline_all_objectives'])

Dataset is here

freddyaboulton commented Nov 20, 2020

@dsherry @rpeck This is expected behavior: the baseline pipeline scores 0 on the objectives that show NaN (F1, MCCBinary, Precision). We've discussed making division by 0 return either infinity or None in this method, but we've never decided those are better than NaN, because if the baseline scores the worst possible value on an objective, comparing "percent better" on that objective doesn't do much good, and that can be conveyed with None, NaN, or infinity.

That being said, there may be other reasons to pick one of these options over NaN!
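For reference, here is a minimal standalone sketch of the relative "percent better" computation being described. percent_better_relative is a hypothetical helper, not evalml's implementation, and it ignores whether greater or lower is better for the objective:

import math

def percent_better_relative(score, baseline):
    # Relative improvement over the baseline, expressed as a percentage.
    # Current behavior described above: division by zero is reported as NaN.
    if baseline == 0:
        return float("nan")
    return (score - baseline) / abs(baseline) * 100

print(percent_better_relative(0.72, 0.55))             # ~30.9 -- ordinary case
print(percent_better_relative(0.64, 0.0))              # nan   -- baseline F1/MCC/Precision of 0
print(math.isnan(percent_better_relative(0.64, 0.0)))  # True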


rpeck commented Nov 20, 2020

@freddyaboulton Ah, makes sense! I'll change the test to skip over any objective where the baseline is 0. Thanks!

@dsherry dsherry added the "enhancement" label (An improvement to an existing feature.) and removed the "bug" label (Issues tracking problems with existing features.) Nov 20, 2020
@dsherry dsherry changed the title from "Nans in percent_better_than_baseline_all_objectives" to "For some objectives, percent_better_than_baseline_all_objectives is nan if baseline was 0" Nov 20, 2020
@dsherry dsherry changed the title from "For some objectives, percent_better_than_baseline_all_objectives is nan if baseline was 0" to "For some objectives where baseline was 0, "pct better than baseline" is nan" Nov 20, 2020

dsherry commented Nov 20, 2020

Thank you @freddyaboulton! @rpeck sorry I didn't catch this when you were asking me about it yesterday.

Leaving this issue open to discuss: should we change the behavior in this case?

@freddyaboulton so F1, MCCBinary and Precision are all metrics where greater is better, bounded in [-1, 1] (MCC, a correlation) or [0, 1] (F1, Precision). Could we alter the pct-improvement implementation to compute the absolute difference from the baseline and use that as the pct improvement? If that's what we were doing currently, I wouldn't expect a baseline of 0 to produce a nan pct improvement for those metrics.


freddyaboulton commented Nov 20, 2020

@dsherry We proposed computing the absolute difference for objectives bounded by [0, 1] in the design phase, but we decided having two different computations would be confusing. That said, we should maybe reconsider, given that the baseline pipeline is almost designed to score 0 on those objectives. Worth noting that when we first made that decision, we were only computing the percent better for the primary objective (which, except for regression, is not one of these bounded objectives).

Even if we do compute the absolute difference, we may want to consider changing the NaN/None/inf division-by-0 behavior. One interesting case to consider is R2: in most cases it falls in [0, 1], but it's technically (-inf, 1]. So computing the absolute difference may not be mathematically sound for it, but since it's the default objective for regression, we should expect to see lots of baselines scoring 0.

@dsherry dsherry added this to the Sprint 2021 Jan B milestone Jan 21, 2021
@freddyaboulton

So to summarize, there are two independent changes we can make, leading to four possible outcomes:

  1. Do not compute the absolute difference for objectives bounded in [0, 1]; division by 0 is NaN. Current behavior.
  2. Do not compute the absolute difference for objectives bounded in [0, 1]; division by 0 is inf.
  3. Compute the absolute difference for objectives bounded in [0, 1]; division by 0 is NaN.
  4. Compute the absolute difference for objectives bounded in [0, 1]; division by 0 is inf.

Although I prefer returning NaN when we divide by 0, the gut reaction of users when they see NaN has been to suppose something broke in automl. I think returning inf would make it clearer that nothing broke and that the pipeline is in fact better than the baseline.

That leaves options 2 and 4.

I think having two different computations for "percent better" will make it harder to communicate to users what's actually being computed for each pipeline. That being said, our baseline pipelines are designed to score 0 for a lot of objectives (R2, F1, MCC) especially in imbalanced problems (we just predict the mode). That makes the "percent better" feature not very useful for most realistic problems since all pipelines will be "infinitely" better than the baseline.

I think I'm leaning 55% for option 4 and 45% for option 2 but I'd like to hear other viewpoints before making that change!

@freddyaboulton freddyaboulton self-assigned this Feb 2, 2021

dsherry commented Feb 4, 2021

In standup today we decided it's time to update the "pct better than baseline" behavior. We're going with options 2 and 4 above:

  • Use relative difference for objectives without bounds (MSE, log loss, etc)
  • Use absolute difference for objectives with [0, 1] bounds (AUC, R2, etc)
  • We'll have to handle edge cases like Pearson correlation ([-1, 1])
  • Return inf rather than nan if there's a divide-by-0 error

@freddyaboulton does this match what we discussed?
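A rough sketch of the agreed-on behavior follows, assuming a hypothetical percent_better helper (not the evalml implementation) and ignoring whether greater or lower is better for each objective:

def percent_better(score, baseline, bounded_01=False):
    if bounded_01:
        # Absolute difference, in percentage points, for [0, 1]-bounded objectives.
        return (score - baseline) * 100
    if baseline == 0:
        # Divide-by-0: return inf instead of nan -- nothing broke, the pipeline
        # is simply better than a baseline that scored 0.
        return float("inf")
    # Relative difference for unbounded objectives.
    return (score - baseline) / abs(baseline) * 100

print(percent_better(0.75, 0.50, bounded_01=True))  # 25.0 (e.g. an AUC-style objective)
print(percent_better(0.40, 0.0))                    # inf  (unbounded objective, baseline of 0)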


rpeck commented Feb 8, 2021

Like. :-)


rpeck commented Feb 8, 2021

Further: I agree with the decision. IMO, if a metric is [usually, at least] 0..1, then going from 0 to 0.2 feels like a 20% improvement, even though mathematically it isn't. In a way, this reminds me of all of those formulas that take the log of a quantity, but they add 1 first so that they don't take the log of 0. 🙂
