
[GSK-1279] More rigorous evaluation of significance of performance metrics #1162

Open
2 tasks done
mattbit opened this issue Jun 9, 2023 · 3 comments

mattbit (Member) commented Jun 9, 2023

Following feedback from user KD_A on Reddit, who recommends sounder handling of statistical significance to prevent selection bias, in particular using a Benjamini–Hochberg procedure to control the false discovery rate.

The problem is that we currently test many data slice + metric candidates without accounting for selection bias → this can lead to a high number of false positive detections.

To do

  • Add simple stat tests to the current implementation and measure the significance on the test models already in the pytest fixtures → do we have detections with high p-values?
  • If we do, check whether we can expose an FDR parameter in PerformanceBiasDetector and filter the detections based on their p-values with the Benjamini–Hochberg procedure (see the sketch below).
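
As a rough illustration of that filtering step (hypothetical p-values, not tied to the current detector API), a plain-NumPy sketch of the Benjamini–Hochberg step-up procedure might look like:

```python
import numpy as np

def benjamini_hochberg(p_values, fdr=0.05):
    """Boolean mask of the detections to keep at the given false discovery rate.

    Plain-NumPy sketch of the Benjamini–Hochberg step-up procedure;
    statsmodels.stats.multitest.multipletests(..., method="fdr_bh") is an
    off-the-shelf alternative.
    """
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # Largest k (1-indexed) such that p_(k) <= (k / m) * fdr; keep detections 1..k.
    below = ranked <= (np.arange(1, m + 1) / m) * fdr
    keep = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        keep[order[: k + 1]] = True
    return keep

# Hypothetical p-values for the slice + metric detections found by the scan:
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74], fdr=0.10))
```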

From SyncLinear.com | GSK-1279

mattbit self-assigned this Jun 9, 2023
mattbit added the scan Created by Linear-GitHub Sync label Jun 9, 2023

mattbit (Member, Author) commented Jun 23, 2023

This was mostly addressed in #1193, although the Benjamini–Hochberg procedure is not enabled by default (because statistical tests on metrics like balanced accuracy pose problems).

mattbit reopened this Oct 18, 2023

mattbit (Member, Author) commented Oct 18, 2023

Not completed yet

kddubey commented Apr 30, 2024

Hello,

It's KD_A from Reddit. I purged my account recently, so the linked Reddit comment is no longer available. Posting it and the next reply here for posterity:

First reply

Thanks for the response.

I realized I misphrased the problem as multiple testing. It's more accurate to categorize it as selection bias: if 100s of slice+metric combinations are examined, then the observed worst n drops from the global average (where n is kind of small) are likely overestimates. The degree of overestimation gets worse as the rank of the drop gets closer to 1. See the intro of this paper (which also contains a bias-corrected estimator):

Efron, Bradley. "Tweedie’s formula and selection bias." Journal of the American Statistical Association 106.496 (2011): 1602-1614.

Given this fact, my main concern as a user would be how much I should trust the alerts. Have Giskard's alerts and estimates been empirically evaluated? For example, for alerts, what's the probability that a drop is practically significant/worrisome given that Giskard alerted on it? One way to answer this question is to split off another large test set, and evaluate Giskard's alerts (from an independent test set) on it.

“the statistical significance is always pretty high”

2 potential concerns:

  1. In making this determination, were p-values examined before or after correcting for multiple comparisons? Correction methods can greatly increase p-values; they can turn many significant findings into insignificant ones. So it'd be important to make this determination after correction.

  2. When running these tests, was the null value the global average, and the alternative hypothesis that the drop is less than the global average? This may not be the right test to run if the user only cares about slices where the drop is "practically significantly" worse. For example, for a global accuracy of 0.78, it's reasonable that a user only cares about a drop which is at least 0.28 b/c that's worse than 0.5 accuracy. Testing for slice accuracy < 0.5 will result in much higher p-values than testing for slice accuracy < 0.78.
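
To make the contrast in point 2 concrete, here is a toy sketch (hypothetical counts) comparing the two choices of null value with scipy.stats.binomtest:

```python
from scipy.stats import binomtest

# Hypothetical slice: 55 correct predictions out of 100 (global accuracy 0.78).
k, n = 55, 100

# One-sided test against the global accuracy vs. against a practical 0.5 threshold.
p_vs_global = binomtest(k, n, p=0.78, alternative="less").pvalue
p_vs_threshold = binomtest(k, n, p=0.50, alternative="less").pvalue

print(f"H0: slice accuracy = 0.78 -> p = {p_vs_global:.2g}")     # tiny: easy to flag
print(f"H0: slice accuracy = 0.50 -> p = {p_vs_threshold:.2g}")  # large: much harder to flag
```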

I'm not advocating for displaying hypothesis test results to users. But I do think that running good testing procedures in the background will help in filtering out false alerts.

“When I started working on this, I thought measuring significance (and thus handling multiple comparisons) would be a major concern and started looking into things like alpha spending/alpha investing to control false positives.”

In case you end up going down this route again, the Benjamini-Hochberg procedure is a super easy and fast way to control the false discovery/alert rate. It seems more applicable to Giskard than sequential correction procedures.

Second reply

“If you have better recommendations on how to improve this while keeping it simple, I’m definitely interested.”

A test for relative difference in (mean) score could work. Assuming higher scores are better:

H0: (complement score - slice score)/(complement score) = 1/5

H1: (complement score - slice score)/(complement score) > 1/5

The null value, 1/5, was chosen assuming that the user only cares about differences where the model performs 80% as well (or worse) on the selected slice as it does on the complement. Feel free to decrease it to e.g., 1/10, b/c there's some tolerance for false positives.

Avoid worrying about analytically computing the distribution of the test statistic by running a permutation test. All you have to do is supply a function which computes the relative difference in means as the statistic to scipy.stats.permutation_test. Here's an example I just wrote for the accuracy metric.
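
That linked example isn't reproduced in this thread, but a minimal sketch in the same spirit (toy 0/1 accuracy data, and testing the simpler null of zero relative difference rather than the 1/5 threshold above) could look like:

```python
import numpy as np
from scipy import stats

def relative_drop(slice_correct, complement_correct):
    """Relative drop in mean accuracy on the slice vs. its complement."""
    slice_acc = np.mean(slice_correct)
    complement_acc = np.mean(complement_correct)
    return (complement_acc - slice_acc) / complement_acc

# Toy data: 1 = correct prediction, 0 = incorrect.
rng = np.random.default_rng(0)
slice_correct = rng.binomial(1, 0.65, size=120)       # slice accuracy around 0.65
complement_correct = rng.binomial(1, 0.80, size=900)  # complement accuracy around 0.80

res = stats.permutation_test(
    (slice_correct, complement_correct),
    relative_drop,
    permutation_type="independent",  # pool the points and reassign them to slice/complement
    alternative="greater",           # H1: the slice performs worse than its complement
    n_resamples=9999,
    random_state=0,
)
print(f"observed relative drop: {res.statistic:.3f}, p-value: {res.pvalue:.4f}")
```

Building the 1/5 threshold into the test itself needs more care, since permuting the pooled data only simulates the no-difference null; one option is to keep the permutation p-value as a filter and additionally require the observed relative drop to exceed the threshold.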

Everything else you mentioned makes sense. Thank you for the discussion!
