
[WIP] Blog post on predict_proba #152

Draft
wants to merge 7 commits into main

Conversation


@aperezlebel commented Dec 2, 2022

Closes #147.

Work in progress.

TODO:

  • Update illustration image + credits
  • Update title
  • Check content

@aperezlebel mentioned this pull request Dec 2, 2022
These probability estimates are typically accessible from the `predict_proba` method of scikit-learn's classifiers.

However, the quality of the estimated probabilities must be validated to ensure trustworthiness, fairness, and robustness to operating conditions.
To be reliable, the estimated probabilities must be close to the true underlying posterior probabilities of the classes `P(Y=1|X)`.
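
For a minimal sketch of where these estimates come from, assuming a synthetic dataset and a logistic regression chosen purely for illustration:

```python
# Illustrative setup: synthetic data and a logistic regression (both arbitrary choices).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

labels = clf.predict(X_test)        # hard class predictions
probas = clf.predict_proba(X_test)  # estimated P(Y=k|X), one column per class
```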
Member

As you know, there are several points of view on what such a probability may mean (controlled average error rate versus controlled individual probabilities). Maybe it would be good to first explain what these two mean.

Similarly to validating a discriminant classifier through accuracy or ROC curves, tools have been developed to evaluate a probabilistic classifier.
Calibration is one of them [1-4]: it is used as a proxy to evaluate how close the estimated probabilities are to the true ones. Many recalibration techniques have been developed to improve the estimated probabilities (see [scikit-learn's user guide on calibration](https://scikit-learn.org/stable/modules/calibration.html)). The estimated probabilities of a calibrated classifier can be interpreted as the probability of correctness within the population sharing the same estimated probability, but not as the true posterior class probability.
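
As a rough sketch of what this looks like in practice (reusing the illustrative `clf` and data split from above), one can inspect a reliability diagram and apply a post-hoc recalibration:

```python
# Sketch: inspecting calibration and recalibrating (illustrative, not prescriptive).
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# Reliability-diagram data: fraction of positives vs. mean predicted probability per bin.
prob_pos = clf.predict_proba(X_test)[:, 1]
frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=10)

# Post-hoc recalibration; isotonic regression is one built-in option, sigmoid the other.
calibrated = CalibratedClassifierCV(clf, method="isotonic", cv=5).fit(X_train, y_train)
prob_pos_cal = calibrated.predict_proba(X_test)[:, 1]
```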

Indeed, it is important to highlight that calibration only captures part of the error on the estimated probabilities. The remaining term is the grouping loss [5]. Together, the calibration and grouping losses fully characterize the error on the estimated probabilities, called the epistemic loss.
Member

You are too formal. Give us the intuitions of why calibration is not the full story, rather than the maths.


$$\text{Epistemic loss} = \text{Calibration loss} + \text{Grouping loss}$$

Member

First mention Brier score for model selection. Later mention grouping loss.


However, estimating the grouping loss is a harder problem than calibration, as its estimation directly involves the true probabilities. Recent work has focused on approximating the grouping loss through local estimations of the true probabilities [6].

When working with scikit-learn's classifiers, users must be equally cautious about results obtained from `predict_proba` as about results from `predict`. Both output estimated quantities (probabilities and labels, respectively) with no prior guarantee on their quality. In both cases, the model's quality must be assessed with appropriate metrics: expected calibration error, Brier score, accuracy, ROC AUC.
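
A short sketch of such an assessment (again reusing the illustrative setup above; an expected calibration error would typically be derived from the `calibration_curve` bins shown earlier):

```python
# Sketch: assessing both the labels and the probabilities with appropriate metrics.
from sklearn.metrics import accuracy_score, brier_score_loss, roc_auc_score

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

print("accuracy:   ", accuracy_score(y_test, y_pred))   # quality of the hard labels
print("ROC AUC:    ", roc_auc_score(y_test, y_prob))    # ranking quality of the scores
print("Brier score:", brier_score_loss(y_test, y_prob)) # overall quality of the probabilities
```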
Member

Put links to relevant pages in the scikit-learn documentation


Member

Maybe mention quickly (with a separate section title) recalibration (and link to corresponding docs)

@lorentzenchr
Member

@aperezlebel While I value the effort for such a blog post, I do not agree with some parts of its current content, e.g. the grouping loss. Unfortunately, I can't promise a fast review atm.

@aperezlebel
Author

@lorentzenchr I appreciate your feedback, thanks. Could you elaborate on the parts you disagree with?

@lorentzenchr
Member

I have 3 main points of critique:

  1. The main message is unclear to me. If I fit a logistic regression, do I need to re-calibrate it, and how do I assess the calibration after all?
  2. As @GaelVaroquaux knows, I prefer the decomposition of proper scoring rules in terms of reliability (or miscalibration), resolution (or discrimination) and uncertainty (or entropy), according to Murphy's original decomposition of the Brier score and many others after him, e.g. Bröcker. I would prefer to avoid the term "grouping loss".
  3. You cite yourself, exclusively for one topic.
