
FEA add TunedThresholdClassifier meta-estimator to post-tune the cut-off threshold #26120

Merged
merged 228 commits into scikit-learn:main on May 3, 2024

Conversation

@glemaitre (Member) commented Apr 7, 2023

supersedes #16525
closes #16525
closes #8614
closes #10117
supersedes #10117

builds upon #26037

relates to #4813

Summary

We introduce a TunedThresholdClassifier that post-tunes the cut-off point used to convert the soft decisions of decision_function or predict_proba into the hard decisions returned by predict.

Important features to keep in mind:

objective_metric: the objective metric can be set either to a single metric to be maximized, or to a pair of metrics where one is optimized under a constraint on the other (to find a trade-off). Additionally, we can pass a cost/gain matrix that can be used to optimize a business metric. For this case, we are limited to constant costs/gains. In the future, we could think of costs/gains that depend on the matrix X, but we would then need to be able to forward metadata to the scorer (a good additional use case for SLEP006 @adrinjalali). A minimal usage sketch follows this list.

cv and refit: we provide some flexibility to pass an already-fitted model or a single train/test split. We add limitations and document the caveats with an example.
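
A minimal sketch of the intended usage, following the parameter names described above (objective_metric, constraint_value, cv); note that these names were still under discussion in this PR and differ from the final merged API:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifier  # as added by this PR

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)
model = TunedThresholdClassifier(
    estimator=LogisticRegression(),
    objective_metric="max_tpr_at_tnr_constraint",  # maximize TPR...
    constraint_value=0.7,                          # ...under the constraint TNR >= 0.7
    cv=5,
)
model.fit(X, y)                      # the cut-off is tuned on internal validation splits
hard_predictions = model.predict(X)  # `predict` now uses the tuned threshold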

Points to discuss

  • Are we fine with the name TunedThresholdClassifier? Or shall we instead have something about "threshold" (e.g. ThresholdTuner)?
  • We are using the terms objective_metric, constraint_value and objective_score. Is the naming fine? Would "utility" be a better alternative to "objective"?

Further work

I currently implemented a single example that shows the feature in the context of post-tuning the decision threshold.

The current example uses a single train/test split for the figure, and I think it would be nice to also show ROC/precision-recall curves obtained from cross-validation to be complete. However, this requires some new features to be implemented.

I am also planning to analyse the usage of this feature for the problem of calibration on imbalanced classification problems. The feeling on this topic is that resampling strategies involve an implicit tuning of the decision threshold at the cost of a badly calibrated model. It might be better to learn a model on the imbalanced problem directly, make sure that it is well calibrated, and then post-tune the decision threshold for "hard" predictions. In this case, you get the best of both worlds: a calibrated model if the output of predict_proba is important to you, and an optimal hard predictor for your specific utility metric. However, this needs some investigation and is better suited for another PR.
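
For concreteness, the workflow described in the last paragraph could look roughly like the sketch below. This is only an illustration: CalibratedClassifierCV, the isotonic method, and the exact tuner parameters are assumptions for the example, not part of this PR.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifier  # as added by this PR

X, y = make_classification(n_samples=1_000, weights=[0.95, 0.05], random_state=0)

# 1. learn on the imbalanced problem directly and calibrate the probabilities
calibrated = CalibratedClassifierCV(LogisticRegression(), method="isotonic", cv=5)

# 2. post-tune the decision threshold for the utility metric of interest
tuned = TunedThresholdClassifier(estimator=calibrated, objective_metric="balanced_accuracy")
tuned.fit(X, y)

probabilities = tuned.predict_proba(X)  # calibrated probabilities from the inner model
hard_labels = tuned.predict(X)          # hard labels using the post-tuned cut-off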

@glemaitre glemaitre changed the title from "Cutoff classifier again" to "FEA add CutOffClassifier meta-estimator to post-tune the decision threshold" Apr 7, 2023
@glemaitre glemaitre marked this pull request as draft April 7, 2023 12:38
@adrinjalali (Member) left a comment

Had a look into the API

Comment on lines 614 to 634
if curve_scorer is roc_curve or (
    isinstance(curve_scorer, _BaseScorer) and curve_scorer._score_func is roc_curve
):
    fpr, tpr, potential_thresholds = curve_scorer(
        classifier, X_val, y_val, **score_params_val
    )
    # For fpr=0/tpr=0, the threshold is set to `np.inf`. We need to remove it.
    fpr, tpr, potential_thresholds = fpr[1:], tpr[1:], potential_thresholds[1:]
    # thresholds are in decreasing order
    return potential_thresholds[::-1], ((1 - fpr)[::-1], tpr[::-1])
elif curve_scorer is precision_recall_curve or (
    isinstance(curve_scorer, _BaseScorer)
    and curve_scorer._score_func is precision_recall_curve
):
    precision, recall, potential_thresholds = curve_scorer(
        classifier, X_val, y_val, **score_params_val
    )
    # thresholds are in increasing order
    # the last element of the precision and recall is not associated with any
    # threshold and should be discarded
    return potential_thresholds, (precision[:-1], recall[:-1])
Member

I would rather have these in two separate CurveScorers and not have this specific code here. It looks odd to have this very specific treatment of the score functions here. This method shouldn't have to care about it.

Member Author

I would rather have these in two separate CurveScorers and not have this specific code here.

I'm not sure I understand exactly what you want. Here, we call make_scorer on these parametric curves (not _CurveScorer), and since the API of the two functions is not consistent, we need this specific treatment.

Could you provide a small pseudo-code of your vision? Maybe I would understand better.

Member

What I mean is to have a new CurveScorerMixin and have roc_curve and precision_recall_curve each be such a scorer as well.
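
A rough sketch of this suggestion (class names are hypothetical, not scikit-learn API; the ordering fixes mirror the snippet quoted above):

from sklearn.metrics import precision_recall_curve, roc_curve

class _BaseCurveScorer:
    """Return (thresholds, curve_values) with thresholds in increasing order."""

    def __call__(self, classifier, X, y, **score_params):
        raise NotImplementedError

class _RocCurveScorer(_BaseCurveScorer):
    def __call__(self, classifier, X, y, **score_params):
        y_score = classifier.predict_proba(X)[:, 1]  # binary case, positive class
        fpr, tpr, thresholds = roc_curve(y, y_score)
        # drop the leading `np.inf` threshold and flip to increasing order
        return thresholds[1:][::-1], ((1 - fpr)[1:][::-1], tpr[1:][::-1])

class _PrecisionRecallCurveScorer(_BaseCurveScorer):
    def __call__(self, classifier, X, y, **score_params):
        y_score = classifier.predict_proba(X)[:, 1]  # binary case, positive class
        precision, recall, thresholds = precision_recall_curve(y, y_score)
        # the last precision/recall values have no associated threshold
        return thresholds, (precision[:-1], recall[:-1])

With such classes, the tuning method would only call the curve scorer and never special-case roc_curve versus precision_recall_curve.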

Member

As I mentioned earlier, I also think that there's room for improvement in the scorers' design to simplify this, but it is the kind of internal detail that I think can be sorted out in a follow-up PR because it doesn't impact the public API.

Member Author

Indeed, we had this exact discussion with @jeremiedbb; as he mentioned, we thought about not addressing it in this PR.

Member

It is the kind of technical debt that we tend not to pay off immediately afterwards, that's why I'm worried. Could you please open an issue for this?

Member

+1 for creating an issue after merging, because it's hard to point to code that does not exist yet 😃

Member

We can point to this PR

Member Author

I opened the following issue to be edited in the future: #28941

Comment on lines 689 to 711
objective_metric : {"max_tpr_at_tnr_constraint", "max_tnr_at_tpr_constraint", \
        "max_precision_at_recall_constraint", "max_recall_at_precision_constraint"}, \
        str, dict or callable, default="balanced_accuracy"
    The objective metric to be optimized. Can be one of:

    * a string associated to a scoring function for binary classification
      (see model evaluation documentation);
    * a scorer callable object created with :func:`~sklearn.metrics.make_scorer`;
    * `"max_tnr_at_tpr_constraint"`: find the decision threshold for a true
      positive rate (TPR) of `constraint_value`;
    * `"max_tpr_at_tnr_constraint"`: find the decision threshold for a true
      negative rate (TNR) of `constraint_value`;
    * `"max_precision_at_recall_constraint"`: find the decision threshold for a
      recall of `constraint_value`;
    * `"max_recall_at_precision_constraint"`: find the decision threshold for a
      precision of `constraint_value`.

constraint_value : float, default=None
    The value associated with the `objective_metric` metric for which we
    want to find the decision threshold when `objective_metric` is either
    `"max_tnr_at_tpr_constraint"`, `"max_tpr_at_tnr_constraint"`,
    `"max_precision_at_recall_constraint"`, or
    `"max_recall_at_precision_constraint"`.
Member

This is, in a sense, reinventing what we already have in GridSearch as a callable refit. I find the API here quite odd. It also doesn't make the implementation here clean.

I think it would make more sense to accept a callable here that takes the cv results and gives you what needs to be used; then we provide 4 pre-defined callables which users (and we, internally) can use, and then we might consider allowing a shortcut for using those callables.
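
For reference, the existing GridSearchCV pattern being referred to: refit can be a callable that receives cv_results_ and returns the index of the selected candidate (the parameter grid below is only a placeholder):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def select_best(cv_results):
    # custom selection logic over the CV results; here simply the highest mean score
    return int(np.argmax(cv_results["mean_test_score"]))

X, y = make_classification(random_state=0)
search = GridSearchCV(
    LogisticRegression(), param_grid={"C": [0.1, 1.0, 10.0]}, refit=select_best
)
search.fit(X, y)  # `search.best_index_` is whatever `select_best` returned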

Member Author

I think that it makes the code cleaner but not necessarily more accessible for end users. Having to provide a string and a single value is a low entry bar. Providing a callable (even a pre-defined one) means that people need to understand what this thing is doing, and discoverability is also more complex.

However, this again looks like the dilemma between a score provided as a string or as a scorer. So maybe the current API will bite us in the future. At least here, we should not need to extend the options.

@jeremiedbb @ogrisel what are your thoughts on the API? Do you see an alternative to what is discussed here?

Member

I had a similar concern, assuming that users will need more flexibility, e.g. a constraint on different metrics. I also don't find the current 4 strings ideal, but I understand that they probably represent 90%+ of the use cases, so I think it's acceptable as a start. We can add the option for a callable later if we receive requests.

Member

This is what I think users should do:

clf = TunedThresholdClassifierCV(
    estimator=LogisticRegression(),
    objective=MaxTprAtTnrConstraint(tnr=0.3),
    ...
)

And we expose those callable classes to users, and document clearly how users can write their own.

This also makes the implementation here in this class much cleaner.

Member

I also think it's worth special-casing the precision/recall and TNR/TPR trade-offs with the string / float value combo because:

  • it makes the very common cases easy to implement without having to write custom filtering logic,
  • the intent of an expression such as TunedThresholdClassifier(clf, objective_metric="max_precision_at_recall_constraint", constraint_value=0.3) is more easily understandable in code than something like:
def max_precision_at_recall(results_at_thresholds, recall=0.3):
   # complex user written code here
   ...
   return selected_threshold_idx

TunedThresholdClassifier(
    clf, objective_metric=["precision", "recall"], constraint=max_precision_at_recall
)

So I am +1 for keeping it this way in this first iteration and exploring more flexible custom filtering logic in a follow-up PR.

Member

I personally don't mind either way, but I consider both of them better than what we have now. Historically, we haven't had a good experience passing a metric name and then metric parameters to estimators. We've had things like Estimator(..., metric=..., p=..., metric_params=...) and I prefer the metric that is passed in to carry all that information, since that's where it belongs.

Member

Well, I'd prefer to have metrics as objects as well, but every time the discussion came up, there was strong opposition against it.

Member

The reason I'm not happy with going forward with this as is, is that I don't want us to release and then immediately deprecate constraint_value, which I think doesn't belong to the estimator but to the constraint.

Member

We have many estimators with coupled parameters. On the other hand, we almost never have parameters with a complex type, so this would actually make it a special case in our API. I also think a motivation for keeping separate parameters is to make parameter search easier.

I don't think we'll need to deprecate constraint_value right away. We can keep the combination objective_metric=<str> + constraint_value=<value>, which is easy to use and covers most use cases, and add support for objective_metric=callable for advanced use cases.

Member Author

Be aware that we have exactly the same issue with pos_label (@glemaitre shooting himself in the foot).

sklearn/model_selection/_classification_threshold.py (outdated, resolved)
@ogrisel (Member) left a comment

Some more feedback.

sklearn/model_selection/_classification_threshold.py (outdated, resolved)
sklearn/model_selection/_classification_threshold.py (outdated, resolved)
sklearn/model_selection/_classification_threshold.py (outdated, resolved)
sklearn/model_selection/_classification_threshold.py (outdated, resolved)
examples/model_selection/plot_tuned_decision_threshold.py (outdated, resolved)
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
sklearn/model_selection/_classification_threshold.py (outdated, resolved)
"max_recall_at_precision_constraint",
}
),
StrOptions(set(get_scorer_names())),
Member

We should limit to classification metrics

Member Author

Hmm, I'm not sure how we should handle that. We don't have any public mechanism available (it only exists in the tests). I recall having opened such a PR at some point: #17930

Member

let's figure this out separately then :)

sklearn/model_selection/_classification_threshold.py (outdated, resolved)
objective_metric : {"max_tpr_at_tnr_constraint", "max_tnr_at_tpr_constraint", \
        "max_precision_at_recall_constraint", "max_recall_at_precision_constraint"}, \
        str, dict or callable, default="balanced_accuracy"
objective_metric : str, dict or callable, default="balanced_accuracy"
Member

shouldn't this be called scoring, the same as in *SearchCV? We're optimizing for one or more metrics using a cv object, exactly like *SearchCV; it would be nice to have a consistent name for this.

Member

I agree that it makes the API more consistent

Member Author

I was trying to find where we had this discussion in the past but I could not find it.
I think that the fact that we had constrained metrics made it different enough from the "scoring" parameter that we decided to change the name.

Member

I'm pretty optimistic that we can find an API which we can share with *SearchCV that is both user friendly and capable of handling the use cases we want. And as @jeremiedbb says in #26120 (review), we're not far from nicely supporting it. So I'd say we can safely rename the parameter.

Member Author

Now I have a question: do we need the pos_label and response_method parameters? Those two can be delegated to make_scorer: response_method is a parameter of that function, while pos_label is a parameter of the function provided to make_scorer and is forwarded to it.

Member

The downside comes down to "how many people are using make_scorer?"; the people who don't are fine with the defaults.

@jeremiedbb (Member) commented May 3, 2024

Currently, if I want to tune a LogisticRegression with a grid search and recall as scoring, either I pass scoring="recall" and rely on the defaults, or I pass a scorer made out of recall_score if I want to change the pos_label. I think it's fine to mimic this behavior.
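
For concreteness, the two options mentioned above look like this with the standard scoring API (the parameter grid is only a placeholder):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.1, 1.0, 10.0]}

# rely on the default positive label
search_default = GridSearchCV(LogisticRegression(), param_grid, scoring="recall")

# or build a scorer out of recall_score to change the positive label
search_pos0 = GridSearchCV(
    LogisticRegression(), param_grid, scoring=make_scorer(recall_score, pos_label=0)
)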

Member

But they will be requested for the FixedThresholdClassifier because there we don't have any metric.

For this one, I think we indeed need to keep them, because if a user passes a threshold of, say, 0.3, we need to know whether it applies to a probability or to a decision function value.

Member

Actually, the response_method of the estimator and the response_method for the scorer are not the same, so we need to keep it.

Member Author

Indeed. response_method is dispatched to _CurveScorer and not to the scorer. It would be weird to ask our users to pass make_scorer(balanced_accuracy_score, response_method="predict_proba") when that scorer alone does not make any sense.

@jeremiedbb (Member) left a comment

I took a quick look, and all references to constrained metrics seem to have been removed.

Also, I think the constrained-metric use case is still doable with the current API. I expect that something like the following works:

import numpy as np
from sklearn.metrics import make_scorer, recall_score

def max_tpr_at_tnr_constraint_score(y_true, y_pred, min_tnr):
    # TPR is the recall of the positive class; TNR is the recall of the negative class
    tpr = recall_score(y_true, y_pred, pos_label=1)
    tnr = recall_score(y_true, y_pred, pos_label=0)
    if tnr < min_tnr:  # reject thresholds that violate the TNR constraint
        return -np.inf
    return tpr

my_scorer = make_scorer(max_tpr_at_tnr_constraint_score, min_tnr=0.5)
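
If that holds, such a scorer would presumably be passed as the tuner's scoring parameter, e.g. (hypothetical usage, assuming the CV-suffixed class name from the discussion above and its import path):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV  # assumed import path

tuned = TunedThresholdClassifierCV(LogisticRegression(), scoring=my_scorer)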

doc/modules/classification_threshold.rst (outdated, resolved)
@glemaitre (Member Author)

I think that this is good for another round of review.

@jeremiedbb (Member) left a comment

LGTM

@glemaitre (Member Author)

LGTM

Deja-vu :)

@adrinjalali (Member) left a comment

LGTM. Could you please open separate issues for the remaining work? (Some refactoring and some documentation concerns still seem to be open.)

@adrinjalali adrinjalali enabled auto-merge (squash) May 3, 2024 16:11
@adrinjalali adrinjalali merged commit 1e49c34 into scikit-learn:main May 3, 2024
28 checks passed
@jeremiedbb (Member)

🎉

@glemaitre (Member Author)

The issue about refactoring is already open here: #28941.

For the documentation, I'll open an issue to see how to articulate the constrained part. The comment raised by @amueller is no longer meaningful since we don't really allow choosing a point on the PR or ROC curve in a straightforward manner.

I'll also revive #17930

@lorentzenchr (Member)

🚀 @glemaitre 🎉
Thanks for this great addition that many have been longing for.

In a way, I like the explicit FixedThresholdClassifier. Out of curiosity: is there a path forward for this to end up as a mixin class in every classifier, without the user needing to wrap it in a meta-estimator?

@ogrisel (Member) commented May 6, 2024

Out of curiosity: is there a path forward for this to end up as a mixin class in every classifier, without the user needing to wrap it in a meta-estimator?

Maybe, but that would entail adding at least 2 new constructor parameters to all classifiers in scikit-learn that implement predict_proba or decision_function. That's kind of an invasive API change...


Successfully merging this pull request may close these issues.

Add wrapper class that changes threshold value for predict