FEA add TunedThresholdClassifier meta-estimator to post-tune the cut-off threshold #26120
Conversation
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Had a look into the API
if curve_scorer is roc_curve or (
    isinstance(curve_scorer, _BaseScorer) and curve_scorer._score_func is roc_curve
):
    fpr, tpr, potential_thresholds = curve_scorer(
        classifier, X_val, y_val, **score_params_val
    )
    # For fpr=0/tpr=0, the threshold is set to `np.inf`. We need to remove it.
    fpr, tpr, potential_thresholds = fpr[1:], tpr[1:], potential_thresholds[1:]
    # thresholds are in decreasing order
    return potential_thresholds[::-1], ((1 - fpr)[::-1], tpr[::-1])
elif curve_scorer is precision_recall_curve or (
    isinstance(curve_scorer, _BaseScorer)
    and curve_scorer._score_func is precision_recall_curve
):
    precision, recall, potential_thresholds = curve_scorer(
        classifier, X_val, y_val, **score_params_val
    )
    # thresholds are in increasing order
    # the last element of the precision and recall is not associated with any
    # threshold and should be discarded
    return potential_thresholds, (precision[:-1], recall[:-1])
I would rather have these in two separate CurveScorers and not have this specific code here. It looks odd to have this very specific treatment of the score functions here; this method shouldn't be caring about it.
I would rather have these in two separate CurveScorers and not have this specific code here.
I'm not sure I understand exactly what you want. Here, we call make_scorer on these parametric curves (not _CurveScorer), and since the API of the two functions is not consistent, we need this specific treatment.
Could you provide a small pseudo-code of your vision? Maybe I could understand better.
What I mean is to have a new CurveScorerMixin and have roc_auc and precision_recall_curve each be one such scorer as well.
As I mentioned earlier, I also think that there's room for improvement in the Scorers design to simplify this, but it is the kind of internal detail that I think can be sorted out in a follow-up PR because it doesn't impact the public API.
Indeed, we had this exact discussion with @jeremiedbb; as he mentioned, we thought about not addressing it in this PR.
It is the kind of technical debt that we tend not to pay immediately after; that's why I'm worried. Could you please open an issue for this?
+1 for creating an issue after merging, because it's hard to point to code that does not exist yet 😃
We can point to this PR
I opened the following issue to be edited in the future: #28941
objective_metric : {"max_tpr_at_tnr_constraint", "max_tnr_at_tpr_constraint", \
        "max_precision_at_recall_constraint", "max_recall_at_precision_constraint"} \
        , str, dict or callable, default="balanced_accuracy"
    The objective metric to be optimized. Can be one of:

    * a string associated to a scoring function for binary classification
      (see model evaluation documentation);
    * a scorer callable object created with :func:`~sklearn.metrics.make_scorer`;
    * `"max_tnr_at_tpr_constraint"`: find the decision threshold for a true
      positive ratio (TPR) of `constraint_value`;
    * `"max_tpr_at_tnr_constraint"`: find the decision threshold for a true
      negative ratio (TNR) of `constraint_value`;
    * `"max_precision_at_recall_constraint"`: find the decision threshold for a
      recall of `constraint_value`;
    * `"max_recall_at_precision_constraint"`: find the decision threshold for a
      precision of `constraint_value`.

constraint_value : float, default=None
    The value associated with the `objective_metric` metric for which we
    want to find the decision threshold when `objective_metric` is either
    `"max_tnr_at_tpr_constraint"`, `"max_tpr_at_tnr_constraint"`,
    `"max_precision_at_recall_constraint"`, or
    `"max_recall_at_precision_constraint"`.
This is in a sense reinventing what we already have in GridSearch as a callable refit. I find the API here quite odd, and it's not making the implementation here clean either.
I think it would make more sense to accept a callable here that takes the cv results and gives you what needs to be used. Then we have 4 pre-defined callables which users (and we internally) can use, and then we might consider allowing a shortcut for using those callables.
I think that it makes the code cleaner but not necessarily more accessible for the end user. Having to provide a string and a single value is a low entry bar. Providing a callable (even a pre-defined one) means that people would need to understand what this thing is doing, and the discoverability is also more complex.
However, this again looks like the dilemma between the score provided as a string or as a scorer. So maybe the current API will bite us in the future. At least here, we should not need to extend the options.
@jeremiedbb @ogrisel what are your thoughts on the API? Do you see an alternative to what is discussed here?
I had a similar concern, assuming that users will need more flexibility, e.g. constraints on different metrics. I also don't find the current 4 strings ideal, but I understand that they probably represent 90%+ of the use cases, so I think it's acceptable as a start. We can add the option for a callable later if we receive requests.
This is what I think users should do:
clf = TunedThresholdClassifierCV(
    estimator=LogisticRegression(),
    objective=MaxTprAtTnrConstraint(tnr=0.3),
    ...
)
And we expose those callable classes to users, and document clearly how users can write one of their own.
This also makes the implementation here in this class much cleaner.
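For context, a minimal sketch of what one of these callable objective classes could look like. `MaxTprAtTnrConstraint` is hypothetical (it does not exist in scikit-learn); the call signature taking pre-computed per-threshold statistics is an assumption about how the proposed API might pass data to the objective:

```python
import numpy as np


class MaxTprAtTnrConstraint:
    """Hypothetical objective: maximize TPR subject to TNR >= tnr.

    Sketch only; not part of scikit-learn's API.
    """

    def __init__(self, tnr):
        self.tnr = tnr

    def __call__(self, thresholds, tnrs, tprs):
        thresholds, tnrs, tprs = map(np.asarray, (thresholds, tnrs, tprs))
        feasible = tnrs >= self.tnr  # thresholds meeting the TNR constraint
        if not feasible.any():
            raise ValueError(f"No threshold reaches a TNR of {self.tnr}.")
        # among feasible thresholds, return the one with the highest TPR
        best = np.flatnonzero(feasible)[np.argmax(tprs[feasible])]
        return thresholds[best]
```

The constraint parameter then lives on the objective object itself rather than as a separate estimator parameter, which is the crux of the discussion below.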
I also think it's worth special-casing the precision/recall and TNR/TPR tradeoffs with the string / float value combo as:
- it makes the very common cases easy to implement without having to write custom filtering logic,
- the intent of an expression such as

TunedThresholdClassifier(clf, objective_metric="max_precision_at_recall_constraint", constraint_value=0.3)

is more easily understandable in code than something like:

def max_precision_at_recall(results_at_thresholds, recall=0.3):
    # complex user written code here
    ...
    return selected_threshold_idx

TunedThresholdClassifier(
    clf, objective_metric=["precision", "recall"], constraint=max_precision_at_recall
)
So I am +1 for keeping it this way in this first iteration and explore adding more flexibility to custom filtering logic in a follow-up PR.
I personally don't mind either way, but I consider both of them better than what we have now. Historically, we haven't had a good experience passing a metric name and then metric parameters to the estimators. We've had things like Estimator(..., metric=..., p=..., metric_params=...), and I prefer the metric passed in to carry all that information, since that's where it belongs.
Well, I'd prefer to have metrics as objects as well, but every time the discussion came up, there was strong opposition against it.
The reason I'm not happy with going forward with this as is, is that I don't want us to release and then immediately deprecate constraint_value, which I think doesn't belong to the estimator but to the constraint.
We have many estimators with coupled parameters. On the other hand, we almost never have parameters with a complex type, so in fact this would make it a special case in our API. I also think a motivation for keeping separate parameters is to make parameter search easier.
I don't think we'll need to deprecate constraint_value right away. We can keep the combination of objective_metric=<str> + constraint_value=<value>, which is easy to use and covers most use cases, and add support for objective_metric=callable for advanced use cases.
Be aware that we have exactly the same issue with pos_label (@glemaitre shooting himself in the foot).
Some more feedback.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
        "max_recall_at_precision_constraint",
    }
),
StrOptions(set(get_scorer_names())),
We should limit to classification metrics
Uhm, not sure how we should handle that. We don't have any public mechanism available (it's only present in the tests). I recall having opened such a PR at some point: #17930
let's figure this out separately then :)
objective_metric : {"max_tpr_at_tnr_constraint", "max_tnr_at_tpr_constraint", \
        "max_precision_at_recall_constraint", "max_recall_at_precision_constraint"} \
        , str, dict or callable, default="balanced_accuracy"
objective_metric : str, dict or callable, default="balanced_accuracy"
Shouldn't this be called scoring, the same as in *SearchCV? We're optimizing using a cv object, optimizing for one or more metrics, exactly like *SearchCV; it would be nice to have a consistent name for this.
I agree that it makes the API more consistent
I was trying to find where we had this discussion in the past, but I could not find it.
I think the fact that we had constrained metrics made it different enough from the "scoring" parameter that we decided to change the name.
I'm pretty optimistic that we can find an API which we can share with *SearchCV, which is both user-friendly and capable of handling the use cases we want. And as @jeremiedbb says in #26120 (review), we're not far from nicely supporting it. So I'd say we can safely rename the parameter.
Now I have a question: do we need the pos_label and response_method parameters? Those two can be delegated to make_scorer: response_method is a parameter of this function, while pos_label is a parameter of the function provided to make_scorer and forwarded to it.
The downside comes down to "how many people are using make_scorer?"; the people who don't are fine with the defaults.
Currently, if I want to tune a LogisticRegression with a grid search and recall as scoring, either I pass scoring="recall" and rely on the defaults, or I pass a scorer made out of recall_score if I want to change the pos label. I think it's fine to mimic this behavior.
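To illustrate the existing behavior being mimicked, here is the standard way to change the positive label when grid-searching on recall; everything here is the current scikit-learn API (the data and parameter grid are made up for the example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

X, y = make_classification(random_state=0)

# scoring="recall" scores the class pos_label=1 by default; to optimize
# recall of the other class, build the scorer explicitly. Extra kwargs to
# make_scorer are forwarded to the score function.
recall_class_0 = make_scorer(recall_score, pos_label=0)
search = GridSearchCV(
    LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, scoring=recall_class_0
).fit(X, y)
```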
But they will be required for the FixedThresholdClassifier because we don't have any metric there.
For that one, I think we indeed need to keep them, because if a user passes a threshold of, say, 0.3, we need to know whether it's for a probability or for a decision function.
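A quick sketch of why the response method matters (plain scikit-learn calls, thresholding done by hand for illustration; this is not the meta-estimator's implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
clf = LogisticRegression().fit(X, y)

# A threshold of 0.3 on predict_proba (scale [0, 1]) is not the same cut
# as a threshold of 0.3 on decision_function (log-odds scale, unbounded):
pred_from_proba = (clf.predict_proba(X)[:, 1] >= 0.3).astype(int)
pred_from_decision = (clf.decision_function(X) >= 0.3).astype(int)
# proba >= 0.3 corresponds to decision_function >= log(0.3 / 0.7) ≈ -0.85,
# so the proba-based cut is strictly more permissive here.
```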
Actually, the response_method of the estimator and the response_method for the scorer are not the same, so we need to keep it.
Indeed. response_method is dispatched to _CurveScorer and not to the scorer. It would be weird to ask users to pass make_scorer(balanced_accuracy_score, response_method="predict_proba") when the scorer alone does not make any sense.
I took a quick look and references to constrained metrics seem to have all been removed.
Also, I think the use case of constrained metrics is still doable with the current API. I expect that something like the following works (computing TPR/TNR from the confusion matrix, since scikit-learn has no standalone tpr/tnr functions, and rejecting thresholds whose TNR falls below the constraint):

import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer

def max_tpr_at_tnr_constraint_score(y_true, y_pred, min_tnr):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)  # true positive rate
    tnr = tn / (tn + fp)  # true negative rate
    if tnr < min_tnr:  # reject thresholds that violate the TNR constraint
        return -np.inf
    return tpr

my_scorer = make_scorer(max_tpr_at_tnr_constraint_score, min_tnr=0.5)

I think that this is good for another round of review.
LGTM
Deja-vu :)
LGTM. Could you please open separate issues for the remaining work? (Some refactoring and some doc concerns seem to be open.)
🎉
The issue about refactoring is already open here: #28941. For the documentation, I'll open an issue to see how to articulate the constrained part. The comment raised by @amueller is no longer meaningful since we don't really allow choosing a point on the PR or ROC curve in a straightforward manner. I'll also revive #17930.
🚀 @glemaitre 🎉 In a way, I like the explicit
Maybe, but that would entail adding at least 2 new constructor parameters to all classifiers in scikit-learn that implement
supersedes #16525
closes #16525
closes #8614
closes #10117
supersedes #10117
builds upon #26037
relates to #4813
Summary

We introduce a TunedThresholdClassifier that intends to post-tune the cut-off point used to convert a soft decision of decision_function or predict_proba into a hard decision provided by predict.

Important features to have in mind:

- objective_metric: the objective metric is set to either a metric to be maximized, or a pair of metrics, one to be optimized under the constraint of the other (to find a trade-off). Additionally, we can pass a cost/gain matrix that could be used to optimize a business metric. For this case, we are limited to constant costs/gains. In the future, we can think of costs/gains that depend on the matrix X, but we would need to be able to forward meta-data to the scorer (a good additional use case for SLEP006 @adrinjalali).
- cv and refit: we provide some flexibility to pass a refitted model and a single train/test split. We add limitations and document the caveats with an example.

Points to discuss

- Is the name TunedThresholdClassifier right? Shall we instead have something about "threshold" (e.g. ThresholdTuner)?
- Parameter names: objective_metric, constraint_value, and objective_score. Is the naming fine? An alternative to "objective" might be "utility"?

Further work
I currently implemented a single example that shows the feature in the context of post-tuning the decision threshold.
The current example uses a single train/test split for the figure, and I think it would be nice to have some ROC/precision-recall curves obtained from cross-validation to be complete. However, we need some new features to be implemented first.
I am also planning to analyse the usage of this feature on the problem of calibration on imbalanced classification problems. The feeling on this topic is that resampling strategies involve an implicit tuning of the decision threshold at the cost of a badly calibrated model. It might be better to learn a model on the imbalanced problem directly, make sure that it is well calibrated, and then post-tune the decision threshold for "hard" predictions. In this case, you get the best of both worlds: a calibrated model if the output of predict_proba is important to you, and an optimal hard predictor for your specific utility metric. However, this is going to need some investigation and will be better suited for another PR.