Hyperparameter search with threshold-dependent metrics #25321
vitaliset started this conversation in Show and tell
When optimizing hyperparameters, threshold-dependent metrics make sklearn.model_selection.BaseSearchCV-like search methods use the estimator's .predict method instead of .predict_proba. This can be harmful because 0.5 might not be the best threshold, especially in imbalanced learning scenarios: I often see an f1_score of 0, with the threshold probably off by a lot.

What is the most scikit-learnable way to deal with this right now? I have written a post about this and some of the ways I see it, but I would love to hear others' opinions! :D
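To make the failure mode concrete, here is a minimal sketch of one workaround: score against predict_proba at a custom cutoff instead of .predict's implicit 0.5. The dataset, model, and the 0.25 threshold are all illustrative, and make_scorer's needs_proba=True flag is the pre-1.4 spelling (newer releases spell it response_method="predict_proba").

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

def f1_at_threshold(y_true, y_proba, threshold=0.25):
    # Binarize the positive-class probabilities at a custom cutoff
    # instead of relying on .predict's implicit 0.5.
    return f1_score(y_true, (y_proba >= threshold).astype(int))

X, y = make_classification(n_samples=1_000, weights=[0.95], random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1_000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    # needs_proba=True hands the scorer the predict_proba output (the
    # positive-class column for binary problems) rather than hard labels.
    scoring=make_scorer(f1_at_threshold, needs_proba=True),
)
search.fit(X, y)
```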
Also, as far as I can see, sklearn.model_selection.CutoffClassifier (from #16525) will solve this problem once you have a fixed model. But for hyperparameter optimization, can we use this estimator to tune the base_estimator's params? From my understanding of the sklearn API, it would not be possible, because get_params and set_params would not look at the base_estimator. Is there a way to work around this?

I can see this working if something like make_pipeline(estimator, CutoffClassifier) worked, for instance. But I don't think it will, the way it is being designed (it will behave the way sklearn.calibration.CalibratedClassifierCV does).
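For what it's worth, standard scikit-learn meta-estimators do expose the wrapped model's settings: get_params(deep=True) recurses into any constructor parameter that is itself an estimator, so a search can reach it with the double-underscore syntax. Here is a sketch using CalibratedClassifierCV as a stand-in for CutoffClassifier, whose final design may differ; the wrapped-model parameter is called estimator in scikit-learn >= 1.2 (base_estimator in older releases).

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)

# get_params(deep=True) recurses into the wrapped model, so its
# hyperparameters appear under the "estimator__" prefix.
meta = CalibratedClassifierCV(estimator=LogisticRegression(max_iter=1_000))
print("estimator__C" in meta.get_params(deep=True))  # True

# The search can therefore tune the inner model through the wrapper.
search = GridSearchCV(meta, param_grid={"estimator__C": [0.1, 1.0, 10.0]})
search.fit(X, y)
print(search.best_params_)
```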
Reply:

It has been a while since I last looked at this PR, but you should be able to tune the parameters of the base estimator as well if you place the …
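If I have it right that the #16525 proposal is what later shipped as sklearn.model_selection.TunedThresholdClassifierCV in scikit-learn 1.5, then the combination asked about above works out of the box. The sketch below assumes that release: the outer search tunes the inner model's C while each candidate gets its own tuned cutoff.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TunedThresholdClassifierCV

X, y = make_classification(n_samples=1_000, weights=[0.95], random_state=0)

# The wrapper refits the inner model and picks the decision threshold
# that maximizes the given metric on internal validation splits.
model = TunedThresholdClassifierCV(
    LogisticRegression(max_iter=1_000), scoring="f1"
)

# "estimator__C" reaches through the wrapper into LogisticRegression.
search = GridSearchCV(
    model, param_grid={"estimator__C": [0.1, 1.0, 10.0]}, scoring="f1"
)
search.fit(X, y)
```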