WIP sklearn compatible ThresholdOptimizer #282
Conversation
Signed-off-by: adrinjalali <adrin.jalali@gmail.com>
def __init__(self, *, estimator=None, constraints=DEMOGRAPHIC_PARITY,
             grid_size=1000, flip=True, plot=False, random_state=None):
We have lost `unconstrained_predictor` and all of the related logic. Was this intentional? The idea is that if this argument is present, we should use it in `fit` to initialize `self._estimator` (without calling `self.unconstrained_predictor.fit(...)`).
@adrinjalali can elaborate, but from what I understand scikit-learn doesn't take already-fitted models; it fits them itself. We can still have another version that takes an already-fitted one, but it wouldn't be scikit-learn compatible. Is that about accurate?
Yes. `GridSearchCV` clones the estimators before fitting them, and that clears all the state they have. If we want to allow the user to re-fit using a pre-fitted estimator, we can implement a `warm_start` strategy for the estimator to use the previously fitted one. But by default all of these should be fit from scratch, since grid search fits them on a new part of the data each time anyway.
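As a quick illustration of the point above, sklearn's `clone` copies only hyper-parameters and discards all fitted state:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

est = LogisticRegression().fit(X, y)
assert hasattr(est, "coef_")        # fitted attribute present

fresh = clone(est)                  # copies hyper-parameters only
assert not hasattr(fresh, "coef_")  # fitted state is gone
```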
That sounds reasonable. Would be sad to lose the existing functionality if we're planning to add it back in anyway... any chance we can add that warm start in this PR, or is that a major change?
added warm_start
Sorry for the delay on this, but I think that `warm_start` is confusing in this context.
Just to state the obvious: the high-level semantics of `ThresholdOptimizer` (after it is fitted) is really that of a two-stage pipeline. The first step is implemented by `self._estimator.predict()`. It takes as input `X` and returns `y`. The returned `y` is then transformed into `y'` (inside `self.predict()`).
We would like to enable two styles of fitting this pipeline: (1) fitting both steps from the provided data; or (2) fitting only the second step from the provided data (and using the provided `unconstrained_predictor` as the first step). Now if I think about a warm start, I would expect re-fitting a previously fitted pipeline, using the previous pipeline as a warm start. This to me is quite different from (2), which is just fitting the second stage of the pipeline (from scratch) while keeping the first stage of the pipeline untouched.
I'm not quite sure I understand the issue with the interaction between the cloning behavior and the `unconstrained_predictor` argument.
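The two fitting styles described above could be sketched roughly like this (names such as `TwoStageSketch` and `prefit` are illustrative assumptions, not the actual fairlearn API; the second stage is a trivial placeholder):

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone


class TwoStageSketch(ClassifierMixin, BaseEstimator):
    """Illustrative only: shows fit styles (1) and (2) from the discussion."""

    def __init__(self, estimator=None, prefit=False):
        self.estimator = estimator
        self.prefit = prefit

    def fit(self, X, y):
        if self.prefit:
            # style (2): keep the user-provided, already-fitted first stage untouched
            self._estimator = self.estimator
        else:
            # style (1): clone so the user's object is left untouched,
            # then fit the first stage from the provided data
            self._estimator = clone(self.estimator).fit(X, y)
        # the second stage (threshold selection) is always fit from scratch;
        # here it is a trivial placeholder
        self._threshold = 0.5
        return self

    def predict(self, X):
        return self._estimator.predict(X)
```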
maybe you're referring to this?
https://scikit-learn.org/stable/glossary.html#term-metaestimators
To help resolve this, I suggest that we also implement the pattern from `CalibratedClassifierCV` (link). That would mean using `prefit=False` instead of `warm_start=False`.
Sorry, it took me a while to concentrate enough to understand what you mean.
I understand that `warm_start` is different from what we have here. I'm just not sure why it's really needed. In a typical pipeline, we don't usually have pre-fitted estimators to feed to meta-estimators. Let me open a separate issue to talk about it there.
Should there be a set of test cases to ensure we remain sklearn compatible?
I'll add the sklearn tests once the existing tests pass. I'm finding it difficult to make them pass since they're a bit obscure to me, but getting there. One question: the output of …
I don't think I understand your question properly. Please don't hesitate to ask if tests seem weird. They need to be made less obscure or explained with a little comment if they are a huge obstacle.
@@ -292,7 +293,7 @@ class ThresholdOptimizer(ClassifierMixin, BaseEstimator):
     :param estimator: An untrained estimator that will be trained, and
         subsequently its output will be postprocessed
-    :type estimator: An untrained estimator
+    :type estimator: An untrained estimator, default=SGDClassifier(loss='log')
Do we want to provide a default? If so, why that one? Just generally wondering, but I don't have a strong opinion
I thought we couldn't, since this was mutually exclusive with `unconstrained_predictor`?
Hmmmm… this would be related to the `warm_start` below.
@@ -7,10 +7,8 @@
 learn how to adjust the predictor's output from the training data.
 """

-from ._postprocessing import PostProcessing  # noqa: F401
this probably warrants a description in the CHANGES.md as well :-)
    EQUALIZED_ODDS: _threshold_optimization_equalized_odds}


def _get_soft_predictions(estimator, X):
Right now we're using `predict`, not `predict_proba`, for this. However, when the scores are binary this method doesn't work as well as when using real-valued scores, so for examples (such as one of the notebooks) we've actually done this manually by wrapping the estimator and mapping `predict_proba` to `predict`. I'm not familiar with `decision_function`, but would we just use `predict` if the others aren't available?
Also adding @MiroDudik for awareness
Interesting. As @romanlutz says, previously we were only using `predict` as the first stage of the pipeline before applying thresholding. I think it's reasonable to consider the back-off sequence `predict_proba` -> `decision_function` -> `predict`, where for `predict_proba` we only look at the predicted probability of the positive class.
So maybe two fixes to the below:
- for `predict_proba`, just extract the second column
- after `else`, don't throw an error, but instead return `predict`
`predict` has very different semantics than `predict_proba` or `decision_function`. The probabilities `predict_proba` gives usually come from the raw values that `decision_function` gives. You can think of the `decision_function` values as the distances from the hyperplanes in the case of an SVM.
`predict` always gives categorical results, which are not smooth. Either the method (`ThresholdOptimizer`) knows how to deal with that or it doesn't. If it doesn't, I don't think falling back to `predict` is a good idea. The `predict` function of a classifier shouldn't be giving soft outputs.
Alternatively, we could ask the user to either pass a classifier with a `predict_proba` or a `decision_function` provided, or a regression on a binary classifier which gives soft outputs as the predictions. One of our classifiers uses a regressor under the hood in sklearn.
I'm lacking relevant sklearn knowledge here: does every classifier have either `predict_proba` or `decision_function`, or are there any that don't?
I don't understand what you mean by regression on a binary classifier. How would that be called? In some examples we use `predict_proba` from `LogisticRegression` and take the second entry (assuming binary classification), which is the probability for label 1.
`ThresholdOptimizer` has no problem handling hard 0/1 predictions, and that's in fact what the original paper was assuming. That's also the reason why we used `predict` in our original implementation. However, in those cases when the classifier also provides a soft prediction (in the form of `decision_function` or `predict_proba`), it is obviously better to use that.
So I definitely think that we should allow users to run `ThresholdOptimizer` based on `predict` of the underlying classifier. What I wonder about is whether we should allow folks to force `ThresholdOptimizer` to use `predict` even in those cases when the classifier also implements the soft versions.
PROS: it allows you to run the "textbook version" of the algorithm
CONS: it's essentially guaranteed to lead to inferior performance
> I don't understand what you mean by regression on a binary classifier.

One example is that `GradientBoostingClassifier` uses a `DecisionTreeRegressor` at each step instead of a `DecisionTreeClassifier`. In some cases people use a regressor on the binary classification problem with 0/1 as the target, and they use the soft predicted output as a proxy for probability.
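A toy illustration of that regressor-on-a-0/1-target idea (a hypothetical example, not code from this PR; `Ridge` is chosen arbitrarily):

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])       # binary target fed to a regressor

reg = Ridge().fit(X, y)
scores = reg.predict(X)          # continuous outputs, not hard 0/1 labels
proba_proxy = np.clip(scores, 0.0, 1.0)  # crude probability proxy
```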
We can have a `predict_method` parameter for users to set to whatever they want, and have it default to `auto`.
I like the final suggestion! Let's go with a `predict_method` parameter defaulted to `auto`. For `auto`, we can try `predict_proba` -> `decision_function` -> `predict` (also I don't mind dropping the final `predict`).
The only weirdness is that whenever `predict_method` is `predict_proba`, we will need to extract the second column of the output.
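The agreed-upon back-off could look roughly like this sketch (the name and exact signature are assumptions, not the final fairlearn implementation):

```python
import numpy as np


def get_soft_predictions(estimator, X, predict_method="auto"):
    """Sketch of the proposed back-off predict_proba -> decision_function
    -> predict; illustrative only, not the actual fairlearn code."""
    if predict_method == "auto":
        if hasattr(estimator, "predict_proba"):
            predict_method = "predict_proba"
        elif hasattr(estimator, "decision_function"):
            predict_method = "decision_function"
        else:
            predict_method = "predict"
    output = getattr(estimator, predict_method)(X)
    if predict_method == "predict_proba":
        # binary classification: keep only the positive-class column
        output = output[:, 1]
    return np.asarray(output)
```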
    """

    def __init__(self, *, estimator=SGDClassifier(loss='log'),
                 constraints=DEMOGRAPHIC_PARITY,
If we're using `SGDClassifier` as a default, shall we run it with `random_state=0` (or something of the sort) to make it reproducible? Alternatively, I don't think we necessarily need to have a default here.
I'm not very set on the `SGDClassifier`, really no hard feelings there. We could instead use `HistGradientBoostingClassifier`, which is a much more modern one, and people seem to like it.
Regarding the random state: since the user can set the `estimator` parameter, they can pass a random state to it themselves. It's the same with `GridSearchCV` and its underlying estimator, for example. If the user needs a deterministic result, they should pass an rng to both.
Should we have a default at all? Is this commonly done?
I'm in favor of dropping the default value for the estimator. It might even make sense to drop a default on constraints.
sklearn's common tests assume all estimators can be constructed using default values, except for the `estimator` of a meta-estimator. So we could leave this as required, and follow the [private] protocol to make the common tests understand how it should be constructed. They'll use a `Ridge` if the meta-estimator is a regressor and a `LinearDiscriminantAnalysis` if it's a classifier.
Alternatively, the tests on the class can be disabled/skipped and only instances tested.
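For reference, the "only test instances" option works because sklearn's common checks accept an already-constructed instance. `LogisticRegression` stands in below purely so the snippet runs; for this PR it would presumably be something like `ThresholdOptimizer(estimator=...)`:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.utils.estimator_checks import check_estimator

# check_estimator takes an instance, so required constructor arguments
# (such as a nested estimator) are not an obstacle. It raises if any
# of the common compatibility checks fail.
check_estimator(LogisticRegression())
```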
I think that the right approach is to go the meta-estimator route; this is something we will also want to use in our reductions, so if we implement it here, we can propagate it there.
On the API level, I think that `ThresholdOptimizer` should be similar to `CalibratedClassifierCV` (link).
Re. defaults: in `CalibratedClassifierCV`, the `base_estimator` is defaulted to `None`, which becomes `LinearSVC(random_state=0)` during `fit`. I'm happy to go with whatever plays better with sklearn; if we need to have some default, I think that any linear model would work (`LinearSVC`, `LinearDiscriminantAnalysis`, or `SGDClassifier`), I just thought it would be nicer if the default was deterministic.
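The resolve-the-default-inside-`fit` pattern mentioned above could be sketched like this (class and attribute names are illustrative assumptions, not fairlearn's actual API; `LinearSVC(random_state=0)` matches the default cited for `CalibratedClassifierCV`):

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.svm import LinearSVC


class DefaultInFitSketch(ClassifierMixin, BaseEstimator):
    """Sketch of the CalibratedClassifierCV-style default pattern."""

    def __init__(self, estimator=None):
        self.estimator = estimator  # stored unchanged, per sklearn convention

    def fit(self, X, y):
        base = self.estimator
        if base is None:
            # deterministic default resolved at fit time, not in __init__
            base = LinearSVC(random_state=0)
        self._estimator = clone(base).fit(X, y)
        return self

    def predict(self, X):
        return self._estimator.predict(X)
```

This keeps `__init__` side-effect free (it only stores parameters), which is what sklearn's `get_params`/`set_params` machinery expects.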
@adrinjalali this is closable because you did this in multiple smaller PRs, right?
#342 should close this one :)
You probably mean the issue. I meant this PR :-) Are you still using this or is the PR closable?
This PR moves the `ThresholdOptimizer` towards a sklearn-compatible one. Still working on fixing the tests, and adding sklearn's common tests for the estimator.
The changes are substantial, but I hope none of them are controversial.
p.s. the line limit is 80 here, since longer ones don't easily fit on my screen :D