
WIP sklearn compatible ThresholdOptimizer #282

Open · wants to merge 8 commits into main

Conversation

adrinjalali (Member)

This PR moves ThresholdOptimizer towards being scikit-learn compatible.

I'm still working on fixing the tests and adding sklearn's common tests for the estimator.

The changes are substantial, but I hope none of them are controversial.

p.s. the line limit is 80 here, since longer ones don't easily fit on my screen :D

@romanlutz added this to the Integration with scikit-learn milestone Feb 3, 2020
@romanlutz added the ease-of-use and enhancement labels Feb 3, 2020

def __init__(self, *, estimator=None, constraints=DEMOGRAPHIC_PARITY,
             grid_size=1000, flip=True, plot=False, random_state=None):

Member

We have lost unconstrained_predictor and all of the related logic. Was this intentional? The idea is that if this argument is present, we should use it in fit to initialize self._estimator (without calling self.unconstrained_predictor.fit(...)).

Member

@adrinjalali can elaborate, but from what I understand scikit-learn doesn't take already fitted models; rather, it fits them itself. We can still have another version where we take an already fitted one, but it wouldn't be scikit-learn compatible. Is that about accurate?

Member Author

Yes. GridSearchCV clones the estimators before fitting them, and that clears all the state they have. If we want to allow the user to re-fit using a pre-fitted estimator, we can implement a warm_start strategy for the estimator to reuse the previously fitted one. But by default all of these should be fit from scratch, since grid search is fitting them on a new part of the data each time anyway.
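
For readers less familiar with the cloning behavior described here, a minimal sketch (not part of this PR) of why a pre-fitted estimator does not survive a grid search — `clone` returns an unfitted copy with the same hyper-parameters, so any learned state is discarded:

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
fitted = LogisticRegression().fit(X, y)   # has learned state, e.g. coef_
fresh = clone(fitted)                     # same hyper-parameters, no fitted state

print(hasattr(fitted, "coef_"))  # True
print(hasattr(fresh, "coef_"))   # False -- GridSearchCV refits from scratch
```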

Member

That sounds reasonable. Would be sad to lose the existing functionality if we're planning to add it back in anyway... any chance we can add that warm start in this PR, or is that a major change?

Member Author

added warm_start

Member

Sorry for the delay on this, but I think that warm_start is confusing in this context.

Just to state the obvious: the high-level semantics of ThresholdOptimizer (after it is fitted) are really those of a two-stage pipeline. The first step is implemented by self._estimator.predict(). It takes as input X and returns y. The returned y is then transformed into y' (inside self.predict()).

We would like to enable two styles of fitting this pipeline: (1) fitting both steps from the provided data; or (2) fitting only the second step from the provided data (and using the provided unconstrained_predictor as the first step). Now if I think about a warm start, I would expect re-fitting a previously fitted pipeline, using the previous pipeline as a warm start. This to me is quite different from (2), which is just fitting the second stage of the pipeline (from scratch) while keeping the first stage of the pipeline untouched.

I'm not quite sure I understand the issue with the interaction between the cloning behavior and the unconstrained_predictor argument.

Member

To help resolve this, I suggest that we also implement the pattern from CalibratedClassifierCV.

That would mean using prefit=False instead of warm_start=False.
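
For concreteness, a rough sketch of that CalibratedClassifierCV-style pattern; the class name, the prefit parameter, and the elided second stage are illustrative assumptions, not the PR's actual code:

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone

class ThresholdOptimizerSketch(ClassifierMixin, BaseEstimator):
    """Illustration of a prefit-style two-stage fit."""

    def __init__(self, *, estimator=None, prefit=False):
        self.estimator = estimator
        self.prefit = prefit

    def fit(self, X, y, *, sensitive_features):
        if self.prefit:
            # stage one: reuse the estimator the user has already fitted
            self.estimator_ = self.estimator
        else:
            # stage one: fit a fresh clone on the provided data
            self.estimator_ = clone(self.estimator).fit(X, y)
        # stage two would fit the thresholding rule on self.estimator_'s scores
        return self
```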

Member Author

Sorry, it took me a while to concentrate enough to understand what you mean.

I understand that warm_start is different from what we have here. I'm just not sure why it's really needed; in a typical pipeline we don't have pre-fitted estimators to feed to meta-estimators. Let me open a separate issue to talk about it there.

@romanlutz (Member)

Should there be a set of test cases to ensure we remain sklearn compatible?

@adrinjalali (Member Author)

I'll add the sklearn tests once the existing tests pass.

I'm finding it difficult to make them pass since they're a bit obscure to me, but getting there.

One question: the output of predict_proba is of shape (n_samples, n_classes). How should we use them?

@romanlutz (Member)

> I'll add the sklearn tests once the existing tests pass.
>
> I'm finding it difficult to make them pass since they're a bit obscure to me, but getting there.
>
> One question: the output of predict_proba is of shape (n_samples, n_classes). How should we use them?

I don't think I understand your question properly. predict_proba has the probabilities: for each sample, the first entry is the probability of class 0, the second entry the probability of class 1, and so on. I think most of our work is binary classification right now, so n_classes = 2.

Please don't hesitate to ask if tests seem weird. They need to be made less obscure or explained with a little comment if they are a huge obstacle.
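
As a small illustration of the shape in question (not from the PR), for a binary classifier the positive-class score is just the second column of predict_proba's output:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5, random_state=0)
proba = LogisticRegression().fit(X, y).predict_proba(X)

print(proba.shape)              # (5, 2): one column per class
positive_scores = proba[:, 1]   # probability of the positive class (label 1)
```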

@@ -292,7 +293,7 @@ class ThresholdOptimizer(ClassifierMixin, BaseEstimator):

 :param estimator: An untrained estimator that will be trained, and
     subsequently its output will be postprocessed
-:type estimator: An untrained estimator
+:type estimator: An untrained estimator, default=SGDClassifier(loss='log')
Member

Do we want to provide a default? If so, why that one? Just generally wondering, but I don't have a strong opinion

Member

I thought we couldn't, since this was mutually exclusive with unconstrained_predictor?

Member

Hmmm… this would be related to the warm_start below.

@@ -7,10 +7,8 @@
learn how to adjust the predictor's output from the training data.
"""

from ._postprocessing import PostProcessing # noqa: F401
Member

this probably warrants a description in the CHANGES.md as well :-)

EQUALIZED_ODDS: _threshold_optimization_equalized_odds}


def _get_soft_predictions(estimator, X):
Member

Right now we're using predict, not predict_proba, for this. However, when the scores are binary this method doesn't work as well as with real-valued scores, so for examples (such as one of the notebooks) we've actually done this manually by wrapping the estimator and mapping predict_proba to predict. I'm not familiar with decision_function, but would we just use predict if the others aren't available?

Member

Also adding @MiroDudik for awareness

Member

Interesting. As @romanlutz says, previously we were only using predict as the first stage of the pipeline before applying thresholding. I think it's reasonable to consider the back-off sequence predict_proba -> decision_function -> predict, where in predict_proba we only look at the predicted probability of the positive class.

So maybe two fixes to the below:

  • for predict_proba, just extract the second column
  • after else, don't throw an error, but instead return predict

Member Author

predict has very different semantics than predict_proba or decision_function. The probabilities predict_proba gives usually come from the raw values that the decision_function gives. You can think of the decision_function values as the distances from the hyperplanes in the case of an SVM.

predict always gives categorical results, which are not smooth. Either the method (ThresholdOptimizer) knows how to deal with that or it doesn't. If it doesn't, I don't think falling back to predict is a good idea. The predict function of a classifier shouldn't be giving soft outputs.

Alternatively, we could ask the user to either pass a classifier that provides predict_proba or decision_function, or a regressor fit on the binary classification problem which gives soft outputs as its predictions. One of our classifiers in sklearn uses a regressor under the hood.

Member

I'm lacking relevant sklearn knowledge here: does every classifier have either predict_proba or decision_function, or are there some that don't?
I don't understand what you mean by regression on a binary classifier. What would that be called? In some examples we use predict_proba from LogisticRegression and take the second entry (assuming binary classification), which is the probability of label 1.

Member

ThresholdOptimizer has no problem handling hard 0/1 predictions and that's in fact what the original paper was assuming. That's also the reason why we used predict in our original implementation. However, in those cases when the classifier is also providing a soft prediction (in the form of decision_function or predict_proba), it is obviously better to use that.

So I definitely think that we should allow users to run ThresholdOptimizer based on predict of the underlying classifier. What I wonder about is whether we should allow folks to force ThresholdOptimizer to use predict even in those cases when the classifier also implements the soft versions.

PROS: it allows you to run the "textbook version" of the algorithm
CONS: it's essentially guaranteed to lead to inferior performance

Member Author

> I don't understand what you mean by regression on a binary classifier.

One example is that GradientBoostingClassifier uses a DecisionTreeRegressor at each step instead of a DecisionTreeClassifier. In some cases people fit a regressor on the binary classification problem with 0/1 as the target, and use the soft predicted output as a proxy for probability.

We can have a predict_method parameter that users can set to whatever they want, with auto as the default.
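
A minimal sketch (not from the PR) of the regressor-on-binary-targets idea mentioned above, with the regressor's real-valued output treated as a proxy for the probability of the positive class; the choice of GradientBoostingRegressor is just an example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_classification(random_state=0)   # y contains 0/1 labels
reg = GradientBoostingRegressor().fit(X, y)  # regression on the 0/1 targets
soft_scores = reg.predict(X)                 # real-valued, used as a proxy for P(y=1)
```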

Member

I like the final suggestion! Let's go with a predict_method parameter defaulting to auto. For auto, we can try predict_proba -> decision_function -> predict (I also don't mind dropping the final predict).

The only weirdness is that whenever predict_method is predict_proba, we will need to extract the second column of the output.
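
Putting the agreed behavior together, a sketch of what the back-off could look like; the name echoes the _get_soft_predictions helper in the diff above, but the body here is an illustration of the predict_method="auto" idea, not the PR's implementation:

```python
def _get_soft_predictions(estimator, X, predict_method="auto"):
    """Return one score per sample, preferring soft predictions when available."""
    if predict_method == "auto":
        if hasattr(estimator, "predict_proba"):
            predict_method = "predict_proba"
        elif hasattr(estimator, "decision_function"):
            predict_method = "decision_function"
        else:
            predict_method = "predict"
    if predict_method == "predict_proba":
        # keep only the probability of the positive class (second column)
        return estimator.predict_proba(X)[:, 1]
    return getattr(estimator, predict_method)(X)
```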

"""

def __init__(self, *, estimator=SGDClassifier(loss='log'),
             constraints=DEMOGRAPHIC_PARITY,
Member

If we're using SGDClassifier as a default, shall we run it with random_state=0 (or something of the sort) to make it reproducible? Alternatively, I don't think we necessarily need to have a default here.

Member Author

I'm not very set on the SGDClassifier, really no hard feelings there. We could instead use HistGradientBoostingClassifier which is a much more modern one and people seem to like it.

Regarding the random state: since the user can set the estimator parameter, they can pass a random state to the estimator directly. It's the same with GridSearchCV and its underlying estimator, for example. If the user needs a deterministic result, they should pass a random state to both.
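
For example (assuming the constructor signature proposed in this PR), a user wanting a deterministic first stage would pass the random state on the nested estimator:

```python
from sklearn.linear_model import SGDClassifier
from fairlearn.postprocessing import ThresholdOptimizer

# set the seed on the nested estimator for a deterministic first stage
to = ThresholdOptimizer(estimator=SGDClassifier(loss='log', random_state=0))
```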

Member

Should we have a default at all? Is this commonly done?

Member

I'm in favor of dropping the default value for the estimator. It might even make sense to drop a default on constraints.

Member Author

sklearn's common tests assume all estimators can be constructed using default values, except for the estimator parameter of a meta-estimator. So we could leave this as required and follow the [private] protocol to make the common tests understand how to construct it. They'll use a Ridge if the meta-estimator is a regressor and a LinearDiscriminantAnalysis if it's a classifier.

Alternatively, the class-level tests can be disabled/skipped so that only instances are tested.
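
A sketch of what the private protocol likely refers to — sklearn's _required_parameters class attribute, which its common-test machinery uses to construct meta-estimators with a stand-in base estimator; this attribution is an assumption since the comment above does not name it, and the class name here is illustrative:

```python
from sklearn.base import BaseEstimator, ClassifierMixin

class ThresholdOptimizerSketch(ClassifierMixin, BaseEstimator):
    # Tells sklearn's common tests that `estimator` has no default, so the
    # checks construct the class with a simple stand-in estimator instead of
    # requiring a no-argument constructor.
    _required_parameters = ["estimator"]

    def __init__(self, *, estimator):
        self.estimator = estimator
```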

Member

I think that the right approach is to go the meta-estimator route; this is something we will also want to use in our reductions, so if we implement it here, we can propagate it there.

On the API level, I think that ThresholdOptimizer should be similar to CalibratedClassifierCV.

Re. defaults: in CalibratedClassifierCV, base_estimator defaults to None, which becomes LinearSVC(random_state=0) during fit. I'm happy to go with whatever plays better with sklearn; if we need to have some default, I think that any linear model would work (LinearSVC, LinearDiscriminantAnalysis, or SGDClassifier). I just thought it would be nicer if the default were deterministic.

@romanlutz (Member)

@adrinjalali this is closable because you did this in multiple smaller PRs, right?

@adrinjalali (Member Author)

#342 should close this one :)

@romanlutz (Member)

> #342 should close this one :)

You probably mean the issue. I meant this PR :-) Are you still using this or is the PR closable?

@romanlutz removed this from the scikit-learn compatibility milestone Jun 22, 2020
Base automatically changed from master to main February 6, 2021 06:05
Labels: enhancement
Projects: fairlearn (Awaiting triage)
4 participants