ENH Optimized Preprocessing Algorithm (NIPS 2016) #1340
base: main
Conversation
Thanks @rakesh9177 for submitting such an extensive pull request! I personally don't have the time to do an in-depth review at this time, but hopefully someone else from @fairlearn/fairlearn-maintainers does. Left a bunch of higher-level and nitpicky comments that should be fairly easy to address.
from sklearn.base import BaseEstimator, TransformerMixin


class OptimizedPreprocessor:
I get that this name is based on the AIF360 implementation, but I don't like it because it's not at all descriptive. From a glance at the abstract (I haven't read the full paper), perhaps something like discrimination-reducing preprocessor is possible? Not sure if I like that much more 😆 @fairlearn/fairlearn-maintainers please chime in.
I would not go for something like "discrimination-reducing" as that could imply the pre-processing algorithm maps neatly onto legal notions of discrimination.
The technique aims to identify a randomized mapping that minimizes the difference between the original distribution of X and Y and the distribution of the pre-processed data ("utility"), subject to a demographic parity ratio constraint ("discrimination control") and an expected individual "distortion" constraint ("distortion control"). The last component determines which transformations are penalized (e.g. swapping labels, changing particular features, etc.). So perhaps a name related to demographic parity and individual distortions would make sense?
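To make those three quantities concrete, here is a small numpy sketch on invented numbers (a 2-group, 2-outcome toy example; none of the values come from the paper or the PR):

```python
import numpy as np

# Hypothetical example; all numbers are made up for illustration.
p_orig = np.array([0.6, 0.4])   # original P(Y)
p_new = np.array([0.55, 0.45])  # post-mapping P(Yhat)

# "Utility": distance between the original and transformed
# distributions, here measured as total variation.
utility_gap = 0.5 * np.abs(p_orig - p_new).sum()

# "Discrimination control": demographic parity ratio across groups,
# which the optimization constrains to stay above a threshold.
p_pos_by_group = np.array([0.50, 0.45])  # P(Yhat = 1 | D = d)
dp_ratio = p_pos_by_group.min() / p_pos_by_group.max()

# "Distortion control": expected cost of individual transformations,
# with a toy cost of 1.0 per label flip and a 5% flip probability.
expected_distortion = 1.0 * 0.05
```

The actual method searches for the randomized mapping minimizing the first quantity subject to constraints on the other two; this snippet only evaluates the three numbers for a fixed toy mapping.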
self.optim_options = optim_options
self.verbose = verbose

def fit(self, df, X_features, Y_features, D_features):
This isn't quite in line with our usual API for fit (see any other technique in our repo). I'd prefer keeping it consistent.
In particular, have a look at our other preprocessing algorithm, CorrelationRemover. Although I think we could also simply have something like .fit(X, y, sensitive_features) rather than __init__(sensitive_feature_ids) - does anybody recall why we went for sensitive_feature_ids at initialization in CorrelationRemover? Tagging @fairlearn/fairlearn-maintainers
I can use a similar approach and it should not take much time to modify.
There are two ways CorrelationRemover works right now:
1) Using numpy arrays: since these do not carry feature names, indices have to be passed.
2) When a pandas dataframe is passed, it can take advantage of column names.
Both are passed via sensitive_feature_ids.
The signature should be def fit(self, X, y, **kwargs), where **kwargs contains the sensitive features.
We also should accept anything, not just a dataframe.
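A minimal sketch of that signature, assuming the sensitive features arrive as a keyword argument (class and attribute names here are illustrative, not the PR's actual implementation):

```python
import numpy as np


class OptimizedPreprocessorSketch:
    """Skeleton only; the real fitting logic is elided."""

    def fit(self, X, y, **kwargs):
        sensitive_features = kwargs.get("sensitive_features")
        # np.asarray accepts lists, ndarrays, or dataframes alike,
        # so the estimator is not tied to pandas.
        self.X_ = np.asarray(X)
        self.y_ = np.asarray(y)
        self.sensitive_features_ = np.asarray(sensitive_features)
        return self


est = OptimizedPreprocessorSketch().fit(
    [[0.1], [0.2]], [0, 1], sensitive_features=[0, 1]
)
```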
Apologies for the delayed response, I had some interviews!
I changed the code according to the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html
Please review @adrinjalali
constraints.append(PYhgD[d2, :].T <= PYhgD[d, :].T * (1 + self.epsilon))

# Mean distortion
# Pxy_xhyhJoint = (self.dfMask_Pxyd_to_Pxy.values.T).dot(np.diag(PxydMarginal))*Pmap
Can this commented out code be removed?
@@ -6,3 +6,5 @@ pandas>=2.0.3
pyarrow>=15
scikit-learn>=1.2.1
scipy>=1.9.3
cvxpy
This may be a problem... I believe this has come up in the past, too. I wouldn't mind it as a soft dependency. Paging @fairlearn/fairlearn-maintainers 🙂
test/unit/preprocessing/Discrimation_remover_optimized_preprocess/test_optimPreproc.py
Hi!
I've only been able to skim the code for now. A few high-level comments:
- Could you add more comments and class/function descriptions to indicate what the different pieces of code are doing? That would make it much easier to review! :)
- Please try to make sure the API corresponds to fairlearn/scikit-learn conventions as much as possible.
- The current implementation relies heavily on Pandas dataframes, which might not be ideal.
- The distortion function seems to be a crucial, but also quite tricky, part of this method. From what I understand so far, it needs to be explicitly defined for a particular dataset. It's not immediately clear to me what would be the best way to implement this in a user-friendly manner.
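To illustrate the kind of per-dataset distortion function being discussed, here is a hypothetical example (the column names, costs, and call shape are all invented for illustration):

```python
def distortion(old_row, new_row):
    # Hypothetical per-pair cost for a dataset with an "age_decade"
    # feature and a binary "label": flipping the label is expensive,
    # and larger age changes cost more.
    cost = 0.0
    if old_row["label"] != new_row["label"]:
        cost += 1.0
    cost += 0.5 * abs(old_row["age_decade"] - new_row["age_decade"])
    return cost


# Swapping the label and moving age by one decade:
c = distortion({"age_decade": 3, "label": 0},
               {"age_decade": 4, "label": 1})
```

Because the function encodes domain judgments like these, it probably cannot be given a sensible default, which is the user-friendliness difficulty noted above.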
self.opt.compute_marginals()
"""

def transform(self, df, X_features, Y_features, D_features, transform_Y=False):
Can you explain the purpose of transform_Y?
When features are being transformed, there is an option to transform the labels (Y_features) along with the data. If a user wishes to transform the labels as well as the X features, they can set transform_Y=True.
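As a sketch of that branching (the helper name and shape are hypothetical, not the PR's code):

```python
def output_columns(X_features, Y_features, transform_Y=False):
    # Decide which columns the transform returns: always the remapped
    # X features, plus the remapped labels only when requested.
    cols = list(X_features)
    if transform_Y:
        cols += list(Y_features)
    return cols
```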
I started reviewing the API, and I see it's far from a scikit-learn compatible API. We should fix that first before merging.
Also, please apply black to your new files.
and objectives [3]_.

References:
.. [3] F. P. Calmon, D. Wei, B. Vinzamuri, K. Natesan Ramamurthy, and
Why [3] and not [1]?
from sklearn.base import BaseEstimator, TransformerMixin


class OptimizedPreprocessor:
Should inherit from BaseEstimator and TransformerMixin.
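A minimal sketch of what inheriting from both buys: get_params/set_params (and hence cloning) from BaseEstimator, and fit_transform for free from TransformerMixin. The class name and parameter here are illustrative, and the fitting logic is elided:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class OptimizedPreprocessorSketch(BaseEstimator, TransformerMixin):
    def __init__(self, epsilon=0.05):
        self.epsilon = epsilon

    def fit(self, X, y=None):
        # Real fitting logic elided in this sketch.
        return self

    def transform(self, X):
        # Identity transform, for illustration only.
        return X


est = OptimizedPreprocessorSketch(epsilon=0.1)
params = est.get_params()          # provided by BaseEstimator
out = est.fit_transform([[1.0]])   # provided by TransformerMixin
```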
Adjusted the code according to https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html
self.optim_options = {
    "distortion_fun": distortion_function,
    "epsilon": epsilon,
    "clist": clist,
    "dlist": dlist,
}
__init__ should only store given input values and nothing else.
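In scikit-learn conventions, __init__ stores its parameters verbatim and any derived structure (like the optim_options dict quoted above) is built in fit with a trailing-underscore attribute; a sketch under those conventions (class name is illustrative):

```python
class PreprocessorSketch:
    def __init__(self, distortion_function=None, epsilon=0.05,
                 clist=None, dlist=None):
        # Store inputs untouched so get_params()/set_params() and
        # estimator cloning behave correctly.
        self.distortion_function = distortion_function
        self.epsilon = epsilon
        self.clist = clist
        self.dlist = dlist

    def fit(self, X, y=None):
        # Derived state is computed here, not in __init__.
        self.optim_options_ = {
            "distortion_fun": self.distortion_function,
            "epsilon": self.epsilon,
            "clist": self.clist,
            "dlist": self.dlist,
        }
        return self
```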
Description
closes #1028
Tests
Documentation
Screenshots