
ENH Optimized Preprocessing Algorithm (NIPS 2016) #1340

Open · wants to merge 8 commits into base: main
Conversation

@rakesh9177 rakesh9177 commented Jan 29, 2024

Description

closes #1028

Tests

  • no new tests required
  • new tests added
  • existing tests adjusted

Documentation

  • no documentation changes needed
  • user guide added or updated
  • API docs added or updated
  • example notebook added or updated

Screenshots

@romanlutz (Member) left a comment


Thanks @rakesh9177 for submitting such an extensive pull request! I personally don't have time for an in-depth review right now, but hopefully someone else from @fairlearn/fairlearn-maintainers does. I left a bunch of higher-level and nitpicky comments that should be fairly easy to address.

from sklearn.base import BaseEstimator, TransformerMixin


class OptimizedPreprocessor:
Member

I get that this name is based on the AIF360 implementation, but I don't like it because it's not at all descriptive. From a glance at the abstract (I haven't read the full paper), perhaps something like discrimination-reducing preprocessor is possible? Not sure if I like that much more 😆 @fairlearn/fairlearn-maintainers please chime in.

Contributor

I would not go for something like "discrimination-reducing", as that could imply the pre-processing algorithm maps neatly onto legal notions of discrimination.

The technique aims to identify a randomized mapping that minimizes the difference between the original distribution of X and Y and the distribution of the pre-processed data ("utility"), subject to a demographic parity ratio constraint ("discrimination control") and an expected individual "distortion" constraint ("distortion control"). The last component determines which transformations are penalized (e.g. swapping labels, changing particular features, etc.). So perhaps a name related to demographic parity and individual distortions would make sense?
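The "discrimination control" piece of this is visible in the constraint snippet reviewed below (`PYhgD[d2, :] <= PYhgD[d, :] * (1 + epsilon)`). A toy numpy sketch of that pairwise ratio check, with illustrative values that are not from the PR:

```python
import numpy as np

# Toy post-mapping distribution: two groups (rows) and a binary outcome (columns).
# PYhgD[d, yhat] = P(Yhat = yhat | D = d) under the randomized mapping.
PYhgD = np.array([
    [0.40, 0.60],  # group 0
    [0.45, 0.55],  # group 1
])

def satisfies_discrimination_control(PYhgD, epsilon):
    """Check the pairwise ratio constraint ("discrimination control"):
    for every ordered pair of groups (d, d2),
    P(Yhat | d2) <= (1 + epsilon) * P(Yhat | d) elementwise."""
    n_groups = PYhgD.shape[0]
    for d in range(n_groups):
        for d2 in range(n_groups):
            if not np.all(PYhgD[d2] <= PYhgD[d] * (1 + epsilon)):
                return False
    return True

print(satisfies_discrimination_control(PYhgD, 0.2))   # True: groups differ by < 20%
print(satisfies_discrimination_control(PYhgD, 0.01))  # False: tighter tolerance fails
```

In the actual optimization these inequalities become convex constraints on the mapping variables rather than a post-hoc check.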

fairlearn/preprocessing/_optimPreproc.py (outdated; resolved)
self.optim_options = optim_options
self.verbose = verbose

def fit(self, df, X_features, Y_features, D_features):
Member

This isn't quite in line with our usual API for fit (see any other technique in our repo). I'd prefer keeping it consistent.

Contributor

In particular, have a look at our other preprocessing algorithm, CorrelationRemover.

Although I think we could also simply have something like .fit(X, y, sensitive_features) rather than __init__(sensitive_feature_ids) - does anybody recall why we went for sensitive_feature_ids at initialization in CorrelationRemover? Tagging @fairlearn/fairlearn-maintainers

Author

I can use a similar approach, and it should not take much time to modify.

CorrelationRemover currently works in two ways:

  1. With numpy arrays: since these carry no feature names, column indices have to be passed.
  2. With a pandas DataFrame: here it can take advantage of the column names.

Both are passed via sensitive_feature_ids.
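The dual behavior described above (integer indices for ndarrays, column names for DataFrames) could be handled by a small resolver; the helper name here is illustrative, not fairlearn API:

```python
import numpy as np
import pandas as pd

def resolve_sensitive_columns(X, sensitive_feature_ids):
    """Map sensitive_feature_ids to column positions, accepting either
    integer indices (ndarray input) or column names (DataFrame input)."""
    if isinstance(X, pd.DataFrame):
        return [c if isinstance(c, int) else X.columns.get_loc(c)
                for c in sensitive_feature_ids]
    return list(sensitive_feature_ids)

X_np = np.arange(6).reshape(3, 2)
print(resolve_sensitive_columns(X_np, [1]))         # [1]

X_df = pd.DataFrame(X_np, columns=["age", "gender"])
print(resolve_sensitive_columns(X_df, ["gender"]))  # [1]
```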

Member

The signature should be def fit(self, X, y, **kwargs):, where **kwargs carries the sensitive features.

We also should accept anything array-like, not just a DataFrame.
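A minimal sketch of that suggested signature (class name hypothetical, input validation reduced to a plain array coercion):

```python
import numpy as np

class PreprocessorSketch:
    """Sketch of the suggested fit signature: fit(X, y, **kwargs), with the
    sensitive features arriving via kwargs, and inputs coerced to arrays
    rather than requiring a pandas DataFrame."""

    def fit(self, X, y, **kwargs):
        X = np.asarray(X)  # accepts lists, ndarrays, and DataFrames alike
        y = np.asarray(y)
        self.sensitive_features_ = np.asarray(kwargs["sensitive_features"])
        self.n_features_in_ = X.shape[1]
        return self

est = PreprocessorSketch().fit([[0, 1], [1, 0]], [0, 1], sensitive_features=[0, 1])
print(est.n_features_in_)  # 2
```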

Author

Apologies for the delayed response since I had some interviews!

I changed the code according to the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html

Please review @adrinjalali

fairlearn/preprocessing/_optimPreproc.py (outdated; resolved)
fairlearn/preprocessing/_optimPreproc.py (outdated; resolved)
fairlearn/preprocessing/_optimPreproc_helper.py (outdated; resolved)
constraints.append(PYhgD[d2, :].T <= PYhgD[d, :].T * (1 + self.epsilon))

# Mean distortion
# Pxy_xhyhJoint = (self.dfMask_Pxyd_to_Pxy.values.T).dot(np.diag(PxydMarginal))*Pmap
Member

Can this commented out code be removed?

@@ -6,3 +6,5 @@ pandas>=2.0.3
pyarrow>=15
scikit-learn>=1.2.1
scipy>=1.9.3
cvxpy
Member

This may be a problem... I believe this has come up in the past, too. I wouldn't mind it as a soft dependency. Paging @fairlearn/fairlearn-maintainers 🙂

@hildeweerts (Contributor) left a comment

Hi!
I've only been able to skim the code for now. A few high-level comments:

  • Could you add more comments and class/function descriptions to indicate what the different pieces of code are doing? That would make it much easier to review! :)
  • Please try to make sure the API corresponds to fairlearn/scikit-learn conventions as much as possible.
  • The current implementation relies heavily on Pandas dataframes, which might not be ideal.
  • The distortion function seems to be a crucial, but also quite tricky, part of this method. From what I understand so far, it needs to be explicitly defined for a particular dataset. It's not immediately clear to me what would be the best way to implement this in a user-friendly manner.
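To make the last point concrete, a hypothetical per-dataset distortion function of the kind the method requires: it scores how objectionable it is to map one (features, label) record to another. The field names and cost values are illustrative only, not from the PR:

```python
def distortion(old, new):
    """Return the distortion cost of mapping record `old` to record `new`.
    Records are dicts of feature values plus a 'label' entry."""
    cost = 0.0
    if old["label"] != new["label"]:
        cost += 1.0  # flipping the outcome label is heavily penalized
    # small shifts in a binned feature are cheap, large shifts add up
    cost += 0.5 * abs(old["age_bucket"] - new["age_bucket"])
    return cost

same = {"age_bucket": 2, "label": 1}
print(distortion(same, same))                           # 0.0
print(distortion(same, {"age_bucket": 3, "label": 0}))  # 1.5
```

Because the costs are tied to the meaning of specific columns, the function cannot ship with the library; the user has to supply it per dataset, which is exactly the usability question raised above.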

fairlearn/preprocessing/_optimPreproc.py (outdated; resolved)
fairlearn/preprocessing/_optimPreproc.py (outdated; resolved)
self.opt.compute_marginals()
"""

def transform(self, df, X_features, Y_features, D_features, transform_Y=False):
Contributor

Can you explain the purpose of transform_Y?

Author

When the features are transformed, there is an option to transform the labels (Y_features) along with the data. If users wish to transform the labels as well as the X features, they can set transform_Y=True.
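The semantics of such a flag, reduced to a minimal sketch (the function and the mapping here are illustrative, not the PR's implementation):

```python
import numpy as np

def transform(X, y, mapping, transform_y=False):
    """Illustrate a transform_Y-style flag: the feature matrix is always
    transformed; labels are remapped only when the flag is set."""
    X_new = mapping(X)
    y_new = mapping(y) if transform_y else y
    return X_new, y_new

flip = lambda a: 1 - np.asarray(a)  # toy randomized-mapping stand-in
X, y = np.array([[0, 1]]), np.array([1])

X1, y1 = transform(X, y, flip)                   # labels untouched
X2, y2 = transform(X, y, flip, transform_y=True) # labels remapped too
```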

@adrinjalali (Member) left a comment

I started reviewing the API, and it's far from scikit-learn compatible. We should fix that before merging.

Also, please apply black to your new files.

and objectives [3]_.

References:
.. [3] F. P. Calmon, D. Wei, B. Vinzamuri, K. Natesan Ramamurthy, and
Member

Why [3] and not [1]?

from sklearn.base import BaseEstimator, TransformerMixin


class OptimizedPreprocessor:
Member

This should inherit from BaseEstimator and TransformerMixin.
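A sketch of what that inheritance buys, assuming scikit-learn is available (the class name is hypothetical): BaseEstimator supplies get_params/set_params (and hence clone support), and TransformerMixin adds fit_transform for free.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class OptimizedPreprocessorSketch(BaseEstimator, TransformerMixin):
    """Minimal conforming transformer: identity transform, for illustration."""

    def __init__(self, epsilon=0.05):
        self.epsilon = epsilon  # stored verbatim, per sklearn convention

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

est = OptimizedPreprocessorSketch(epsilon=0.1)
print(est.get_params())             # {'epsilon': 0.1}  -- from BaseEstimator
print(est.fit_transform([[1, 2]]))  # [[1, 2]]          -- from TransformerMixin
```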


Comment on lines +27 to +32
self.optim_options = {
"distortion_fun": distortion_function,
"epsilon": epsilon,
"clist": clist,
"dlist": dlist,
}
Member

__init__ should only store given input values and nothing else.
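A sketch of that convention, reusing the parameter names from the snippet above (the class name is hypothetical): __init__ stores each argument verbatim, and the derived optim_options dict is assembled in fit instead, so get_params/set_params and clone round-trip cleanly.

```python
class OptimPreprocSketch:
    def __init__(self, distortion_fun=None, epsilon=0.05, clist=(), dlist=()):
        # store parameters exactly as given; no derived state here
        self.distortion_fun = distortion_fun
        self.epsilon = epsilon
        self.clist = clist
        self.dlist = dlist

    def fit(self, X, y=None):
        # derived structures are built at fit time, with a trailing
        # underscore marking fitted attributes
        self.optim_options_ = {
            "distortion_fun": self.distortion_fun,
            "epsilon": self.epsilon,
            "clist": list(self.clist),
            "dlist": list(self.dlist),
        }
        return self

est = OptimPreprocSketch(epsilon=0.1)
print(hasattr(est, "optim_options_"))  # False before fit
est.fit(None)
print(est.optim_options_["epsilon"])   # 0.1
```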

Successfully merging this pull request may close these issues.

ENH Add mitigation algorithm from "Optimized Pre-Processing for Discrimination Prevention" by Calmon et al.