
FEAT Synthetic dataset creation #907

Status: Open. Wants to merge 45 commits into main.

Conversation

@coreysharris (Author):

This PR adds functionality for creating synthetic datasets. The starting point is the snippet contributed by @romanlutz in #793. I've included a simple example notebook for demonstration purposes, but it could be removed as the work here nears completion.

Closes #793.

@romanlutz (Member) left a comment:

Can you add some tests, too?

[Review threads on examples/plot_synthetic.py, fairlearn/datasets/synthetic.py, and fairlearn/datasets/__init__.py; all outdated, resolved.]
@romanlutz (Member):

Ah, before I forget: could you add a line to CHANGES.md describing your change (in the 0.7.1 section, since that'll be the next release)? Additionally, feel free to add your name to AUTHORS.md if you like.

[Review thread on CHANGES.md; outdated, resolved.]
@coreysharris (Author):

I've cleaned things up to hopefully clarify how we could expand into more interesting waters. The code that defines the gender feature is now quite small and could easily be separated into another function.

@coreysharris changed the title from "[WIP] Synthetic dataset creation" to "Synthetic dataset creation" on Jul 17, 2021.
@romanlutz (Member) left a comment:

@coreysharris this is fantastic and I've pointed out everything I possibly could since we're getting very close to merging this.

Do not worry about the currently failing (single) PR-Gate pipeline. That's a manifestation of the issue I described in #757 (comment) and won't be a blocker for your PR.

[Review thread on fairlearn/datasets/_synthetic.py; outdated, resolved.]
X, y, gender, test_size=0.3, random_state=rng
)

assert np.sum(X_train[0] < 0) == 8
Member:

Are these two lines just random aggregate stats you found? If so, perhaps asserting the dimensions and mean of the features and targets would be more useful?
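
For instance, a sketch of such checks (the concrete numbers here are hypothetical and depend on the generator's configuration, split, and seed):

# shapes are deterministic given the configuration and test_size
assert X_train.shape == (70, 20)  # e.g., 70% of 100 samples, 20 features
assert y_train.shape == (70,)
# distributional properties should be asserted with a tolerance, not exact values
assert abs(X_train.mean()) < 1.0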

[Review threads on test/unit/datasets/test_datasets.py and fairlearn/datasets/_synthetic.py; outdated, resolved.]
@@ -38,3 +40,26 @@ def test_dataset_as_X_y(self, as_frame, fetch_function):
assert isinstance(X, pd.DataFrame if as_frame else np.ndarray)
assert y is not None
assert isinstance(y, pd.Series if as_frame else np.ndarray)

def test_synthetic_datasets(self):
Member:

Can we add the following test cases as well?

  • pass in custom configuration with a different number of samples than the default and assert that the sum of the n_samples values adds up to the total size
  • pass in a different number of features and assert the resulting array dimension

Bonus: try out invalid args (all of these cases are sketched below):

  • n_features=0
  • empty dict for classes
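
A sketch of all of these cases; the function name make_sensitive_classification and the feature_config argument follow the naming that emerges later in this thread, so treat the exact signature as an assumption:

from fairlearn.datasets import make_sensitive_classification  # import path per this PR
import pytest

def test_custom_group_sizes():
    config = {'a': {'n_samples': 100}, 'b': {'n_samples': 150}}
    X, y, sensitive_features = make_sensitive_classification(feature_config=config)
    assert X.shape[0] == 250  # total equals the sum of the n_samples values

def test_custom_n_features():
    X, y, sensitive_features = make_sensitive_classification(n_features=5)
    assert X.shape[1] == 5

def test_invalid_args():
    with pytest.raises(ValueError):
        make_sensitive_classification(n_features=0)
    with pytest.raises(ValueError):
        make_sensitive_classification(feature_config={})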

Author:

As it stands, such invalid args result in a ValueError. Is that acceptable, or should we do some kind of sanitization for the user? I'm specifically thinking about classes (now called feature_config). It's not very user-friendly to ask for this dict in the first place, and I hope we can replace it with something cleaner in the follow-up, but for now I wonder if we should try to ensure that the dict has entries of the form 'label': {'n_samples': n_samples, 'class_sep': class_sep}.
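
A minimal validation sketch for that shape (illustrative only; the allowed key set would grow with whatever kwargs end up being supported):

def _validate_feature_config(feature_config):
    # each entry must map a group label to a dict of known kwargs
    if not feature_config:
        raise ValueError("feature_config must define at least one group.")
    allowed_keys = {'n_samples', 'class_sep'}
    for label, group in feature_config.items():
        if not isinstance(group, dict):
            raise ValueError(f"Entry for {label!r} must be a dict.")
        unknown = set(group) - allowed_keys
        if unknown:
            raise ValueError(f"Unknown keys for {label!r}: {sorted(unknown)}")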

Member:

That's fine 🙂 We have plenty of tests for invalid inputs where we assert that the correct errors/exceptions are raised. Typically it comes out as

with pytest.raises(ValueError) as error:
    ...  # call under test that should raise
assert str(error.value) == expected_msg

I completely agree on the dict not being ideal. I started thinking a bit about having a well-defined config object to which you can add groups. Perhaps something that would work like this:

sensitive_feature_config = SensitiveFeatureConfig(names=["gender", "age"])
sensitive_feature_config.add_group(gender="male", age="0-20", n_samples=123, class_sep=1.1, ...)
sensitive_feature_config.add_group(gender="female", age="0-20", n_samples=321, class_sep=0.6, ...)
sensitive_feature_config.add_group(gender="non-binary", age="0-20", n_samples=231, class_sep=1.7, ...)
...
sensitive_feature_config.add_group(gender="male", age="80+", n_samples=103, class_sep=1.05, ...)
sensitive_feature_config.add_group(gender="female", age="80+", n_samples=183, class_sep=1.4, ...)
sensitive_feature_config.add_group(gender="non-binary", age="80+", n_samples=123, class_sep=0.2, ...)
...
...

[I would allow all sorts of args to be passed through to make_classification via kwargs.]

Again, I think this goes quite a bit beyond the basic version you have in this PR, so I wouldn't expect this right away, and I feel like we should get some more input on this from others such as @adrinjalali @MiroDudik @riedgar-ms @hildeweerts
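
For illustration, a bare-bones version of the proposed object (SensitiveFeatureConfig and add_group are hypothetical here; everything beyond n_samples would be forwarded to sklearn.datasets.make_classification):

class SensitiveFeatureConfig:
    def __init__(self, names):
        self.names = names   # e.g., ["gender", "age"]
        self.groups = []     # (group labels, make_classification kwargs) pairs

    def add_group(self, n_samples, **kwargs):
        # kwargs keyed by a feature name select the group; the rest pass through
        labels = tuple(kwargs.pop(name) for name in self.names)
        self.groups.append((labels, dict(n_samples=n_samples, **kwargs)))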

Author:

This seems like a nice way to handle multiple features. I hadn't thought of dealing directly with group intersections in the config.

Member:

I like @romanlutz's proposal, except that I'd have an object for each sensitive feature, and not have the Cartesian product written out in the code.

I think it's worth having an API we like in the first version. I'd rather not have to deprecate later when we already know we don't like the API as is. I'm happy to have a SensitiveFeatureGenerator object, but the API needs to be a bit less verbose than what Roman proposed. I'm not sure we have to give users this much control over how the data is generated; if they really need that much control, they can write their own generator. This function is supposed to generate some dummy data for demonstration purposes, not data that mirrors real-world issues.
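
Sketched concretely, the per-feature variant might look like this (it anticipates the SensitiveFeature API that appears later in this PR; the intersections would be derived internally rather than enumerated by the caller):

gender = SensitiveFeature('gender', ['male', 'female', 'non-binary'])
age = SensitiveFeature('age', ['0-20', '21-64', '65+'])
dataset = SensitiveDatasetMaker(sensitive_features=[gender, age])
# the 3 x 3 group intersections are generated with randomized defaults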

Author:

I've tried something like this here: coreysharris#1. I opened it as a PR against this branch so as not to muddy the waters here if people don't like it. I think the API is much cleaner, and it makes it easier to extend functionality later.

One note: although it looks like it now supports multiple features, that's not the case. The implementation needs to be reworked, but I haven't gotten that far.

Member:

Somehow I missed this, sorry for the delay!
I love it! Do you want to pull that change into your branch so that we can review it as part of this PR?

Author:

Ok, great! In order to support multiple features, the API has to change slightly. It looks more and more like your suggestion above 😛


Parameters
----------
classes : dict, default=None
Member:

I'm not super happy with the arg name classes since that's typically used for target classes (as in multi-class classification, for example). Groups is the term I'd prefer, but it's also a config dictionary for the groups, so perhaps group_config? I strongly suspect we'll expand this in the future to work for multiple sensitive features, so this arg will likely evolve.

This is also focused on binary classification right now (which is perfectly fine!), so perhaps we should include that in the name (similar to sklearn's make_classification)?

@adrinjalali @hildeweerts @riedgar-ms @MiroDudik any thoughts on naming and API?

Author:

I've replaced classes with feature_config. See #907 (comment) .

I actually originally called it make_classification, rather than make_synthetic_dataset! I replaced it because I wasn't sure about overloading such a common function from scikit-learn. What about make_sensitive_classification or make_gendered_classification?

Member:

It's certainly not an unreasonable thought. I'll wait for others to chime in (at least @adrinjalali may have thoughts), but make_sensitive_classification sounds like a name that'll work even with more sensitive features (even though that may not be part of the initial PR).

@coreysharris (Author):

I've brought in the changes from my other PR, but as I mentioned above, it doesn't quite work if we want to support multiple sensitive features. I'll make the necessary changes and update.

@romanlutz (Member):

Thanks @coreysharris! Please ping me and perhaps @fairlearn/fairlearn-maintainers so that we don't miss it 🙂

@romanlutz mentioned this pull request on Sep 11, 2021.
@coreysharris (Author):

@romanlutz @adrinjalali @riedgar-ms @LeJit This is now working for multiple sensitive features.
It's in rougher shape than the previous versions, because I want to make sure you like this approach before I smooth out the sharp edges.

Here is the MetricFrame generated by the example notebook:

[screenshot of the MetricFrame output]

@romanlutz (Member) left a comment:

I like where this is going! Thanks a ton for making all the requested changes, @coreysharris !

@fairlearn/fairlearn-maintainers you should really take a look at this and provide feedback now if you have thoughts 🙂

[Review thread on docs/user_guide/installation_and_version_guide/v0.7.1.rst; outdated, resolved.]
dataset = SensitiveDatasetMaker(sensitive_features=[gender_feature, age_feature], random_state=rng)

# need a convenience method here
dataset.configured_groups[('Man', '<21')].classification_kwargs['n_samples'] = 100
Member:

When reading this example I'm not entirely sure whether the other groups were forgotten or whether it will "just work" by having all the other groups with random sizes. Can you add a little bit of description (as @adrinjalali suggested in another comment)?

Member:

After re-reading the code I realize now that all groups will have 300 rows except ('Man', '<21'), which will have 100. Originally I had assumed that 300 was the total size, and that since ('Man', '<21') gets 100 rows the remaining 200 would be divided among the other groups at random (or perhaps equally). This misunderstanding of mine shows that we'll need to be a bit clearer in explaining the functionality. I can see a few ways:

  • rename the n_samples param, because it's really n_samples_per_group, OR change the functionality (e.g., make it the total number of samples, though that's probably harder since the other way is already implemented)
  • add a comment here and a nice docstring in the class explaining what's happening under the hood
  • explain what happens to the groups that aren't manually configured

The bar plot below should show the group sizes, so that's great!
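
To make the per-group semantics concrete (a sketch based on the snippet above and the 300-row default described in this comment):

# every group defaults to 300 rows; only ('Man', '<21') is overridden
dataset.configured_groups[('Man', '<21')].classification_kwargs['n_samples'] = 100
# with 4 gender values and 3 age bands: 11 groups * 300 + 1 group * 100 = 3400 rows total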

Contributor:

I was also confused by this and I like @romanlutz's suggested solution!

Author:

Changed to n_samples_per_group.

> OR change functionality (e.g., making that the total number of samples, but that's probably harder to achieve since the other way is already implemented)

I did implement this in the previous iteration, but I removed it here just to reduce complexity. If you think it would be useful, it wouldn't be hard to bring back.

Member:

I'm fine with n_samples_per_group. Would love to hear other @fairlearn/fairlearn-maintainers opinions, though!

[Three review threads on fairlearn/datasets/_synthetic.py; resolved.]
assert y.shape == (2500,)
assert gender.shape == (2500,)

def test_custom_make_sensitive_classification(self):
Member:

Would be awesome to have a test case with at least 2 sensitive features, too. In the current setup you seem to allow 0 as well; if that's still changing we don't need a test since it would be obsolete, but if you see 0 as a legitimate use case we'll need a test for it, too, of course.
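
A sketch of such a test, using the SensitiveFeature/SensitiveDatasetMaker API from this PR (the generation method name and the 300-rows-per-group default are assumptions based on this thread):

def test_two_sensitive_features(self):
    rng = np.random.RandomState(42)
    gender = SensitiveFeature('Gender', ['Man', 'Woman'])
    age = SensitiveFeature('Age', ['<40', '40+'])
    dataset = SensitiveDatasetMaker(sensitive_features=[gender, age], random_state=rng)
    X, y, features = dataset.make_sensitive_classification()  # hypothetical method name
    assert X.shape[0] == 1200  # 2 x 2 groups at 300 rows each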

Comment on lines 85 to 89
'n_features': 20,
'n_informative': 4,
'n_classes': 2,
'n_samples': 50,
'random_state': self.rng,
Member:

These seem like rather important default values that should be highlighted in the docstring of this method.

Comment on lines +15 to +22
""" `group_dict` is a dict assigning to a group in each sensitive
feature category. E.g.,::
{
'gender': 'Other',
'age': '50-60',
}
`classification_kwargs` is the set of kwargs that will be passed on to
make_classification.
Member:

It may be preferable to have a description of the class first, then an example, and then the parameters section.
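
For instance, following the ordering suggested here (description, then example, then parameters), reusing the content of the docstring under review:

"""Configuration for a single intersectional group.

Example::

    group_dict = {'gender': 'Other', 'age': '50-60'}

Parameters
----------
group_dict : dict
    Maps each sensitive feature name to this group's label.
classification_kwargs : dict
    Keyword arguments passed through to make_classification.
"""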

dataset = SensitiveDatasetMaker(sensitive_features=[gender_feature, age_feature], random_state=rng)

# need a convenience method here
dataset.configured_groups[('Man', '<21')].classification_kwargs['n_samples'] = 100
Member:

Don't we need to skew the parameters passed to make_classification? Otherwise the classes will be indistinguishable to the model.

@hildeweerts (Contributor) left a comment:

Many of the docstrings are incomplete or missing. Could you please make sure to add one to each class and function?

It is also not 100% clear to me what specific form of unfairness/bias is simulated in the dataset (which I expect will be much clearer once you've updated the docstrings and added more detailed comments to the example). Am I correct in assuming it includes (1) a different data distribution for each sensitive group, (2) option to generate different number of samples per sensitive group, but not (3) option to have different base rate (i.e., proportion of positives) for each sensitive group?

Thanks!

rng = np.random.RandomState(42)

gender_feature = SensitiveFeature('Gender', ['Man', 'Other', 'Unspecified', 'Woman'])
age_feature = SensitiveFeature('Age', ['<21', '21-64', '>64'])
Contributor:

Extremely nitpicky, but using 21 as an age cut-off is a bit US-centric :)

Author:

I've replaced these with more arbitrary values 😄


@coreysharris (Author):

> Am I correct in assuming it includes (1) a different data distribution for each sensitive group, (2) option to generate different number of samples per sensitive group, but not (3) option to have different base rate (i.e., proportion of positives) for each sensitive group?

Better documentation to come, but the short answer is that all three are included. (3) is randomized for each group, but can be overridden by the user.
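
Since classification_kwargs is passed through to sklearn's make_classification, overriding (3) might look like this (the group key here is illustrative; weights is make_classification's parameter for class proportions):

# pin the base rate for one group at ~20% positives instead of the randomized default
group = dataset.configured_groups[('Woman', '40+')]
group.classification_kwargs['weights'] = [0.8]  # 80% negatives, 20% positives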

@romanlutz (Member):

@coreysharris I just wanted to check in on this in case you need help on anything. Seems like the main missing piece is documentation. Let us know 🙂 We have weekly community calls in case you want to talk about anything: https://fairlearn.org/main/contributor_guide/index.html#community-calls

@romanlutz (Member):

@coreysharris I'm checking in on a few open PRs to see if the author is still interested in continuing. If not, someone else can pick it up from here and continue. Of course, if you are still interested it's best that you continue even if it's not right away (no hurry!). Just let us know! If there's no response by 12/28 (two weeks from today) I'll assume others can pick it up.

Merging this pull request may close: ENH Creating a synthetic example dataset (#793).