
FEAT Synthetic dataset creation #907

Status: Open. Wants to merge 45 commits into main.

Conversation

@coreysharris (Author):

This PR adds functionality for creating synthetic datasets. The starting point is the snippet contributed by @romanlutz in #793. I've included a simple example notebook for demonstration purposes, but it could be removed as the work here nears completion.

Closes #793.

@romanlutz (Member) left a comment:

Can you add some tests, too?

[Review threads on examples/plot_synthetic.py, fairlearn/datasets/synthetic.py, and fairlearn/datasets/__init__.py; all outdated, resolved.]
@romanlutz (Member):

Ah, before I forget: could you add a line to CHANGES.md describing your change (in the 0.7.1 section, since that'll be the next release)? Additionally, feel free to add your name to AUTHORS.md if you like.

[Review thread on CHANGES.md; outdated, resolved.]
@coreysharris (Author):

I've cleaned things up to hopefully clarify how we could expand into more interesting waters. The code that defines the gender feature is now quite small and could easily be separated into another function.

@coreysharris changed the title from "[WIP] Synthetic dataset creation" to "Synthetic dataset creation" on Jul 17, 2021.
@romanlutz (Member) left a comment:

@coreysharris this is fantastic and I've pointed out everything I possibly could since we're getting very close to merging this.

Do not worry about the currently failing (single) PR-Gate pipeline. That's a manifestation of the issue I described in #757 (comment) and won't be a blocker for your PR.

[Review thread on fairlearn/datasets/_synthetic.py; outdated, resolved.]
X, y, gender, test_size=0.3, random_state=rng
)

assert np.sum(X_train[0] < 0) == 8
Member:

Are these two lines just random aggregate stats you found? If so, perhaps asserting the dimensions and mean of the features and targets would be more useful?
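
For instance, a sketch of such checks (the concrete numbers here are hypothetical and depend on the generator's configuration, split, and seed):

# shapes are deterministic given the configuration and test_size
assert X_train.shape == (70, 20)  # e.g., 70% of 100 samples, 20 features
assert y_train.shape == (70,)
# distributional properties should be asserted with a tolerance, not exact values
assert abs(X_train.mean()) < 1.0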

[Review threads on test/unit/datasets/test_datasets.py and fairlearn/datasets/_synthetic.py; outdated, resolved.]
@@ -38,3 +40,26 @@ def test_dataset_as_X_y(self, as_frame, fetch_function):
assert isinstance(X, pd.DataFrame if as_frame else np.ndarray)
assert y is not None
assert isinstance(y, pd.Series if as_frame else np.ndarray)

def test_synthetic_datasets(self):
Member:

Can we add the following test cases as well?

  • pass in custom configuration with a different number of samples than the default and assert that the sum of the n_samples values adds up to the total size
  • pass in a different number of features and assert the resulting array dimension

Bonus: try out invalid args (all of these cases are sketched below):

  • n_features=0
  • empty dict for classes
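
A sketch of all of these cases; the function name make_sensitive_classification and the feature_config argument follow the naming that emerges later in this thread, so treat the exact signature as an assumption:

from fairlearn.datasets import make_sensitive_classification  # import path per this PR
import pytest

def test_custom_group_sizes():
    config = {'a': {'n_samples': 100}, 'b': {'n_samples': 150}}
    X, y, sensitive_features = make_sensitive_classification(feature_config=config)
    assert X.shape[0] == 250  # total equals the sum of the n_samples values

def test_custom_n_features():
    X, y, sensitive_features = make_sensitive_classification(n_features=5)
    assert X.shape[1] == 5

def test_invalid_args():
    with pytest.raises(ValueError):
        make_sensitive_classification(n_features=0)
    with pytest.raises(ValueError):
        make_sensitive_classification(feature_config={})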

Author:

As it stands, such invalid args result in a ValueError. Is that acceptable, or should we do some kind of sanitization for the user? I'm specifically thinking about classes (now called feature_config). It's not very user-friendly to ask for this dict in the first place, and I hope we can replace it with something cleaner in the follow-up, but for now I wonder if we should try to ensure that the dict has entries of the form 'label': {'n_samples': n_samples, 'class_sep': class_sep}.
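
A minimal validation sketch for that shape (illustrative only; the allowed key set would grow with whatever kwargs end up being supported):

def _validate_feature_config(feature_config):
    # each entry must map a group label to a dict of known kwargs
    if not feature_config:
        raise ValueError("feature_config must define at least one group.")
    allowed_keys = {'n_samples', 'class_sep'}
    for label, group in feature_config.items():
        if not isinstance(group, dict):
            raise ValueError(f"Entry for {label!r} must be a dict.")
        unknown = set(group) - allowed_keys
        if unknown:
            raise ValueError(f"Unknown keys for {label!r}: {sorted(unknown)}")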

Member:

That's fine 🙂 We have plenty of tests for invalid inputs where we assert that the correct errors/exceptions are raised. Typically it comes out as

with pytest.raises(ValueError) as error:
    ...  # call under test that should raise
assert str(error.value) == expected_msg

I completely agree on the dict not being ideal. I started thinking a bit about having a well-defined config object to which you can add groups. Perhaps something that would work like this:

sensitive_feature_config = SensitiveFeatureConfig(names=["gender", "age"])
sensitive_feature_config.add_group(gender="male", age="0-20", n_samples=123, class_sep=1.1, ...)
sensitive_feature_config.add_group(gender="female", age="0-20", n_samples=321, class_sep=0.6, ...)
sensitive_feature_config.add_group(gender="non-binary", age="0-20", n_samples=231, class_sep=1.7, ...)
...
sensitive_feature_config.add_group(gender="male", age="80+", n_samples=103, class_sep=1.05, ...)
sensitive_feature_config.add_group(gender="female", age="80+", n_samples=183, class_sep=1.4, ...)
sensitive_feature_config.add_group(gender="non-binary", age="80+", n_samples=123, class_sep=0.2, ...)
...
...

[I would allow all sorts of args to be passed through to make_classification via kwargs.]

Again, I think this goes quite a bit beyond the basic version you have in this PR, so I wouldn't expect this right away, and I feel like we should get some more input on this from others such as @adrinjalali @MiroDudik @riedgar-ms @hildeweerts
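
For illustration, a bare-bones version of the proposed object (SensitiveFeatureConfig and add_group are hypothetical here; everything beyond n_samples would be forwarded to sklearn.datasets.make_classification):

class SensitiveFeatureConfig:
    def __init__(self, names):
        self.names = names   # e.g., ["gender", "age"]
        self.groups = []     # (group labels, make_classification kwargs) pairs

    def add_group(self, n_samples, **kwargs):
        # kwargs keyed by a feature name select the group; the rest pass through
        labels = tuple(kwargs.pop(name) for name in self.names)
        self.groups.append((labels, dict(n_samples=n_samples, **kwargs)))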

Author:

This seems like a nice way to handle multiple features. I hadn't thought of dealing directly with group intersections in the config.

Member:

I like @romanlutz's proposal, except that I'd have an object for each sensitive feature, and not have the Cartesian product written out in the code.

I think it's worth having an API we like in the first version. I'd rather not have to deprecate later when we already know we don't like the API as is. I'm happy to have a SensitiveFeatureGenerator object, but the API needs to be a bit less verbose than what Roman proposed. I'm not sure we have to give users this much control over how the data is generated; if they really need that much control, they can write their own generator. This function is supposed to generate some dummy data for demonstration purposes, not data that mirrors real-world issues.
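
Sketched concretely, the per-feature variant might look like this (it anticipates the SensitiveFeature API that appears later in this PR; the intersections would be derived internally rather than enumerated by the caller):

gender = SensitiveFeature('gender', ['male', 'female', 'non-binary'])
age = SensitiveFeature('age', ['0-20', '21-64', '65+'])
dataset = SensitiveDatasetMaker(sensitive_features=[gender, age])
# the 3 x 3 group intersections are generated with randomized defaults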

Author:

I've tried something like this here: coreysharris#1. I opened it as a PR against this branch so as not to muddy the waters here if people don't like it. I think the API is much cleaner, and it makes it easier to extend functionality later.

One note: although it looks like it now supports multiple features, that's not the case. The implementation needs to be reworked, but I haven't gotten that far.

Member:

Somehow I missed this, sorry for the delay!
I love it! Do you want to pull that change into your branch so that we can review it as part of this PR?

Author:

Ok, great! In order to support multiple features, the API has to change slightly. It looks more and more like your suggestion above 😛


Parameters
----------
classes : dict, default=None
Member:

I'm not super happy with the arg name classes since that's typically used for target classes (as in multi-class classification, for example). Groups is the term I'd prefer, but it's also a config dictionary for the groups, so perhaps group_config? I strongly suspect we'll expand this in the future to work for multiple sensitive features, so this arg will likely evolve.

This is also focused on binary classification right now (which is perfectly fine!), so perhaps we should include that in the name (similar to sklearn's make_classification)?

@adrinjalali @hildeweerts @riedgar-ms @MiroDudik any thoughts on naming and API?

Author:

I've replaced classes with feature_config. See #907 (comment) .

I actually originally called it make_classification, rather than make_synthetic_dataset! I replaced it because I wasn't sure about overloading such a common function from scikit-learn. What about make_sensitive_classification or make_gendered_classification?

Member:

It's certainly not an unreasonable thought. I'll wait for others to chime in (at least @adrinjalali may have thoughts), but make_sensitive_classification sounds like a name that'll work even with more sensitive features (even though that may not be part of the initial PR).

@coreysharris (Author):

I've brought in the changes from my other PR, but as I mentioned above, it doesn't quite work if we want to support multiple sensitive features. I'll make the necessary changes and update.

@romanlutz (Member):

Thanks @coreysharris! Please ping me and perhaps @fairlearn/fairlearn-maintainers so that we don't miss it 🙂

@romanlutz mentioned this pull request on Sep 11, 2021.
@coreysharris (Author):

@romanlutz @adrinjalali @riedgar-ms @LeJit This is now working for multiple sensitive features.
It's in rougher shape than the previous versions, because I want to make sure you like this approach before I smooth out the sharp edges.

Here is the MetricFrame generated by the example notebook:

[screenshot of the MetricFrame output]

@romanlutz (Member) left a comment:

I like where this is going! Thanks a ton for making all the requested changes, @coreysharris !

@fairlearn/fairlearn-maintainers you should really take a look at this and provide feedback now if you have thoughts 🙂

[Review thread on docs/user_guide/installation_and_version_guide/v0.7.1.rst; outdated, resolved.]
dataset = SensitiveDatasetMaker(sensitive_features=[gender_feature, age_feature], random_state=rng)

# need a convenience method here
dataset.configured_groups[('Man', '<21')].classification_kwargs['n_samples'] = 100
Member:

When reading this example I'm not entirely sure whether the other groups were forgotten or whether it will "just work" by having all the other groups with random sizes. Can you add a little bit of description (as @adrinjalali suggested in another comment)?

Member:

After re-reading the code I realize now that all groups will have 300 rows except ('Man', '<21'), which will have 100. Originally I had assumed that 300 was the total size, and that since ('Man', '<21') gets 100 rows the remaining 200 would be divided among the other groups at random (or perhaps equally). This misunderstanding of mine shows that we'll need to be a bit clearer in explaining the functionality. I can see a few ways:

  • rename the n_samples param, because it's really n_samples_per_group, OR change the functionality (e.g., make it the total number of samples, though that's probably harder since the other way is already implemented)
  • add a comment here and a nice docstring in the class explaining what's happening under the hood
  • explain what happens to the groups that aren't manually configured

The bar plot below should show the group sizes, so that's great!
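
To make the per-group semantics concrete (a sketch based on the snippet above and the 300-row default described in this comment):

# every group defaults to 300 rows; only ('Man', '<21') is overridden
dataset.configured_groups[('Man', '<21')].classification_kwargs['n_samples'] = 100
# with 4 gender values and 3 age bands: 11 groups * 300 + 1 group * 100 = 3400 rows total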

Contributor:

I was also confused by this and I like @romanlutz's suggested solution!

Author:

Changed to n_samples_per_group.

> OR change functionality (e.g., making that the total number of samples, but that's probably harder to achieve since the other way is already implemented)

I did implement this in the previous iteration, but I removed it here just to reduce complexity. If you think it would be useful, it wouldn't be hard to bring back.

Member:

I'm fine with n_samples_per_group. Would love to hear other @fairlearn/fairlearn-maintainers opinions, though!

[Three review threads on fairlearn/datasets/_synthetic.py; resolved.]
assert y.shape == (2500,)
assert gender.shape == (2500,)

def test_custom_make_sensitive_classification(self):
Member:

Would be awesome to have a test case with at least 2 sensitive features, too. In the current setup you seem to allow 0 as well; if that's still changing we don't need a test since it would be obsolete, but if you see 0 as a legitimate use case we'll need a test for it, too, of course.
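
A sketch of such a test, using the SensitiveFeature/SensitiveDatasetMaker API from this PR (the generation method name and the 300-rows-per-group default are assumptions based on this thread):

def test_two_sensitive_features(self):
    rng = np.random.RandomState(42)
    gender = SensitiveFeature('Gender', ['Man', 'Woman'])
    age = SensitiveFeature('Age', ['<40', '40+'])
    dataset = SensitiveDatasetMaker(sensitive_features=[gender, age], random_state=rng)
    X, y, features = dataset.make_sensitive_classification()  # hypothetical method name
    assert X.shape[0] == 1200  # 2 x 2 groups at 300 rows each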

Comment on lines 85 to 89
'n_features': 20,
'n_informative': 4,
'n_classes': 2,
'n_samples': 50,
'random_state': self.rng,
Member:

These seem like rather important default values that should be highlighted in the docstring of this method.

Comment on lines +15 to +22
""" `group_dict` is a dict assigning to a group in each sensitive
feature category. E.g.,::
{
'gender': 'Other',
'age': '50-60',
}
`classification_kwargs` is the set of kwargs that will be passed on to
make_classification.
Member:

It may be preferable to have a description of the class first, then an example, and then the parameters section.
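
For instance, following the ordering suggested here (description, then example, then parameters), reusing the content of the docstring under review:

"""Configuration for a single intersectional group.

Example::

    group_dict = {'gender': 'Other', 'age': '50-60'}

Parameters
----------
group_dict : dict
    Maps each sensitive feature name to this group's label.
classification_kwargs : dict
    Keyword arguments passed through to make_classification.
"""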

dataset = SensitiveDatasetMaker(sensitive_features=[gender_feature, age_feature], random_state=rng)

# need a convenience method here
dataset.configured_groups[('Man', '<21')].classification_kwargs['n_samples'] = 100
Member:

Don't we need to skew the parameters passed to make_classification? Otherwise the classes will be indistinguishable to the model.

@hildeweerts (Contributor) left a comment:

Many of the docstrings are incomplete or missing. Could you please make sure to add one to each class and function?

It is also not 100% clear to me what specific form of unfairness/bias is simulated in the dataset (which I expect will be much clearer once you've updated the docstrings and added more detailed comments to the example). Am I correct in assuming it includes (1) a different data distribution for each sensitive group, (2) option to generate different number of samples per sensitive group, but not (3) option to have different base rate (i.e., proportion of positives) for each sensitive group?

Thanks!

rng = np.random.RandomState(42)

gender_feature = SensitiveFeature('Gender', ['Man', 'Other', 'Unspecified', 'Woman'])
age_feature = SensitiveFeature('Age', ['<21', '21-64', '>64'])
Contributor:

Extremely nitpicky, but using 21 as an age cut-off is a bit US-centric :)

Author:

I've replaced these with more arbitrary values 😄


@coreysharris (Author):

> Am I correct in assuming it includes (1) a different data distribution for each sensitive group, (2) option to generate different number of samples per sensitive group, but not (3) option to have different base rate (i.e., proportion of positives) for each sensitive group?

Better documentation to come, but the short answer is that all three are included. (3) is randomized for each group, but can be overridden by the user.
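
Since classification_kwargs is passed through to sklearn's make_classification, overriding (3) might look like this (the group key here is illustrative; weights is make_classification's parameter for class proportions):

# pin the base rate for one group at ~20% positives instead of the randomized default
group = dataset.configured_groups[('Woman', '40+')]
group.classification_kwargs['weights'] = [0.8]  # 80% negatives, 20% positives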

@romanlutz (Member):

@coreysharris I just wanted to check in on this in case you need help on anything. Seems like the main missing piece is documentation. Let us know 🙂 We have weekly community calls in case you want to talk about anything: https://fairlearn.org/main/contributor_guide/index.html#community-calls

@romanlutz (Member):

@coreysharris I'm checking in on a few open PRs to see if the author is still interested in continuing. If not, someone else can pick it up from here and continue. Of course, if you are still interested it's best that you continue even if it's not right away (no hurry!). Just let us know! If there's no response by 12/28 (two weeks from today) I'll assume others can pick it up.

Merging this pull request may close: ENH Creating a synthetic example dataset (#793).