ENH Creating a synthetic example dataset #793

romanlutz · 2021-05-07T23:30:37Z

This is based on a Gitter conversation with @adrinjalali @hildeweerts @MiroDudik where we agreed that it would be nice to have a synthetic dataset available for our examples. @adrinjalali suggested the following code using sklearn's make_classification:

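import numpy as np
from numpy.random import RandomState
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Share one seeded generator so the whole script is reproducible.
rng = RandomState(seed=42)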

X_women, y_women = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=4,
    n_classes=2,
    class_sep=1,
    random_state=rng,
)

X_men, y_men = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=4,
    n_classes=2,
    class_sep=2,
    random_state=rng,
)

X_unspecified, y_unspecified = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=4,
    n_classes=2,
    class_sep=0.5,
    random_state=rng,
)

X = np.r_[X_women, X_men, X_unspecified]
y = np.r_[y_women, y_men, y_unspecified]
# One gender label per row, in the same order as the stacked blocks above.
gender = np.r_[["Woman"] * 500, ["Man"] * 500, ["Unspecified"] * 500]

X_train, X_test, y_train, y_test, gender_train, gender_test = train_test_split(
    X, y, gender, test_size=0.3, random_state=rng
)

@MiroDudik suggested extending this to have at least 2 sensitive features and 1 control feature to allow us to use it in basically all our examples.
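A minimal sketch of what that extension might look like, assuming two categorical sensitive features (gender and an age group) and one control feature. The subgroup names, the `class_sep` values, and the `region` control feature are all made up for illustration and were not agreed on in the conversation:

```python
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.RandomState(seed=42)

# Hypothetical subgroups: (gender, age_group) -> class_sep. The values are
# arbitrary and only make some subgroups easier to classify than others.
subgroup_class_sep = {
    ("Woman", "Under 40"): 1.0,
    ("Woman", "40 and over"): 0.5,
    ("Man", "Under 40"): 2.0,
    ("Man", "40 and over"): 1.5,
}
n_per_subgroup = 250

X_parts, y_parts, gender, age_group = [], [], [], []
for (gender_value, age_value), class_sep in subgroup_class_sep.items():
    X_sub, y_sub = make_classification(
        n_samples=n_per_subgroup,
        n_features=20,
        n_informative=4,
        n_classes=2,
        class_sep=class_sep,
        random_state=rng,
    )
    X_parts.append(X_sub)
    y_parts.append(y_sub)
    gender += [gender_value] * n_per_subgroup
    age_group += [age_value] * n_per_subgroup

X = np.concatenate(X_parts)
y = np.concatenate(y_parts)
gender = np.array(gender)
age_group = np.array(age_group)

# Hypothetical control feature, drawn independently of everything else.
region = rng.choice(["North", "South"], size=len(y))
```

Here `gender` and `age_group` would play the role of sensitive features and `region` the role of a control feature; whether the control feature should be correlated with the labels is an open design question.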

@fairlearn/fairlearn-maintainers any objections to putting this in the fairlearn.datasets module?


MiroDudik · 2021-05-07T23:45:40Z

I'd prefer to go with 4 categories: Women / Men / WriteIn / PreferNotSay... (see #792). This becomes slightly subtle, because "PreferNotSay" should be treated as "NaN" for the purposes of fairness evaluation (and I'm frankly not sure whether we deal with this properly everywhere... ehmm... perhaps we should have an issue to check that?).
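To make the "PreferNotSay as NaN" idea concrete, here is a small sketch using plain NumPy. It only illustrates one possible interpretation (drop those rows from the per-group evaluation); it is not a statement about how Fairlearn handles missing values, and the toy arrays are invented:

```python
import numpy as np

# Toy data, invented for illustration.
gender = np.array(
    ["Woman", "Man", "WriteIn", "PreferNotSay", "Woman", "PreferNotSay"],
    dtype=object,
)
y_pred = np.array([1, 0, 1, 1, 0, 0])

# Replace "PreferNotSay" with NaN for evaluation purposes.
gender_for_eval = gender.copy()
gender_for_eval[gender == "PreferNotSay"] = np.nan

# NaN never compares equal to anything, so detect missing entries explicitly.
is_missing = np.array(
    [isinstance(value, float) and np.isnan(value) for value in gender_for_eval]
)

# Selection rate per group, computed only over rows with a known group.
selection_rates = {
    group: float(y_pred[gender_for_eval == group].mean())
    for group in sorted(set(gender_for_eval[~is_missing]))
}
print(selection_rates)
```

The two "PreferNotSay" rows contribute to no group, which is the behavior one would want to verify wherever sensitive features are consumed.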

We could also pick a different sensitive feature like age.

adrinjalali · 2021-05-08T07:06:59Z

We could certainly give users quite a few knobs to tune when generating a synthetic dataset, like how many categorical sensitive features, how many continuous ones, etc. On the plus side, we could use these in our tests instead of downloading from OpenML.
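As a rough sketch of what such knobs could look like, here is a hypothetical helper that generates only the sensitive feature columns. The name `make_sensitive_features`, its parameters, and the chosen distributions are assumptions for illustration, not an existing or proposed Fairlearn API:

```python
import numpy as np


def make_sensitive_features(
    n_samples, n_categorical=1, n_continuous=1, n_categories=3, random_state=None
):
    """Hypothetical helper: draw sensitive feature columns.

    `random_state` is an int seed or None. Categorical columns hold integer
    codes in [0, n_categories); continuous columns are uniform on [18, 90),
    loosely mimicking an age in years.
    """
    rng = np.random.RandomState(random_state)
    categorical = rng.randint(0, n_categories, size=(n_samples, n_categorical))
    continuous = rng.uniform(18, 90, size=(n_samples, n_continuous))
    return categorical, continuous


categorical, continuous = make_sensitive_features(
    n_samples=100, n_categorical=2, n_continuous=1, random_state=0
)
print(categorical.shape, continuous.shape)
```

Because no download is involved, a helper along these lines would also be cheap to call from tests.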

MiroDudik · 2021-05-12T13:15:32Z

Oh. I like the idea of a synthetic fully reproducible generator with a couple of knobs.

Not sure what others think about "fully reproducible". I think it's really important, but it could be a bit tricky if we take dependencies on libraries whose random number generators might change. @adrinjalali, do you have thoughts on this? (I might be putting the cart before the horse here :-)

Zuzah · 2021-05-18T02:11:39Z

Hey @romanlutz,

I'll take a stab at this issue and see where I can take it. I'll be tackling it as part of the PyCon 2021 sprints and will use scikit-learn based on the specs described above.

adrinjalali · 2021-05-31T13:03:43Z

@MiroDudik so many packages depend on NumPy's random number generators that whenever the NumPy developers need something new, they add a new module or function instead of changing the existing one. The old generators have always stayed backward compatible and keep producing the same values.
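A quick sketch of what that means in practice: the legacy `RandomState` interface and the newer `default_rng` generator live side by side, and each is deterministic for a fixed seed. This only checks reproducibility within a single run; it does not by itself demonstrate stability across NumPy versions:

```python
import numpy as np

# Legacy interface: two generators with the same seed give the same draws.
legacy_a = np.random.RandomState(seed=42).normal(size=5)
legacy_b = np.random.RandomState(seed=42).normal(size=5)
legacy_match = np.array_equal(legacy_a, legacy_b)

# Newer interface, added alongside the legacy one rather than replacing it.
new_a = np.random.default_rng(42).normal(size=5)
new_b = np.random.default_rng(42).normal(size=5)
new_match = np.array_equal(new_a, new_b)

print(legacy_match, new_match)
```

For a fully reproducible dataset generator, the practical upshot is to pick one of the two interfaces and always pass an explicit seed.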

romanlutz · 2021-06-23T17:30:26Z

I unassigned @Zuzah who wants to make it available to whoever has time at the moment. Please reply here if you'd like to pick it up!

coreysharris · 2021-07-17T16:33:58Z

Hi @romanlutz, I'd like to help here.

romanlutz added the enhancement and help wanted labels May 7, 2021

romanlutz assigned Zuzah May 18, 2021

romanlutz mentioned this issue Jun 21, 2021

TST find and fix slow tests #850

Open

romanlutz unassigned Zuzah Jun 23, 2021

romanlutz mentioned this issue Jul 8, 2021

Add control features to metric plots #668

Open

romanlutz assigned coreysharris Jul 17, 2021

coreysharris linked a pull request Jul 17, 2021 that will close this issue

FEAT Synthetic dataset creation #907

Open
