ENH Creating a synthetic example dataset #793

Open
romanlutz opened this issue May 7, 2021 · 7 comments · May be fixed by #907

@romanlutz
Member

This is based on a Gitter conversation with @adrinjalali @hildeweerts @MiroDudik where we agreed that it would be nice to have a synthetic dataset available for our examples. @adrinjalali suggested the following code using sklearn's make_classification:

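import numpy as np
from numpy.random import RandomState
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# One shared generator so every step below is reproducible from a single seed
rng = RandomState(seed=42)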

X_women, y_women = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=4,
    n_classes=2,
    class_sep=1,
    random_state=rng,
)

X_men, y_men = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=4,
    n_classes=2,
    class_sep=2,
    random_state=rng,
)

X_unspecified, y_unspecified = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=4,
    n_classes=2,
    class_sep=0.5,
    random_state=rng,
)

X = np.r_[X_women, X_men, X_unspecified]
y = np.r_[y_women, y_men, y_unspecified]
gender = np.r_[["Woman"] * 500, ["Man"] * 500, ["Unspecified"] * 500]

X_train, X_test, y_train, y_test, gender_train, gender_test = train_test_split(
    X, y, gender, test_size=0.3, random_state=rng
)

@MiroDudik suggested extending this to have at least 2 sensitive features and 1 control feature to allow us to use it in basically all our examples.
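One way this extension could look is sketched below. It is only an illustration, not an agreed design: the second sensitive feature (`age_group`), its category names, the per-group `class_sep` values, and the way the control feature is drawn (independently of both the groups and the label) are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.RandomState(seed=42)

# Hypothetical (gender, age_group) combinations and their class separation.
# Lower class_sep makes the classification task harder for that group.
group_class_sep = {
    ("Woman", "Under40"): 1.0,
    ("Woman", "Over40"): 0.8,
    ("Man", "Under40"): 2.0,
    ("Man", "Over40"): 1.5,
}
n_per_group = 250

X_parts, y_parts, gender_parts, age_parts = [], [], [], []
for (gender_value, age_value), class_sep in group_class_sep.items():
    X_group, y_group = make_classification(
        n_samples=n_per_group,
        n_features=20,
        n_informative=4,
        n_classes=2,
        class_sep=class_sep,
        random_state=rng,
    )
    X_parts.append(X_group)
    y_parts.append(y_group)
    gender_parts.append(np.full(n_per_group, gender_value))
    age_parts.append(np.full(n_per_group, age_value))

X = np.concatenate(X_parts)
y = np.concatenate(y_parts)
gender = np.concatenate(gender_parts)        # first sensitive feature
age_group = np.concatenate(age_parts)        # second sensitive feature (assumed)

# Control feature (assumed): drawn independently of the groups and the label.
control = rng.choice(["Low", "High"], size=len(y))
```

Generating each group combination separately keeps the per-group difficulty explicit, at the cost of one `make_classification` call per combination.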

@fairlearn/fairlearn-maintainers any objections to putting this in the fairlearn.datasets module?

@romanlutz added the enhancement and help wanted labels on May 7, 2021
@MiroDudik
Member

I'd prefer to go with 4 categories: Women / Men / WriteIn / PreferNotSay... (see #792). This becomes slightly subtle, because "PreferNotSay" should be treated as "NaN" for the purposes of fairness evaluation (and I'm frankly not sure whether we deal with this properly everywhere... ehmm... perhaps we should have an issue to check that?).

We could also pick a different sensitive feature like age.
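To make the "PreferNotSay as NaN" idea concrete, here is a minimal sketch using plain pandas. It does not show how Fairlearn handles this case; the toy data and the choice to drop the missing rows before computing a per-group accuracy are assumptions for illustration only.

```python
import pandas as pd

# Toy data, made up for this example
gender = pd.Series(
    ["Woman", "Man", "WriteIn", "PreferNotSay", "Woman", "PreferNotSay"]
)
y_true = pd.Series([1, 0, 1, 1, 0, 0])
y_pred = pd.Series([1, 0, 0, 1, 1, 0])

# Treat "PreferNotSay" as a missing value rather than as its own group
gender_for_eval = gender.mask(gender == "PreferNotSay")

# Keep only rows where the sensitive feature is known
known = gender_for_eval.notna()
correct = (y_true == y_pred)[known]

# Per-group accuracy over the remaining rows
accuracy_by_group = correct.groupby(gender_for_eval[known]).mean()
print(accuracy_by_group)
```

Whether such rows should be dropped, or kept for overall metrics and excluded only from the group breakdown, is exactly the kind of question the check mentioned above would need to settle.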

@adrinjalali
Member

We could certainly give users quite a few knobs to tune when generating a synthetic dataset, like how many categorical sensitive features, how many continuous ones, etc. On the plus side, we could use these in our tests instead of downloading from OpenML.
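A rough sketch of what such a generator with knobs might look like follows. The function name `make_synthetic_dataset`, its parameters, and the column names are hypothetical, and for simplicity the sensitive features here are drawn independently of the label, which a real implementation would probably not want.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.utils import check_random_state


def make_synthetic_dataset(
    n_samples=1000,
    n_features=20,
    n_informative=4,
    n_categorical_sensitive=1,
    n_continuous_sensitive=1,
    n_categories=3,
    random_state=None,
):
    """Hypothetical generator: returns X, y and a DataFrame of sensitive features."""
    rng = check_random_state(random_state)
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=n_informative,
        n_classes=2,
        random_state=rng,
    )
    sensitive = {}
    # Categorical sensitive features: integer codes 0..n_categories-1
    for i in range(n_categorical_sensitive):
        sensitive[f"categorical_{i}"] = rng.randint(0, n_categories, size=n_samples)
    # Continuous sensitive features: uniform on [0, 1)
    for i in range(n_continuous_sensitive):
        sensitive[f"continuous_{i}"] = rng.uniform(size=n_samples)
    return X, y, pd.DataFrame(sensitive)


X, y, sensitive = make_synthetic_dataset(
    n_samples=200, n_categorical_sensitive=2, random_state=0
)
```

Because everything is generated locally from a seed, a function along these lines could be called directly from tests without any network access.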

@MiroDudik
Member

Oh. I like the idea of a synthetic fully reproducible generator with a couple of knobs.

Not sure what others think about "fully reproducible". I think it's really important, but it could be a bit tricky if we take dependencies on libraries that rely on random seed generators that might change. @adrinjalali, do you have thoughts on this? (I might be putting the cart before the horse here :-)

@Zuzah

Zuzah commented May 18, 2021

Hey @romanlutz,

I'll take a stab at this issue and see where I can take it. I'll be tackling it as part of the PyCon 2021 sprint and will use scikit-learn, based on the specs described above.

@adrinjalali
Member

@MiroDudik so many packages depend on numpy's random number generators that whenever the numpy developers need to create something new, they add a new module or function. The old stuff has always been backward compatible, generating the same values.
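A quick way to sanity-check reproducibility within a given environment is to generate the data twice from the same seed and compare. The sketch below does this with the legacy `numpy.random.RandomState` used in the snippet at the top of this issue; the helper name `generate` is made up for the example, and this only shows same-environment reproducibility, not stability across library versions.

```python
import numpy as np
from sklearn.datasets import make_classification


def generate(seed):
    """Hypothetical helper: build one dataset from a freshly seeded RandomState."""
    rng = np.random.RandomState(seed)
    return make_classification(
        n_samples=100,
        n_features=20,
        n_informative=4,
        n_classes=2,
        class_sep=1.0,
        random_state=rng,
    )


X_a, y_a = generate(42)
X_b, y_b = generate(42)
X_c, y_c = generate(7)

# Same seed: identical arrays; different seed: different data
same_seed_matches = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)
different_seed_differs = not np.array_equal(X_a, X_c)
print(same_seed_matches, different_seed_differs)
```

A test along these lines, pinned to stored reference values, would be one way to notice if an upstream change ever altered the generated data.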

@romanlutz
Member Author

I unassigned @Zuzah who wants to make it available to whoever has time at the moment. Please reply here if you'd like to pick it up!

@coreysharris

Hi @romanlutz, I'd like to help here.

@coreysharris linked a pull request on Jul 17, 2021 that will close this issue