ENH Creating a synthetic example dataset #793
Comments
I'd prefer to go with 4 categories: Women / Men / WriteIn / PreferNotSay (see #792). This becomes slightly subtle, because "PreferNotSay" should be treated as "NaN" for the purposes of fairness evaluation (and I'm frankly not sure whether we handle this properly everywhere; perhaps we should open an issue to check that?). We could also pick a different sensitive feature, such as age.
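To make the "PreferNotSay" point concrete, here is a minimal sketch of one way to treat that category as missing before any per-group evaluation. The column name `gender` and the toy values are assumptions for illustration only, and this is not a statement about how fairlearn currently handles missing sensitive features.

```python
import pandas as pd

# Hypothetical sensitive-feature column using the four categories above.
gender = pd.Series(
    ["Women", "Men", "WriteIn", "PreferNotSay", "Men", "Women"],
    name="gender",
)

# Treat "PreferNotSay" as missing rather than as a group of its own.
# Series.mask replaces entries where the condition is True with NaN.
gender_for_eval = gender.mask(gender == "PreferNotSay")

# Rows with a missing sensitive feature are left out of per-group counts.
known = gender_for_eval.notna()
group_counts = gender_for_eval[known].value_counts()
```

Whether such rows should be dropped, or kept and reported separately, is a design choice this sketch does not settle.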
We could certainly give users quite a few knobs for generating a synthetic dataset, such as how many categorical sensitive features and how many continuous ones. On the plus side, we could use these datasets in our tests instead of downloading from OpenML.
Oh, I like the idea of a synthetic, fully reproducible generator with a couple of knobs. Not sure what others think about "fully reproducible". I think it's really important, but it could be a bit tricky if we take dependencies on libraries whose random number generators might change. @adrinjalali, do you have thoughts on this? (I might be putting the cart before the horse here :-)
Hey @romanlutz, I'll take a stab at this issue and see where I can take it. I'll be tackling it as part of the PyCon 2021 sprint, using scikit-learn based on the specs described above.
@MiroDudik so many packages depend on numpy's random number generators that whenever numpy needs to introduce something new, it adds a new module or function. The old functionality has always stayed backward compatible, generating the same values.
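As a small illustration of that reproducibility point, the sketch below draws from numpy's legacy `RandomState` twice with the same seed and once with a different seed. The helper name `draw` is hypothetical; the choice of `RandomState` rather than the newer `Generator` API is an assumption made here because the legacy stream is the one numpy documents as frozen.

```python
import numpy as np

def draw(seed, n=5):
    # RandomState is numpy's legacy generator; for a given seed it is
    # documented to keep producing the same stream of values.
    rng = np.random.RandomState(seed)
    return rng.normal(size=n)

first = draw(42)
second = draw(42)   # same seed, so the same values as `first`
other = draw(43)    # different seed, so different values
```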
I unassigned @Zuzah, who wants to make this available to whoever has time at the moment. Please reply here if you'd like to pick it up!
Hi @romanlutz, I'd like to help here.
This is based on a Gitter conversation with @adrinjalali @hildeweerts @MiroDudik where we agreed that it would be nice to have a synthetic dataset available for our examples. @adrinjalali suggested the following code using `sklearn`'s `make_classification`:

@MiroDudik suggested extending this to have at least 2 sensitive features and 1 control feature to allow us to use it in basically all our examples.
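A minimal sketch of what such a generator could look like, assuming scikit-learn's `make_classification` for the non-sensitive features. This is not the snippet from the Gitter conversation: the function name `make_synthetic_dataset`, its parameters, the category labels, and the sampling probabilities are all hypothetical, and the sensitive and control features are drawn independently of the label, so the result carries no built-in disparity.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

def make_synthetic_dataset(n_samples=1000, n_features=10, seed=0):
    """Hypothetical helper; not an existing fairlearn function."""
    # A single seeded RandomState drives every random draw below.
    rng = np.random.RandomState(seed)

    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=4,
        random_state=rng,
    )
    X = pd.DataFrame(X, columns=[f"x{i}" for i in range(n_features)])

    # Two categorical sensitive features (labels and probabilities assumed).
    sensitive = pd.DataFrame({
        "gender": rng.choice(
            ["Women", "Men", "WriteIn", "PreferNotSay"],
            size=n_samples,
            p=[0.45, 0.45, 0.05, 0.05],
        ),
        "age_group": rng.choice(["<30", "30-60", ">60"], size=n_samples),
    })

    # One control feature, also drawn independently of the label.
    control = pd.Series(rng.choice(["low", "high"], size=n_samples), name="control")

    return X, pd.Series(y, name="y"), sensitive, control
```

Calling `make_synthetic_dataset(seed=0)` twice should return identical frames, which is the property the tests would rely on.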
@fairlearn/fairlearn-maintainers any objection to putting this in the `fairlearn.datasets` module?