Add missing values and categorical features when generating datasets #28952

lcrmorin · 2024-05-05T08:07:08Z

Describe the workflow you want to enable

I often use random datasets (typically generated with make_classification). However, I often find myself having to add more realistic features to the dataset:

missing data: sometimes just to test the pipeline (missing at random would be fine), and sometimes to study more complex phenomena (missingness not at random, possibly depending on the target)
categorical features: categorical variables often need to be handled specifically. I usually introduce them by binning a continuous value, then converting the bins to strings.
It would be nice to have both of those in dataset generation.
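As a rough illustration of the manual workflow described above, here is a minimal sketch. The column names, the 10% missing rate, and the quartile labels are arbitrary assumptions for the example, not anything scikit-learn provides.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)

# Categorical feature: bin the first continuous column into quartiles,
# then convert the bin labels to plain strings.
x0_cat = pd.qcut(X[:, 0], q=4, labels=["q1", "q2", "q3", "q4"]).astype(str)

# Missing completely at random: replace roughly 10% of the cells with NaN.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.10] = np.nan

df = pd.DataFrame(X_missing, columns=[f"x{i}" for i in range(X.shape[1])])
df["x0_cat"] = x0_cat
```

The categories are computed before the NaNs are injected, so the string column stays complete while the numeric columns carry the missing values.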

Describe your proposed solution

Introduce parameters to allow for generation of missing data (proportion of missingness, type of missingness: at random, not at random).
Introduce parameters to allow for generation of categorical features (number of features, distribution across categories: even, uneven, Pareto).
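To make the proposal more concrete, the parameters could look something like the sketch below. Both helpers (`add_missing_values`, `make_categorical_feature`) and all their argument names are hypothetical; they are not an existing or planned scikit-learn API. The "target_dependent" mechanism simply makes class 1 rows three times as likely to lose a value as class 0 rows, and "pareto" is approximated with a power-law weighting over category ranks.

```python
import numpy as np
from sklearn.datasets import make_classification


def add_missing_values(X, y, missing_rate=0.1, mechanism="mcar", random_state=None):
    """Hypothetical helper: return a float copy of X with NaNs injected."""
    rng = np.random.default_rng(random_state)
    if mechanism == "mcar":
        # Same missingness probability for every row.
        prob = np.full(len(X), missing_rate)
    elif mechanism == "target_dependent":
        # Rows with y == 1 are three times as likely to be masked as the others;
        # weights are rescaled so the average rate stays close to missing_rate.
        weights = np.where(y == 1, 1.5, 0.5)
        prob = np.clip(missing_rate * weights / weights.mean(), 0.0, 1.0)
    else:
        raise ValueError(f"unknown mechanism: {mechanism!r}")
    X_missing = X.astype(float, copy=True)
    X_missing[rng.random(X.shape) < prob[:, None]] = np.nan
    return X_missing


def make_categorical_feature(n_samples, n_categories=4, distribution="even",
                             random_state=None):
    """Hypothetical helper: draw one string-valued categorical column."""
    rng = np.random.default_rng(random_state)
    if distribution == "even":
        probs = np.full(n_categories, 1.0 / n_categories)
    elif distribution == "pareto":
        # Power-law weights 1, 1/2, 1/3, ... so a few categories dominate.
        weights = 1.0 / np.arange(1, n_categories + 1)
        probs = weights / weights.sum()
    else:
        raise ValueError(f"unknown distribution: {distribution!r}")
    categories = np.array([f"cat_{i}" for i in range(n_categories)])
    return rng.choice(categories, size=n_samples, p=probs)


X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, random_state=0
)
X_missing = add_missing_values(
    X, y, missing_rate=0.2, mechanism="target_dependent", random_state=0
)
cat_feature = make_categorical_feature(
    len(X), n_categories=4, distribution="pareto", random_state=0
)
```

Keeping the two concerns in separate functions is only one possible design; they could equally be keyword arguments on the generators themselves.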

Describe alternatives you've considered, if relevant

I usually handle this by hand.

Additional context

Could be used to illustrate imputation and encoding techniques.

oasidorshin · 2024-05-06T08:19:11Z

@lcrmorin This would be great for testing! I would also suggest adding infinities as possible values, because they also break things quite often. Also, if the values are randomly generated, make sure to always include at least one NaN and one inf value.
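A minimal sketch of that idea, assuming a hypothetical helper named `inject_special_values` (not part of scikit-learn): it scatters NaN, +inf and -inf over randomly chosen cells and forces the first two picks to be one NaN and one inf, so both are always present.

```python
import numpy as np


def inject_special_values(X, rate=0.05, random_state=None):
    """Hypothetical helper: return a float copy of X containing NaN and inf."""
    rng = np.random.default_rng(random_state)
    X_out = np.array(X, dtype=float)  # np.array copies, so X is left untouched
    if X_out.size < 2:
        raise ValueError("need at least two cells to place one NaN and one inf")
    # Always corrupt at least two cells, even when rate is very small.
    n_special = min(X_out.size, max(2, int(round(rate * X_out.size))))
    flat_idx = rng.choice(X_out.size, size=n_special, replace=False)
    values = rng.choice([np.nan, np.inf, -np.inf], size=n_special)
    # Guarantee at least one NaN and at least one infinity.
    values[0] = np.nan
    values[1] = np.inf
    X_out.flat[flat_idx] = values
    return X_out


X_clean = np.zeros((10, 4))
X_dirty = inject_special_values(X_clean, rate=0.1, random_state=0)
```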

AK3847 · 2024-05-06T19:54:42Z

@lcrmorin I suggest adding a noise function or something similar that can generate structured randomness, so that the data has some meaningful structure rather than pure pseudo-randomness. Perhaps something like Perlin noise?

glemaitre · 2024-05-14T13:25:16Z

Regarding the missing values, I recall the following issues/PRs: #6284 / #7084. It seems that the consensus was to have something similar to the ampute R package.

I recall an almost similar discussion for categorical features, but I could not find it. For sure, it would be handy to have those two parameters, even though we could limit the complexity (e.g. only have a single missingness pattern).

glemaitre · 2024-05-16T16:57:04Z

Regarding the categorical features, we have the following related issue: #12433

lcrmorin added the Needs Triage and New Feature labels May 5, 2024

glemaitre removed the Needs Triage label May 14, 2024

glemaitre changed the title ~~Improve random datasets~~ Add missing values and categorical features when generating datasets May 16, 2024
