Feature request: OneHotEncoder copying Feature names #18753

d-kleine · 2020-11-03T13:52:04Z

Hi,

I would like to ask if it would be possible to add a feature to sklearn's OneHotEncoder:

to automatically create feature names that follow the column names of the input data
to have the one-hot encoded features replace the input columns in the original dataframe
to have the values of the one-hot encoded features always be the integers 0 and 1 (easier to read; currently they are the floats 0.0 and 1.0)

Describe the workflow you want to enable

I would like to have a more comfortable way of applying one-hot encoding, as implemented in pdpipe (which uses sklearn's OHE):

# example data
import pandas as pd
df_example = pd.DataFrame({'groups': ["test1", "test2", "test3", "test4"],
                          'population': [10000, 20000, 30000, 40000]})
df_example

# copying example data for demonstration
df_example_pdp = df_example.copy()
df_example_skl = df_example.copy()

# pdpipe OHE 
import pdpipe as pdp 
onehot = pdp.OneHotEncode(['groups'], drop_first=False) 
df_example_pdp = onehot.fit_transform(df_example_pdp) 
df_example_pdp.head()

Describe your proposed solution

Output should look like this:

# sklearn OHE 
from sklearn.preprocessing import OneHotEncoder 
enc = OneHotEncoder(handle_unknown='ignore')
enc_fitted = enc.fit_transform(df_example_skl[['groups']])
# get_feature_names is the 0.23.x API; scikit-learn 1.0 and later use get_feature_names_out
column_name = enc.get_feature_names(['groups'])
enc_df = pd.DataFrame(enc_fitted.toarray().astype(int), columns=column_name)
df_example_skl = df_example_skl.join(enc_df) 
df_example_skl.head()

Describe alternatives you've considered, if relevant

pdpipe handles this well: the selected features are one-hot encoded and the original columns are replaced by the encoded ones. The generated labels are simple and easy to understand, integers are easier to read than floats, and the syntax is short and readable. I hope to see this in sklearn's OneHotEncoder directly.

Additional context

Tested with sklearn's newest version 0.23.2 and older

NicolasHug · 2020-11-03T15:33:03Z

Hi,

to automatically create feature names that follow the column names of the input data

Feature names propagation is tracked in scikit-learn/enhancement_proposals#48 and related PRs / issues

to have the one-hot encoded features replace the input columns in the original dataframe

Unlikely to happen: transform doesn't change the data in place. We're thinking about outputting dataframe-like objects (see other SLEPs), but that's slightly different.
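
Since transform returns a new array rather than modifying its input, the replacement has to happen on the pandas side. A minimal sketch of one way to do that follows; `one_hot_replace` is a hypothetical helper, not part of scikit-learn, and the `column_category` naming is built by hand from `categories_` so that it does not depend on which feature-name method a given release offers.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def one_hot_replace(df, columns):
    """Hypothetical helper: return a NEW dataframe in which `columns`
    are replaced by integer one-hot columns. The input is not modified."""
    enc = OneHotEncoder(handle_unknown="ignore", dtype=int)
    # The encoder returns a sparse matrix by default; densify it.
    encoded = enc.fit_transform(df[columns]).toarray()
    # Build "<column>_<category>" names from the fitted categories.
    names = [
        f"{col}_{cat}"
        for col, cats in zip(columns, enc.categories_)
        for cat in cats
    ]
    enc_df = pd.DataFrame(encoded, columns=names, index=df.index)
    # Drop the source columns and append the encoded ones.
    return pd.concat([df.drop(columns=columns), enc_df], axis=1)


df_example = pd.DataFrame({
    "groups": ["test1", "test2", "test3", "test4"],
    "population": [10000, 20000, 30000, 40000],
})
df_replaced = one_hot_replace(df_example, ["groups"])
print(df_replaced.head())
```

The encoded columns are appended at the end here; placing them at the position of the original column would need extra reordering.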

to have the values of the one-hot encoded features always be the integers 0 and 1 (easier to read; currently they are the floats 0.0 and 1.0)

You can select a dtype with the dtype parameter. However, if you use the OHE in a ColumnTransformer and the final array has a mix of real-valued data and integer data, it will be upcast to floats.
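
A short sketch of both behaviours, under stated assumptions: the `ratio` column is a hypothetical real-valued feature added only for illustration, and `sparse_threshold=0` is set so the ColumnTransformer returns a dense array whose dtype is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "groups": ["test1", "test2", "test3", "test4"],
    "ratio": [0.1, 0.2, 0.3, 0.4],  # hypothetical real-valued column
})

# On its own, the encoder honours the requested integer dtype.
enc = OneHotEncoder(handle_unknown="ignore", dtype=int)
encoded = enc.fit_transform(df[["groups"]]).toarray()
print(encoded.dtype)  # an integer dtype

# Inside a ColumnTransformer, the integer block is stacked next to the
# float column, so the combined array is upcast to a float dtype.
ct = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore", dtype=int), ["groups"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
mixed = ct.fit_transform(df)
print(mixed.dtype)  # a float dtype
```

The upcast comes from stacking into a single homogeneous NumPy array, which can hold only one dtype.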

Note that, for now, NumPy arrays are first-class citizens in scikit-learn, not dataframes.

d-kleine added the New Feature label Nov 3, 2020

NicolasHug closed this as completed Nov 3, 2020
