Feature request: OneHotEncoder copying Feature names #18753

Closed
d-kleine opened this issue Nov 3, 2020 · 1 comment

d-kleine commented Nov 3, 2020

Hi,

I would like to ask if it would be possible to add a feature to sklearn's OneHotEncoder:

  1. Automatically create feature names based on the column names of the input data.
  2. Have the one-hot encoded features replace the input columns in the original dataframe.
  3. Always output the one-hot encoded values as the integers 0 and 1 (easier to read; currently they are the floats 0.0 and 1.0).

Describe the workflow you want to enable

I would like a more convenient way to apply one-hot encoding, as implemented in pdpipe (which uses sklearn's OHE):

# example data
import pandas as pd
df_example = pd.DataFrame({'groups': ["test1", "test2", "test3", "test4"],
                          'population': [10000, 20000, 30000, 40000]})
df_example

# copying example data for demonstration
df_example_pdp = df_example.copy()
df_example_skl = df_example.copy()

# pdpipe OHE 
import pdpipe as pdp 
onehot = pdp.OneHotEncode(['groups'], drop_first=False) 
df_example_pdp = onehot.fit_transform(df_example_pdp) 
df_example_pdp.head()

Describe your proposed solution

The output should look like what this manual workaround produces:

# sklearn OHE 
from sklearn.preprocessing import OneHotEncoder 
enc = OneHotEncoder(handle_unknown='ignore')
enc_fitted = enc.fit_transform(df_example_skl[['groups']])
column_name = enc.get_feature_names(['groups'])
enc_df = pd.DataFrame(enc_fitted.toarray().astype(int), columns=column_name)
df_example_skl = df_example_skl.join(enc_df) 
df_example_skl.head()

Describe alternatives you've considered, if relevant

pdpipe handles this well: the features are one-hot encoded and the original columns are replaced by the encoded ones. The column labels are simple and easy to understand, integers are easier to read than floats, and the syntax is short and readable. I hope to see this in sklearn's OneHotEncoder directly.

Additional context

Tested with sklearn's newest version 0.23.2 and older

NicolasHug (Member) commented

Hi,

> to automatically create feature names in such way columns were named in the input data

Feature name propagation is tracked in scikit-learn/enhancement_proposals#48 and related PRs / issues.
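As a rough illustration of deriving output names from the input column, here is a minimal sketch. It assumes that newer scikit-learn releases (1.0 and later) expose `get_feature_names_out`, and it falls back to the `get_feature_names` call used earlier in this issue for older releases such as 0.23.2. The single-column frame is only example data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Example data mirroring the 'groups' column from the issue.
df = pd.DataFrame({"groups": ["test1", "test2", "test3", "test4"]})
enc = OneHotEncoder(handle_unknown="ignore").fit(df[["groups"]])

# Assumption: get_feature_names_out exists on newer releases;
# older releases only provide get_feature_names.
if hasattr(enc, "get_feature_names_out"):
    names = enc.get_feature_names_out(["groups"])
else:
    names = enc.get_feature_names(["groups"])

names = list(names)
print(names)
```

Either call builds each name from the input column name and the category value, so the encoded columns stay traceable to their source column.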

> one-hot encoded features are replacing input columns in original dataframe

Unlikely to happen: transform doesn't change the data in place. We're thinking about outputting dataframe-like objects (see other SLEPs), but that's slightly different.
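Until something like that exists, the replacement can be done outside the encoder. The sketch below uses a hypothetical helper, `one_hot_replace`, which is not part of scikit-learn: it returns a new dataframe and leaves the input untouched, in line with transform not working in place. The column names are built from the fitted `categories_` attribute.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def one_hot_replace(df, columns):
    """Return a new DataFrame where `columns` are replaced by one-hot columns.

    Hypothetical helper for illustration only; the input frame is not modified.
    """
    enc = OneHotEncoder(handle_unknown="ignore", dtype=int)
    encoded = enc.fit_transform(df[columns])  # sparse matrix by default
    # Name each output column "<input column>_<category>".
    names = [
        f"{col}_{cat}"
        for col, cats in zip(columns, enc.categories_)
        for cat in cats
    ]
    enc_df = pd.DataFrame(encoded.toarray(), columns=names, index=df.index)
    # Drop the original columns and append the encoded ones.
    return pd.concat([df.drop(columns=columns), enc_df], axis=1)


df_example = pd.DataFrame({
    "groups": ["test1", "test2", "test3", "test4"],
    "population": [10000, 20000, 30000, 40000],
})
df_encoded = one_hot_replace(df_example, ["groups"])
print(df_encoded.head())
```

Note that this sketch appends the encoded columns at the end of the frame rather than at the position of the column they replace.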

> that values of one-hot encoded features should always be 0 and 1 (easier to read; currently, they are floats 0.0 and 1.0)

You can select a dtype with the dtype parameter. However, if you use the OHE in a ColumnTransformer and the final array mixes real-valued and integer data, it will be upcast to floats.
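A small sketch of both points follows. The `density` column is a hypothetical real-valued feature added only to trigger the upcast, and `sparse_threshold=0` is set so that the ColumnTransformer returns a dense array whose dtype is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "groups": ["test1", "test2", "test3", "test4"],
    "density": [0.5, 1.5, 2.5, 3.5],  # hypothetical real-valued column
})

# On its own, the encoder honours the requested integer dtype.
enc = OneHotEncoder(handle_unknown="ignore", dtype=int)
encoded = enc.fit_transform(df[["groups"]]).toarray()
print(encoded.dtype)

# Combined with a float column, the stacked output is upcast to float.
ct = ColumnTransformer(
    [
        ("ohe", OneHotEncoder(handle_unknown="ignore", dtype=int), ["groups"]),
        ("num", "passthrough", ["density"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
mixed = ct.fit_transform(df)
print(mixed.dtype)
```

The upcast happens because a single NumPy array holds one dtype, so integer and float columns stacked together end up as floats.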

Note that, for now, NumPy arrays are the first-class citizens of scikit-learn, not dataframes.
