
API options for Pandas output #20258

Closed · thomasjpfan opened this issue Jun 13, 2021 · 8 comments
Labels: API, Hard

@thomasjpfan (Member) commented Jun 13, 2021

Related to:

This issue summarizes the API options for outputting pandas dataframes, illustrated with a typical data science use case:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])

In all of the following options, pipe[-1].feature_names_in_ is used to get the feature names seen by LogisticRegression. All options require feature_names_in_ to enforce column-name consistency between fit and transform.
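For example, a minimal sketch of that lookup (assuming y_train exists and the preprocessor is configured to emit dataframes under whichever option is chosen; the printed names are made up):

pipe.fit(X_train_df, y_train)
# the final estimator records the column names it was fit on
print(pipe[-1].feature_names_in_)
# e.g. ['num__age', 'num__fare', 'cat__sex_female', ...]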

Option 1: output kwarg in transform

All transformers would accept an output='pandas' kwarg in transform. To configure transformers to output dataframes during fit:

# passes `output="pandas"` to all steps during `transform`
pipe.fit(X_train_df, transform_output="pandas")

# output of preprocessing as a pandas dataframe
pipe[:-1].transform(X_train_df, output="pandas")

During fit, Pipeline would pass output="pandas" to every step's transform method, so the original pipeline definition does not need to change. This option requires meta-estimators that contain transformers, such as Pipeline and ColumnTransformer, to forward output="pandas" to every transformer.transform, as in the sketch below.
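A minimal sketch of that forwarding, using a hypothetical two-step meta-estimator (illustrative only, not the actual Pipeline implementation):

class MiniPipeline:
    """Hypothetical meta-estimator showing the Option 1 forwarding."""

    def __init__(self, transformer, estimator):
        self.transformer = transformer
        self.estimator = estimator

    def fit(self, X, y=None, transform_output=None):
        # the intermediate transform during fit honors the requested format
        self.transformer.fit(X, y)
        Xt = self.transformer.transform(X, output=transform_output)
        self.estimator.fit(Xt, y)
        return self

    def transform(self, X, output=None):
        # forward the requested output format to the inner transformer
        return self.transformer.transform(X, output=output)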

Option 2: __init__ parameter

All transformers would accept a transform_output parameter in __init__:

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median',
                              transform_output="pandas")),
    ('scaler', StandardScaler(transform_output="pandas"))])

categorical_transformer = OneHotEncoder(handle_unknown='ignore', transform_output="pandas")

preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)],
    transform_output="pandas")

pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])
          
# All transformers are configured to output dataframes
pipe.fit(X_train_df)
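Internally, each transformer would then consult its own flag when producing output. A rough sketch of such a transform (hypothetical; assumes a get_feature_names_out-style method that provides the output column names):

import pandas as pd

def transform(self, X):
    # existing ndarray-producing logic
    Xt = self._transform(X)
    if self.transform_output == "pandas":
        # wrap the ndarray, reusing the input index when available
        index = getattr(X, "index", None)
        return pd.DataFrame(Xt, columns=self.get_feature_names_out(), index=index)
    return Xt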

Option 2b: Have a global config for transform_output

For a better user experience, we can have a global config. By default, transform_output would be set to 'global' in all transformers, meaning each transformer defers to the global setting.

import sklearn
sklearn.set_config(transform_output="pandas")

pipe = ...
pipe.fit(X_train_df)
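Resolving the 'global' default inside a transformer could then look like this (a sketch; the "transform_output" config key is hypothetical and is not part of sklearn.get_config today):

from sklearn import get_config

def _resolved_transform_output(self):
    # 'global' means: defer to whatever sklearn.set_config(...) set
    if self.transform_output == "global":
        return get_config().get("transform_output", "default")
    return self.transform_output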

Option 3: Use SLEP 006

Have all transformers request the output format. Similar to Option 1, every transformer needs an output='pandas' kwarg in transform, but the kwarg is only routed to transformers that requested it:

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median').request_for_transform(output=True)),
    ('scaler', StandardScaler().request_for_transform(output=True))])

categorical_transformer = OneHotEncoder(handle_unknown='ignore').request_for_transform(output=True)

preprocessor = (ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
        .request_for_transform(output=True))

pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])
                      
pipe.fit(X_train_df, output="pandas")

Option 3b: Have a global config for request

For a better user experience, we can have a global config:

import sklearn
sklearn.set_config(request_for_transform={"output": True})

pipe = ...
pipe.fit(X_train_df, output="pandas")

Summary

Options 2 and 3 are very similar in that both require every transformer to be adjusted, which is not the best API/UX. Options 2b and 3b try to simplify the API with a global config. Overall, I think Option 1 has the best user experience.

CC: @amueller @ogrisel @glemaitre @adrinjalali @lorentzenchr @jnothman @GaelVaroquaux

@adrinjalali (Member)

The issue with the global config is that we haven't figured out how to handle that nicely in a multi-process setting, have we?

I think from the user's perspective, option 2 makes more sense since it's not really a request.

Also, when I think about third party meta estimators, I'm not sure which option is better.

@ogrisel (Member) commented Jun 15, 2021

> The issue with the global config is that we haven't figured out how to handle that nicely in a multi-process setting, have we?

In the context of scikit-learn we have a workaround that works:

# remove when https://github.com/joblib/joblib/issues/1071 is fixed
import functools
from functools import update_wrapper

from sklearn import config_context, get_config

def delayed(function):
    """Decorator used to capture the arguments of a function."""
    @functools.wraps(function)
    def delayed_function(*args, **kwargs):
        return _FuncWrapper(function), args, kwargs
    return delayed_function

class _FuncWrapper:
    """Load the global configuration before calling the function."""
    def __init__(self, function):
        self.function = function
        self.config = get_config()
        update_wrapper(self, self.function)

    def __call__(self, *args, **kwargs):
        # replay the captured configuration in the worker process
        with config_context(**self.config):
            return self.function(*args, **kwargs)
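For example, wrapping a function with this delayed captures the caller's configuration at submission time, so worker processes see the same settings:

from joblib import Parallel
from sklearn import config_context, get_config

with config_context(assume_finite=True):
    # each task carries the config captured when it was created
    results = Parallel(n_jobs=2)(delayed(get_config)() for _ in range(2))

assert all(config["assume_finite"] for config in results)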

@ogrisel (Member) commented Jun 15, 2021

I have the feeling that option 3 would be unnecessarily verbose.

Option 2 and option 2b are not necessarily mutually exclusive, no?

From an implementation point of view, option 2b (and maybe option 2) would impose the use of a decorator on all transformers, right? Alternatively, we could provide the implementation of a public transform method in TransformerMixin and ask the subclasses to implement a private _transform abstract method (see the sketch below). My worry is how to handle the docstring and not break IDE autocomplete based on static code inspection.
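A sketch of that split (illustrative only, not the actual TransformerMixin code):

class TransformerMixin:
    def transform(self, X):
        """Public entry point: delegate, then wrap the output."""
        Xt = self._transform(X)  # subclasses implement the actual logic
        # wrapping into a dataframe would happen here, based on the
        # configured transform_output
        return Xt

    def _transform(self, X):
        # abstract: each transformer implements its numpy logic here
        raise NotImplementedError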

@lorentzenchr (Member)

For larger pipelines, option 1 is my personal favorite as a user.

@GaelVaroquaux (Member) commented Jun 15, 2021 via email

@thomasjpfan (Member, Author)

> From an implementation point of view, option 2b (and maybe option 2) would impose the use of a decorator on all transformers, right? Alternatively, we could provide the implementation of a public transform method in TransformerMixin and ask the subclasses to implement a private _transform abstract method. My worry is how to handle the docstring and not break IDE autocomplete based on static code inspection.

@ogrisel Option 2b without the __init__ parameter is very close to my original PR with a global config: #16772. I think we decided not to go down the path of having a global config.

As for implementation, I would prefer not to hide it in a mixin and would prefer something like #20100. The idea is to use self._validate_data to record the column names, with a decorator around transform handling the wrapping of the output into a pandas dataframe. As an alternative, I can see a more symmetric approach that does not rely on self._validate_data, where we have two decorators: one for fit (record_column_names) and one for transform (wrap_transform), as sketched below.
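A rough sketch of those two decorators (record_column_names and wrap_transform are the names proposed above; the bodies here are hypothetical):

import functools
import pandas as pd

def record_column_names(fit):
    @functools.wraps(fit)
    def wrapped(self, X, *args, **kwargs):
        # remember the input columns seen during fit
        if hasattr(X, "columns"):
            self.feature_names_in_ = list(X.columns)
        return fit(self, X, *args, **kwargs)
    return wrapped

def wrap_transform(transform):
    @functools.wraps(transform)
    def wrapped(self, X, *args, **kwargs):
        Xt = transform(self, X, *args, **kwargs)
        if getattr(self, "transform_output", None) == "pandas":
            # wrap the ndarray output into a dataframe
            return pd.DataFrame(Xt, columns=self.get_feature_names_out())
        return Xt
    return wrapped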

@lorentzenchr (Member)

Can we close this issue, as SLEP018 was accepted?

@thomasjpfan (Member, Author)

I agree, we can close this issue.
