
API options for Pandas output #20258

Closed · thomasjpfan opened this issue Jun 13, 2021 · 8 comments
Labels: API, Hard

@thomasjpfan (Member) commented Jun 13, 2021

Related to:

This issue summarizes the API options for outputting pandas dataframes, illustrated with a typical data science use case:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])

In all of the following options, pipe[-1].feature_names_in_ is used to get the feature names seen by LogisticRegression. All options require feature_names_in_ to enforce column-name consistency between fit and transform.
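For example, a minimal sketch of that lookup (assuming y_train exists and the preprocessor is configured to emit dataframes under whichever option is chosen; the printed names are made up):

pipe.fit(X_train_df, y_train)
# the final estimator records the column names it was fit on
print(pipe[-1].feature_names_in_)
# e.g. ['num__age', 'num__fare', 'cat__sex_female', ...]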

Option 1: output kwarg in transform

All transformers would accept an output='pandas' kwarg in transform. To configure transformers to output dataframes during fit:

# passes `output="pandas"` to all steps during `transform`
pipe.fit(X_train_df, transform_output="pandas")

# output of preprocessing as a pandas dataframe
pipe[:-1].transform(X_train_df, output="pandas")

During fit, Pipeline would pass output="pandas" to every step's transform method, so the original pipeline definition does not need to change. This option requires meta-estimators that contain transformers, such as Pipeline and ColumnTransformer, to forward output="pandas" to every transformer.transform, as in the sketch below.
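A minimal sketch of that forwarding, using a hypothetical two-step meta-estimator (illustrative only, not the actual Pipeline implementation):

class MiniPipeline:
    """Hypothetical meta-estimator showing the Option 1 forwarding."""

    def __init__(self, transformer, estimator):
        self.transformer = transformer
        self.estimator = estimator

    def fit(self, X, y=None, transform_output=None):
        # the intermediate transform during fit honors the requested format
        self.transformer.fit(X, y)
        Xt = self.transformer.transform(X, output=transform_output)
        self.estimator.fit(Xt, y)
        return self

    def transform(self, X, output=None):
        # forward the requested output format to the inner transformer
        return self.transformer.transform(X, output=output)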

Option 2: __init__ parameter

All transformers would accept a transform_output parameter in __init__:

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median',
                              transform_output="pandas")),
    ('scaler', StandardScaler(transform_output="pandas"))])

categorical_transformer = OneHotEncoder(handle_unknown='ignore', transform_output="pandas")

preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)],
    transform_output="pandas")

pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])
          
# All transformers are configured to output dataframes
pipe.fit(X_train_df)
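Internally, each transformer would then consult its own flag when producing output. A rough sketch of such a transform (hypothetical; assumes a get_feature_names_out-style method that provides the output column names):

import pandas as pd

def transform(self, X):
    # existing ndarray-producing logic
    Xt = self._transform(X)
    if self.transform_output == "pandas":
        # wrap the ndarray, reusing the input index when available
        index = getattr(X, "index", None)
        return pd.DataFrame(Xt, columns=self.get_feature_names_out(), index=index)
    return Xt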

Option 2b: Have a global config for transform_output

For a better user experience, we can have a global config. By default, transform_output would be set to 'global' in all transformers, meaning each transformer defers to the global setting.

import sklearn
sklearn.set_config(transform_output="pandas")

pipe = ...
pipe.fit(X_train_df)
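Resolving the 'global' default inside a transformer could then look like this (a sketch; the "transform_output" config key is hypothetical and is not part of sklearn.get_config today):

from sklearn import get_config

def _resolved_transform_output(self):
    # 'global' means: defer to whatever sklearn.set_config(...) set
    if self.transform_output == "global":
        return get_config().get("transform_output", "default")
    return self.transform_output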

Option 3: Use SLEP 006

Have all transformers request the output format. Similar to Option 1, every transformer needs an output='pandas' kwarg in transform, but the kwarg is only routed to transformers that requested it:

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median').request_for_transform(output=True)),
    ('scaler', StandardScaler().request_for_transform(output=True))])

categorical_transformer = OneHotEncoder(handle_unknown='ignore').request_for_transform(output=True)

preprocessor = (ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
        .request_for_transform(output=True))

pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])
                      
pipe.fit(X_train_df, output="pandas")

Option 3b: Have a global config for request

For a better user experience, we can have a global config:

import sklearn
sklearn.set_config(request_for_transform={"output": True})

pipe = ...
pipe.fit(X_train_df, output="pandas")

Summary

Options 2 and 3 are very similar in that both require every transformer to be adjusted, which is not the best API/UX. Options 2b and 3b try to simplify the API with a global config. Overall, I think Option 1 has the best user experience.

CC: @amueller @ogrisel @glemaitre @adrinjalali @lorentzenchr @jnothman @GaelVaroquaux

@adrinjalali (Member)

The issue with the global config is that we haven't figured out how to handle that nicely in a multi-process setting, have we?

I think from the user's perspective, option 2 makes more sense since it's not really a request.

Also, when I think about third party meta estimators, I'm not sure which option is better.

@ogrisel (Member) commented Jun 15, 2021

> The issue with the global config is that we haven't figured out how to handle that nicely in a multi-process setting, have we?

In the context of scikit-learn we have a workaround that works:

# remove when https://github.com/joblib/joblib/issues/1071 is fixed
import functools
from functools import update_wrapper

from sklearn import config_context, get_config

def delayed(function):
    """Decorator used to capture the arguments of a function."""
    @functools.wraps(function)
    def delayed_function(*args, **kwargs):
        return _FuncWrapper(function), args, kwargs
    return delayed_function

class _FuncWrapper:
    """Load the global configuration before calling the function."""
    def __init__(self, function):
        self.function = function
        self.config = get_config()
        update_wrapper(self, self.function)

    def __call__(self, *args, **kwargs):
        # replay the captured configuration in the worker process
        with config_context(**self.config):
            return self.function(*args, **kwargs)
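For example, wrapping a function with this delayed captures the caller's configuration at submission time, so worker processes see the same settings:

from joblib import Parallel
from sklearn import config_context, get_config

with config_context(assume_finite=True):
    # each task carries the config captured when it was created
    results = Parallel(n_jobs=2)(delayed(get_config)() for _ in range(2))

assert all(config["assume_finite"] for config in results)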

@ogrisel (Member) commented Jun 15, 2021

I have the feeling that option 3 would be unnecessarily verbose.

Option 2 and option 2b are not necessarily mutually exclusive, no?

From an implementation point of view, option 2b (and maybe option 2) would impose the use of a decorator on all transformers, right? Alternatively, we could provide the implementation of a public transform method in TransformerMixin and ask the subclasses to implement a private _transform abstract method (see the sketch below). My worry is how to handle the docstring and not break IDE autocomplete based on static code inspection.
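A sketch of that split (illustrative only, not the actual TransformerMixin code):

class TransformerMixin:
    def transform(self, X):
        """Public entry point: delegate, then wrap the output."""
        Xt = self._transform(X)  # subclasses implement the actual logic
        # wrapping into a dataframe would happen here, based on the
        # configured transform_output
        return Xt

    def _transform(self, X):
        # abstract: each transformer implements its numpy logic here
        raise NotImplementedError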

@lorentzenchr (Member)

For larger pipelines, option 1 is my personal favorite as a user.

@GaelVaroquaux (Member) commented Jun 15, 2021 via email

@thomasjpfan (Member, Author)

> From an implementation point of view, option 2b (and maybe option 2) would impose the use of a decorator on all transformers, right? Alternatively, we could provide the implementation of a public transform method in TransformerMixin and ask the subclasses to implement a private _transform abstract method. My worry is how to handle the docstring and not break IDE autocomplete based on static code inspection.

@ogrisel Option 2b without the __init__ parameter is very close to my original PR with a global config: #16772. I think we decided not to go down the path of having a global config.

As for implementation, I would prefer not to hide it in a mixin and would prefer something like #20100. The idea is to use self._validate_data to record the column names, with a decorator around transform handling the wrapping of the output into a pandas dataframe. As an alternative, I can see a more symmetric approach that does not rely on self._validate_data, where we have two decorators: one for fit (record_column_names) and one for transform (wrap_transform), as sketched below.
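A rough sketch of those two decorators (record_column_names and wrap_transform are the names proposed above; the bodies here are hypothetical):

import functools
import pandas as pd

def record_column_names(fit):
    @functools.wraps(fit)
    def wrapped(self, X, *args, **kwargs):
        # remember the input columns seen during fit
        if hasattr(X, "columns"):
            self.feature_names_in_ = list(X.columns)
        return fit(self, X, *args, **kwargs)
    return wrapped

def wrap_transform(transform):
    @functools.wraps(transform)
    def wrapped(self, X, *args, **kwargs):
        Xt = transform(self, X, *args, **kwargs)
        if getattr(self, "transform_output", None) == "pandas":
            # wrap the ndarray output into a dataframe
            return pd.DataFrame(Xt, columns=self.get_feature_names_out())
        return Xt
    return wrapped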

@lorentzenchr (Member)

Can we close this issue, as SLEP018 was accepted?

@thomasjpfan (Member, Author)

I agree, we can close this issue.
