API options for Pandas output #20258
Comments
The issue with the global config is that we haven't figured out how to handle it nicely in a multi-process setting, have we? I think from the user's perspective, option 2 makes more sense since it's not really a request. Also, when I think about third-party meta-estimators, I'm not sure which option is better.
In the context of scikit-learn we have a workaround that works: scikit-learn/sklearn/utils/fixes.py, lines 187 to 205 at 6d67937.
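For illustration, the kind of workaround referenced above can be sketched as a `delayed`-style wrapper that snapshots the global config at dispatch time and restores it inside the worker, so workers see the caller's settings rather than their own defaults. This is a minimal sketch, not the actual `fixes.py` code; the `_global_config`/`get_config` names here are stand-ins for scikit-learn's real config helpers.

```python
import threading

# Hypothetical global configuration, standing in for sklearn's get_config()/set_config().
_global_config = {"transform_output": "default"}

def get_config():
    return dict(_global_config)

class delayed:
    """Snapshot the global config at dispatch time and restore it in the worker."""
    def __init__(self, func):
        self.func = func
        self.config = get_config()  # snapshot taken in the dispatching thread

    def __call__(self, *args, **kwargs):
        old = get_config()
        _global_config.update(self.config)  # restore the caller's snapshot
        try:
            return self.func(*args, **kwargs)
        finally:
            _global_config.update(old)

# Dispatch with one setting, then change it before the worker runs:
_global_config["transform_output"] = "pandas"
task = delayed(lambda: get_config()["transform_output"])
_global_config["transform_output"] = "default"

result = []
t = threading.Thread(target=lambda: result.append(task()))
t.start()
t.join()
print(result[0])  # → pandas: the worker saw the config from dispatch time
```

The same snapshot-and-restore idea extends to process-based workers, as long as the snapshot is picklable.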
I have the feeling that option 3 would be unnecessarily verbose. Options 2 and 2b are not necessarily mutually exclusive, no? From an implementation point of view, option 2b (and maybe option 2) would impose the use of a decorator on all transformers, right? Or would we provide the implementation of a public …
For larger pipelines, option 1 is my personal favorite as a user.
> Options 2 and 3 are very similar because they require every transformer to be adjusted. This is not the best API/UX. Options 2b and 3b try to simplify the API with a global config. Overall, I think Option 1 has the best user experience.

I agree with your analysis.
Would it be interesting to have a version of option 1 where the default behavior is controlled by a global flag and can be overridden by passing an argument to the transformer?
@ogrisel Option 2b without the … As for implementation, I would prefer not to hide it in a mixin and would prefer something like #20100. The idea is to use …
Can we close this, as SLEP018 was accepted?

I agree, we can close this issue.
Related to:
This issue summarizes all the options for pandas output in a typical data science use case:
In all of the following options, `pipe[-1].feature_names_in_` is used to get the feature names used in `LogisticRegression`. All options require `feature_names_in_` to enforce column name consistency between `fit` and `transform`.

Option 1: `output` kwarg in `transform`
All transformers will accept an `output='pandas'` kwarg in `transform`. To configure transformers to output dataframes during `fit`, `Pipeline` will pass `output="pandas"` to every transform method during `fit`. The original pipeline does not need to change. This option requires meta-estimators with transformers, such as `Pipeline` and `ColumnTransformer`, to pass `output="pandas"` to every `transformer.transform`.
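Option 1 can be sketched with toy stand-ins for `StandardScaler`, `LogisticRegression`, and `Pipeline` (the real classes have much richer APIs; this only illustrates the proposed `output` kwarg and how a meta-estimator would forward it):

```python
import numpy as np
import pandas as pd

class StandardScaler:
    """Toy transformer sketch: hypothetical `output` kwarg on transform."""
    def fit(self, X, y=None):
        self.feature_names_in_ = np.asarray(X.columns)
        self.mean_ = np.asarray(X).mean(axis=0)
        self.scale_ = np.asarray(X).std(axis=0)
        return self

    def transform(self, X, output=None):
        Xt = (np.asarray(X) - self.mean_) / self.scale_
        if output == "pandas":
            return pd.DataFrame(Xt, columns=self.feature_names_in_)
        return Xt

class LogisticRegression:
    """Toy final estimator: only records feature_names_in_."""
    def fit(self, X, y=None):
        self.feature_names_in_ = np.asarray(X.columns) if hasattr(X, "columns") else None
        return self

class Pipeline:
    """Toy meta-estimator forwarding output="pandas" to each transform."""
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None, output=None):
        for name, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X, output=output)
        self.steps[-1][1].fit(X, y)
        return self

X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, output="pandas")
print(pipe.steps[-1][1].feature_names_in_)  # → ['a' 'b']: column names survive
```

Because the forwarding happens inside the meta-estimator, user code only adds the `output="pandas"` argument at the `fit` call site.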
Option 2: `__init__` parameter
All transformers will accept a `transform_output` parameter in `__init__`:
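A minimal sketch of this option (and of option 2b's `'global'` sentinel below), again with a toy transformer and a hypothetical module-level `_config` dict standing in for a real global config:

```python
import numpy as np
import pandas as pd

# Hypothetical global config used by option 2b's 'global' sentinel.
_config = {"transform_output": "default"}

class StandardScaler:
    """Toy sketch of options 2/2b: output format chosen in __init__."""
    def __init__(self, transform_output="global"):
        self.transform_output = transform_output

    def fit(self, X, y=None):
        self.feature_names_in_ = np.asarray(X.columns)
        self.mean_ = np.asarray(X).mean(axis=0)
        self.scale_ = np.asarray(X).std(axis=0)
        return self

    def transform(self, X):
        Xt = (np.asarray(X) - self.mean_) / self.scale_
        out = self.transform_output
        if out == "global":          # option 2b: defer to the global config
            out = _config["transform_output"]
        if out == "pandas":
            return pd.DataFrame(Xt, columns=self.feature_names_in_)
        return Xt

X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# Option 2: per-estimator setting
Xt = StandardScaler(transform_output="pandas").fit(X).transform(X)
print(type(Xt).__name__)  # → DataFrame

# Option 2b: flip the global default instead, no per-estimator argument
_config["transform_output"] = "pandas"
Xt2 = StandardScaler().fit(X).transform(X)
print(type(Xt2).__name__)  # → DataFrame
```

The `'global'` default is what makes options 2 and 2b composable: a per-estimator setting always wins, and the global config only fills in when the user did not choose explicitly.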
Option 2b: Have a global config for `transform_output`
For a better user experience, we can have a global config. By default, `transform_output` is set to `'global'` in all transformers.

Option 3: Use SLEP 006
Have all transformers request `output`. Similar to Option 1, every transformer needs an `output='pandas'` kwarg in `transform`.
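The request mechanism can be sketched as follows. The `set_transform_request` spelling is a hypothetical rendering of SLEP 006's metadata-request idea, not a confirmed API: each transformer declares which metadata it accepts, and the meta-estimator routes `output` only to steps that requested it.

```python
class Scaler:
    """Toy sketch of option 3: a transformer requests the `output` metadata
    (hypothetical spelling of SLEP 006's request mechanism)."""
    def __init__(self):
        self._requests = {}

    def set_transform_request(self, **kwargs):
        self._requests.update(kwargs)  # e.g. output=True
        return self

    def fit(self, X, y=None):
        return self

    def transform(self, X, output=None):
        return f"{'pandas' if output == 'pandas' else 'ndarray'} result"

class Pipeline:
    """Toy router: forwards each metadata item only to steps that requested it."""
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, X, **metadata):
        for step in self.steps:
            routed = {k: v for k, v in metadata.items()
                      if step._requests.get(k)}
            X = step.fit(X).transform(X, **routed)
        return X

pipe = Pipeline([Scaler().set_transform_request(output=True)])
out = pipe.fit_transform([[1.0]], output="pandas")
print(out)  # → pandas result
```

This shows why option 3 is verbose: every transformer must both accept the kwarg and explicitly request it before any routing happens.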
Option 3b: Have a global config for request
For a better user experience, we can have a global config:
Summary
Options 2 and 3 are very similar because they require every transformer to be adjusted. This is not the best API/UX. Options 2b and 3b try to simplify the API with a global config. Overall, I think Option 1 has the best user experience.
CC: @amueller @ogrisel @glemaitre @adrinjalali @lorentzenchr @jnothman @GaelVaroquaux