With get_feature_names_out complete, I am currently reworking the SLEP for pandas output. I am thinking of only covering transformers in the SLEP to reduce the scope. This issue covers the complete idea for pandas output that covers all methods that return arrays: transform, predict, etc.
Feature selection based on column names with cross validation
Using HistGradientBoosting to select categories based on dtype
Text preprocessing with sparse data
Proposal
TLDR: The proposal is to add a set_output method to configure the output container. With set_output(transform="pandas"), the output of the estimator is a pandas DataFrame. In #16772, I showed that sparse data suffers a performance regression. To work around this, I propose set_output(transform="frame_or_sparse"), which returns a DataFrame for dense data and a custom SKCSRMatrix for sparse data. SKCSRMatrix is a subclass of csr_matrix, so it will work with existing code.
Future: Predictions
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.set_output(
    predict_proba="pandas",
    predict="pandas",
    decision_function="pandas",
)
log_reg.fit(X_df, y)

# classes are the column names
X_pred = log_reg.predict_proba(X)

# categorical where classes are the categories
X_pred = log_reg.predict(X)

# binary case, series with name=classes_[1]
# multiclass case, dataframe with columns=classes_
X_pred = log_reg.decision_function(X)
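For comparison, the containers described in those comments can be built by hand from ordinary array output. This is only an illustration of the intended shapes for the binary case, with made-up toy data; it does not use the proposed set_output options.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy binary problem (illustrative data only).
X_df = pd.DataFrame({"x1": [0.0, 1.0, 2.0, 3.0], "x2": [1.0, 0.0, 1.0, 0.0]})
y = pd.Series(["no", "no", "yes", "yes"])

log_reg = LogisticRegression().fit(X_df, y)

# predict_proba: one column per class.
proba = pd.DataFrame(
    log_reg.predict_proba(X_df), columns=log_reg.classes_, index=X_df.index
)

# predict: categorical whose categories are the classes.
pred = pd.Series(
    pd.Categorical(log_reg.predict(X_df), categories=log_reg.classes_),
    index=X_df.index,
)

# decision_function, binary case: Series named after classes_[1].
decision = pd.Series(
    log_reg.decision_function(X_df), name=log_reg.classes_[1], index=X_df.index
)
```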
Future: Pipeline with prediction
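from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

log_reg = make_pipeline(
    StandardScaler(),  # only uses `transform="pandas"`
    LogisticRegression(),  # only uses `predict="pandas"`
).set_output(
    predict="pandas",
    transform="pandas",
)
log_reg.fit(X_df, y)

# DataFrame
y_pred = log_reg.predict(X_df)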
@thomasjpfan What about also adding a global value that could be set instead of having to ensure one sets this on each pipeline/transformer/estimator? E.g. sklearn.set_config(set_output='pandas') for everything to be set to pandas, or sklearn.set_config(set_output={'transform':'pandas'}) for just transformations to be set to output as pandas?
A global configuration option would enable a better user experience and can work within the proposal. The biggest issue with global configuration is that it adds another layer of complexity to writing a third-party transformer. We can likely mitigate the issue by providing enough utilities so that pandas output that follows sklearn's global configuration becomes easy to implement.
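One way such a utility could resolve the output container is sketched below. Everything here is hypothetical: `set_global_output`, `resolve_output`, and the `"default"` fallback name are assumptions for illustration, not part of sklearn. The sketch assumes an estimator-level setting takes precedence over the global one.

```python
# Methods whose output container could be configured (taken from the issue).
METHODS = ("transform", "predict", "predict_proba", "decision_function")

# Hypothetical stand-in for a library-wide configuration store.
_global_output = {}


def set_global_output(config):
    """Accept a single container name or a per-method mapping."""
    _global_output.clear()
    if isinstance(config, str):
        _global_output.update({method: config for method in METHODS})
    else:
        _global_output.update(config)


def resolve_output(method, estimator_output=None):
    """Estimator-level setting wins, then the global one, then 'default'."""
    estimator_output = estimator_output or {}
    if method in estimator_output:
        return estimator_output[method]
    return _global_output.get(method, "default")


set_global_output({"transform": "pandas"})
print(resolve_output("transform"))
print(resolve_output("predict"))
```

A third-party transformer would then only need to call something like `resolve_output("transform", ...)` before wrapping its result, instead of reading the global configuration itself.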
API Prototype
I put together a functional prototype of this API that you can explore in this colab notebook. Here is a rendered version of the demo. The demo includes the following use cases:
See the rendered notebook for the API in various use cases.
Future Extensions
These are items that are not in the SLEP.
Future: Predictions
Future: Pipeline with prediction
CC @amueller @glemaitre