
Pandas Output Proposal Outline #23001

Closed
thomasjpfan opened this issue Mar 31, 2022 · 2 comments · Fixed by #23734

@thomasjpfan
Member

thomasjpfan commented Mar 31, 2022

With get_feature_names_out complete, I am currently reworking the SLEP for pandas output. I am thinking of only covering transformers in the SLEP to reduce the scope. This issue covers the complete idea for pandas output that covers all methods that return arrays: transform, predict, etc.

API Prototype

I put together a functional prototype of this API that you can explore in this colab notebook. Here is a rendered version of the demo. The demo includes the following use cases:

  • DataFrame output from a single transformer
  • ColumnTransformer with DataFrame output
  • Feature selection based on column names with cross validation
  • Using HistGradientBoosting to select categories based on dtype
  • Text preprocessing with sparse data

Proposal

TLDR: The proposal is to add a set_output method for configuring the output container. With set_output(transform="pandas"), the estimator's output is a pandas DataFrame. In #16772, I have shown that sparse data will have a performance regression. To work around this, I propose set_output(transform="frame_or_sparse"), which returns a DataFrame for dense data and a custom SKCSRMatrix for sparse data. SKCSRMatrix is a subclass of csr_matrix, so it will continue to work with existing code.

See the rendered notebook to see the API in various use cases.

Future Extensions

These are items that are not in the SLEP:

Future: Predictions

log_reg = LogisticRegression()
log_reg.set_output(
    predict_proba="pandas",
    predict="pandas",
    decision_function="pandas",
)
log_reg.fit(X_df, y)

# classes are the column names
X_pred = log_reg.predict_proba(X)

# categorical where classes are the categories
X_pred = log_reg.predict(X)

# binary case, series with name=classes_[1]
# multiclass case, dataframe with columns=classes_
X_pred = log_reg.decision_function(X)

Future: Pipeline with prediction

log_reg = make_pipeline(
    StandardScaler(),  # only uses `transform="pandas"`
    LogisticRegression(), # only uses `predict="pandas"`
).set_output(
    predict="pandas",
    transform="pandas",
)

log_reg.fit(X_df, y)

# DataFrame
y_pred = log_reg.predict(X_df)

CC @amueller @glemaitre

@cab938

cab938 commented Apr 17, 2022

@thomasjpfan What about also adding a global value that could be set instead of having to ensure one sets this on each pipeline/transformer/estimator? E.g. sklearn.set_config(set_output='pandas') for everything to be set to pandas, or sklearn.set_config(set_output={'transform':'pandas'}) for just transformations to be set to output as pandas?

@thomasjpfan
Member Author

A global configuration option would enable a better user experience and can work within the proposal. The biggest issue with a global configuration is that it adds another layer of complexity for writing a third-party transformer. We can likely mitigate this by providing enough utilities that pandas output (respecting sklearn's global configuration) becomes easy to implement.
