
Pandas Output Proposal Outline #23001

Closed
thomasjpfan opened this issue Mar 31, 2022 · 2 comments · Fixed by #23734

@thomasjpfan
Member

thomasjpfan commented Mar 31, 2022

With get_feature_names_out complete, I am currently reworking the SLEP for pandas output. I am thinking of only covering transformers in the SLEP to reduce the scope. This issue covers the complete idea for pandas output that covers all methods that return arrays: transform, predict, etc.

API Prototype

I put together a functional prototype of this API that you can explore in this colab notebook. Here is a rendered version of the demo. The demo includes the following use cases:

  • DataFrame output from a single transformer
  • ColumnTransformer with DataFrame output
  • Feature selection based on column names with cross validation
  • Using HistGradientBoosting to select categories based on dtype
  • Text preprocessing with sparse data

Proposal

TLDR: The proposal is to add a set_output method for configuring the output container. With set_output(transform="pandas"), the estimator's output is a pandas DataFrame. In #16772, I have shown that sparse data will have a performance regression. To work around this, I propose set_output(transform="frame_or_sparse"), which returns a DataFrame for dense data and a custom SKCSRMatrix for sparse data. SKCSRMatrix is a subclass of csr_matrix, so it will continue to work with existing code.

See the rendered notebook to see the API in various use cases.

Future Extensions

These are items that are not in the SLEP:

Future: Predictions

log_reg = LogisticRegression()
log_reg.set_output(
    predict_proba="pandas",
    predict="pandas",
    decision_function="pandas",
)
log_reg.fit(X_df, y)

# classes are the column names
X_pred = log_reg.predict_proba(X)

# categorical where classes are the categories
X_pred = log_reg.predict(X)

# binary case, series with name=classes_[1]
# multiclass case, dataframe with columns=classes_
X_pred = log_reg.decision_function(X)

Future: Pipeline with prediction

log_reg = make_pipeline(
    StandardScaler(),  # only uses `transform="pandas"`
    LogisticRegression(), # only uses `predict="pandas"`
).set_output(
    predict="pandas",
    transform="pandas",
)

log_reg.fit(X_df, y)

# DataFrame
y_pred = log_reg.predict(X_df)

CC @amueller @glemaitre

@cab938

cab938 commented Apr 17, 2022

@thomasjpfan What about also adding a global value that could be set instead of having to ensure one sets this on each pipeline/transformer/estimator? E.g. sklearn.set_config(set_output='pandas') for everything to be set to pandas, or sklearn.set_config(set_output={'transform':'pandas'}) for just transformations to be set to output as pandas?

@thomasjpfan
Member Author

A global configuration option would enable a better user experience and can work within the proposal. The biggest issue with a global configuration is that it adds another layer of complexity for writing a third-party transformer. We can likely mitigate this by providing enough utilities that pandas output (respecting sklearn's global configuration) becomes easy to implement.
