Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Adds array_out="pandas" to transformers in preprocessing module #20100

Closed

Conversation

thomasjpfan
Copy link
Member

Reference Issues/PRs

Toward #5523
Alternative to #16772

What does this implement/fix? Explain your changes.

Adds array_out="pandas" to the transformers in preprocessing module. Here are the decision decisions I made:

  1. The DataFrame returned by transform uses the index of the input.
  2. feature_names_in_ is a sequence of names and does not need to be a numpy array.
  3. There is no restriction on what type the feature names can be. By default, pd.DataFrame(X) uses integer column names and DictVectorizer.get_feature_names can output numerical feature names.

Any other comments?

  1. I noticed that stateless estimators such as Binarizer now have state because of n_features_in_ and now feature_names_in_. Should Binarizer.transform(df_test, array_out="pandas") use the column names of df_test if Binarizer.fit is not called?

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should get_feature_names or get_feature_names_out be deprecated and become private?

Overall it looks quite good to me.

X : array-like
The input samples.
reset : bool, default=True
If True, the `n_feature_names_` attribute is set to the feature
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If True, the `n_feature_names_` attribute is set to the feature
If True, the `feature_names_in_` attribute is set to the feature

Comment on lines +728 to +729
returned. If "default", an array-like without feature names is
returned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's probably worth explaining why this is not a numpy array, and sometimes a sparse array.

@@ -706,10 +740,20 @@ def fit_transform(self, X, y=None, **fit_params):
# method is possible for a given clustering algorithm
if y is None:
# fit method of arity 1 (unsupervised transformation)
return self.fit(X, **fit_params).transform(X)
fitted = self.fit(X, **fit_params)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is fitted ever not self?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that also confused me. I think it's always self, right? Or at least should be.

@@ -269,7 +269,8 @@ def test_fit_docstring_attributes(name, Estimator):
est.fit(X, y)

skipped_attributes = {'x_scores_', # For PLS, TODO remove in 1.1
'y_scores_'} # For PLS, TODO remove in 1.1
'y_scores_', # For PLS, TODO remove in 1.1
'feature_names_in_'} # Ignore for now
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we gonna fix this in this PR? We don't have to, but if we don't should create a separate issue for a sprint or something.

return get_feature_names_out_callable


def _make_array_out(X_out, index, get_feature_names_out, *,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _make_array_out(X_out, index, get_feature_names_out, *,
def _make_array_out(X_out, *, index, get_feature_names_out,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, wouldn't it make more sense for this function to accept the feature_names_out instead of get_feature names_out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess not, because of the way it's used in the decorator?


def _make_array_out(X_out, index, get_feature_names_out, *,
array_out="default"):
"""Construct array container based on global configuration.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Construct array container based on global configuration.
"""Construct array container based on the value of `array_out`.

I think?

Comment on lines +64 to +65
returned. If "default", an array-like without feature names is
returned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
returned. If "default", an array-like without feature names is
returned.
returned. If "default", `X_out` is returned unmodified.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or should we actually make sure the output is not a dataframe?

if array_out == "default":
return X_out

estimator, X_orig = args[0], args[1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

silly question, what would these values be if the user calls est.transform(array_out='dataframe', X=X)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tuple unpacking error? args is only est in that case, right?

@thomasjpfan
Copy link
Member Author

From the dev meeting: We discussed how we want to see how the API of this PR compares to using a __init__ parameter. We already have some __init__ parameters that change the output of transform for example: sparse_threshold in ColumnTransformer.

I'll come up this other implementation within a week and write up how the API compares to this PR when used with meta-estimators.

Copy link
Member

@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was vaguely remembering having an implementation with a constructor parameter, but I think it was actually a constructor parameter that gets the input feature names, not whether to produce pandas dataframes lol... there have been sooo many iterations of this now. This looks good though.

@@ -438,6 +438,10 @@ Changelog
- |Feature| :class:`preprocessing.OrdinalEncoder` supports passing through
missing values by default. :pr:`19069` by `Thomas Fan`_.

- |Feature| Transformers in the :mod:`sklearn.preprocessing` have a `array_out`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is array_out a good name? that seems to imply... arrays. output_format? or output?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy with either output or output_format

@@ -706,10 +740,20 @@ def fit_transform(self, X, y=None, **fit_params):
# method is possible for a given clustering algorithm
if y is None:
# fit method of arity 1 (unsupervised transformation)
return self.fit(X, **fit_params).transform(X)
fitted = self.fit(X, **fit_params)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that also confused me. I think it's always self, right? Or at least should be.

return fitted.transform(X)

# array_out != "default"
transform_params = inspect.signature(fitted.transform).parameters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not just pass it, or is the error message not good in that case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean "can we just pass it"? If we don't pass it when it doesn't exist, it's a silent bug, and I think just passing it gives an error message which can be confusing to many people.

if array_out == "default":
return X_out

estimator, X_orig = args[0], args[1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tuple unpacking error? args is only est in that case, right?

@thomasjpfan
Copy link
Member Author

should get_feature_names or get_feature_names_out be deprecated and become private?

There is still a use case for get_feature_names when the output is a sparse matrix. There is still significant overhead between the scipy sparse and pandas dataframe roundtrip.

get_feature_names means we do not need to introduce a new API to get the output feature names:

poly = PolynomialFeatures().fit(X_df)

# Uses `feature_names_in_` by default as input
poly.get_feature_names()

# without `get_feature_names` a user would need to pass in some dummy data
# to get the feature names:
poly.transform(X_df).columns

# Another API would be to store `feature_names_out_`, which can be a property
# so we do not need to store another array of strings:
poly.feature_names_out_

I prefer a feature_names_out_ property, almost like n_features_out_ SLEP013.

@amueller
Copy link
Member

I don't have a strong opinion between poly.get_feature_names() and poly.feature_names_out_, any way forward would be good ;)

amueller
amueller previously approved these changes Jul 9, 2021
@lorentzenchr
Copy link
Member

This is superseded by #23734, right? Can we close then?

@thomasjpfan
Copy link
Member Author

Yes this has been superseded by #23734 and we can close.

@lorentzenchr lorentzenchr added the Superseded PR has been replace by a newer PR label Sep 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:preprocessing module:utils Superseded PR has been replace by a newer PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants