ENH Adds array_out="pandas" to transformers in preprocessing module #20100

thomasjpfan · 2021-05-16T22:20:00Z

Reference Issues/PRs

Toward #5523
Alternative to #16772

What does this implement/fix? Explain your changes.

Adds array_out="pandas" to the transformers in preprocessing module. Here are the design decisions I made:

The DataFrame returned by transform uses the index of the input.
feature_names_in_ is a sequence of names and does not need to be a numpy array.
There is no restriction on what type the feature names can be. By default, pd.DataFrame(X) uses integer column names and DictVectorizer.get_feature_names can output numerical feature names.
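A minimal sketch of the first and third decisions, assuming pandas and NumPy are installed. `to_frame` is a hypothetical helper written for illustration, not code from this PR.

```python
import numpy as np
import pandas as pd


def to_frame(X_out, index, feature_names):
    """Hypothetical helper: wrap a transformed array in a DataFrame."""
    return pd.DataFrame(X_out, index=index, columns=feature_names)


X_df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]}, index=["r1", "r2"])
X_out = X_df.to_numpy() * 2  # stand-in for a transformer's output

# The output reuses the index of the input.
out = to_frame(X_out, X_df.index, list(X_df.columns))

# Without explicit names, pandas falls back to integer column labels.
default_columns = list(pd.DataFrame(np.zeros((2, 3))).columns)
```

Because the default labels are integers, restricting feature names to strings would reject frames that pandas itself produces.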

Any other comments?

I noticed that stateless estimators such as Binarizer now have state because of n_features_in_ and now feature_names_in_. Should Binarizer.transform(df_test, array_out="pandas") use the column names of df_test if Binarizer.fit is not called?
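One possible answer to this question, sketched under assumptions: prefer the names recorded during fit and fall back to the columns of the data passed to transform. `resolve_feature_names` is hypothetical, and the `SimpleNamespace` objects are stand-ins for an estimator and a DataFrame.

```python
from types import SimpleNamespace


def resolve_feature_names(estimator, X):
    """Hypothetical helper: prefer names seen during fit, else X's columns."""
    fitted_names = getattr(estimator, "feature_names_in_", None)
    if fitted_names is not None:
        return list(fitted_names)
    columns = getattr(X, "columns", None)
    return None if columns is None else list(columns)


unfitted = SimpleNamespace()  # stateless estimator, fit never called
fitted = SimpleNamespace(feature_names_in_=["a", "b"])
df_test = SimpleNamespace(columns=["c", "d"])  # stand-in for a DataFrame
```

Whether the fallback is desirable is exactly the open question; the sketch only shows that it is cheap to express.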

adrinjalali

should get_feature_names or get_feature_names_out be deprecated and become private?

Overall it looks quite good to me.

adrinjalali · 2021-05-31T10:00:41Z

sklearn/base.py

+        X : array-like
+            The input samples.
+        reset : bool, default=True
+            If True, the `n_feature_names_` attribute is set to the feature


Suggested change

If True, the `n_feature_names_` attribute is set to the feature

If True, the `feature_names_in_` attribute is set to the feature

adrinjalali · 2021-05-31T10:02:41Z

sklearn/base.py

+            returned. If "default", an array-like without feature names is
+            returned.


it's probably worth explaining why this is not a numpy array, and sometimes a sparse array.

adrinjalali · 2021-05-31T10:03:37Z

sklearn/base.py

@@ -706,10 +740,20 @@ def fit_transform(self, X, y=None, **fit_params):
        # method is possible for a given clustering algorithm
        if y is None:
            # fit method of arity 1 (unsupervised transformation)
-            return self.fit(X, **fit_params).transform(X)
+            fitted = self.fit(X, **fit_params)


is fitted ever not self?

Yes that also confused me. I think it's always self, right? Or at least should be.

adrinjalali · 2021-05-31T12:17:10Z

sklearn/tests/test_docstring_parameters.py

@@ -269,7 +269,8 @@ def test_fit_docstring_attributes(name, Estimator):
        est.fit(X, y)

    skipped_attributes = {'x_scores_',  # For PLS, TODO remove in 1.1
-                          'y_scores_'}  # For PLS, TODO remove in 1.1
+                          'y_scores_',  # For PLS, TODO remove in 1.1
+                          'feature_names_in_'}  # Ignore for now


are we gonna fix this in this PR? We don't have to, but if we don't, we should create a separate issue for a sprint or something.

adrinjalali · 2021-05-31T12:20:36Z

sklearn/utils/_array_out.py

+    return get_feature_names_out_callable
+
+
+def _make_array_out(X_out, index, get_feature_names_out, *,


Suggested change

def _make_array_out(X_out, index, get_feature_names_out, *,

def _make_array_out(X_out, *, index, get_feature_names_out,

also, wouldn't it make more sense for this function to accept the feature_names_out instead of get_feature_names_out?

I guess not, because of the way it's used in the decorator?
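A sketch of the suggested keyword-only signature, which also shows one possible reason to keep the callable: the names are only computed when a non-default output is requested. `make_array_out` here is a hypothetical stand-in that returns a plain dict instead of a DataFrame, not the helper from this PR.

```python
def make_array_out(X_out, *, index, get_feature_names_out, array_out="default"):
    """Hypothetical stand-in for the helper under review."""
    if array_out == "default":
        return X_out  # names are never computed on this path
    # A plain dict stands in for a DataFrame to keep the sketch dependency-free.
    return {"data": X_out, "index": index, "columns": get_feature_names_out()}


calls = []


def names():
    """Record each invocation so laziness can be observed."""
    calls.append(1)
    return ["x0", "x1"]


default_out = make_array_out([[1, 2]], index=[0], get_feature_names_out=names)
calls_after_default = len(calls)
pandas_out = make_array_out(
    [[1, 2]], index=[0], get_feature_names_out=names, array_out="pandas"
)
```

With `index` and `get_feature_names_out` after the `*`, passing them positionally raises a `TypeError`, so call sites cannot swap them by accident.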

adrinjalali · 2021-05-31T12:24:04Z

sklearn/utils/_array_out.py

+
+def _make_array_out(X_out, index, get_feature_names_out, *,
+                    array_out="default"):
+    """Construct array container based on global configuration.


Suggested change

"""Construct array container based on global configuration.

"""Construct array container based on the value of `array_out`.

I think?

adrinjalali · 2021-05-31T12:24:48Z

sklearn/utils/_array_out.py

+        returned. If "default", an array-like without feature names is
+        returned.


Suggested change

returned. If "default", an array-like without feature names is
returned.

returned. If "default", `X_out` is returned unmodified.

or should we actually make sure the output is not a dataframe?

adrinjalali · 2021-05-31T12:45:41Z

sklearn/utils/_array_out.py

+            if array_out == "default":
+                return X_out
+
+            estimator, X_orig = args[0], args[1]


silly question, what would these values be if the user calls est.transform(array_out='dataframe', X=X)?

tuple unpacking error? args is only est in that case, right?
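The concern can be reproduced with a small sketch. In plain Python, indexing `args[1]` when `X` was passed by keyword raises an `IndexError`, since `args` then holds only the estimator. Both wrappers below are hypothetical and return a tuple rather than a DataFrame; `bound_wrapper` shows one way to resolve `X` by name with `inspect.signature`.

```python
import functools
import inspect


def naive_wrapper(transform):
    """Reads the estimator and X positionally, like the snippet under review."""
    @functools.wraps(transform)
    def wrapped(*args, array_out="default", **kwargs):
        X_out = transform(*args, **kwargs)
        if array_out == "default":
            return X_out
        estimator, X_orig = args[0], args[1]  # fails if X was a keyword
        return (array_out, X_orig, X_out)
    return wrapped


def bound_wrapper(transform):
    """Resolves X by name, so positional and keyword calls behave the same."""
    sig = inspect.signature(transform)

    @functools.wraps(transform)
    def wrapped(*args, array_out="default", **kwargs):
        X_out = transform(*args, **kwargs)
        if array_out == "default":
            return X_out
        X_orig = sig.bind(*args, **kwargs).arguments["X"]
        return (array_out, X_orig, X_out)
    return wrapped


class Naive:
    @naive_wrapper
    def transform(self, X):
        return [2 * x for x in X]


class Bound:
    @bound_wrapper
    def transform(self, X):
        return [2 * x for x in X]
```

`Naive().transform(array_out="pandas", X=[1, 2])` raises `IndexError`, while the same call on `Bound()` succeeds.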

thomasjpfan · 2021-06-04T01:07:40Z

From the dev meeting: We discussed that we want to see how the API of this PR compares to using an __init__ parameter. We already have some __init__ parameters that change the output of transform, for example sparse_threshold in ColumnTransformer.

I'll come up with this other implementation within a week and write up how the API compares to this PR when used with meta-estimators.

amueller

I was vaguely remembering having an implementation with a constructor parameter, but I think it was actually a constructor parameter that gets the input feature names, not whether to produce pandas dataframes lol... there have been sooo many iterations of this now. This looks good though.

amueller · 2021-06-11T00:49:48Z

doc/whats_new/v1.0.rst

@@ -438,6 +438,10 @@ Changelog
 - |Feature| :class:`preprocessing.OrdinalEncoder` supports passing through
  missing values by default. :pr:`19069` by `Thomas Fan`_.

+- |Feature| Transformers in the :mod:`sklearn.preprocessing` have a `array_out`


is array_out a good name? that seems to imply... arrays. output_format? or output?

I'd be happy with either output or output_format

amueller · 2021-06-11T00:53:36Z

sklearn/base.py

@@ -706,10 +740,20 @@ def fit_transform(self, X, y=None, **fit_params):
        # method is possible for a given clustering algorithm
        if y is None:
            # fit method of arity 1 (unsupervised transformation)
-            return self.fit(X, **fit_params).transform(X)
+            fitted = self.fit(X, **fit_params)


Yes that also confused me. I think it's always self, right? Or at least should be.

amueller · 2021-06-11T00:53:53Z

sklearn/base.py

+            return fitted.transform(X)
+
+        # array_out != "default"
+        transform_params = inspect.signature(fitted.transform).parameters


Can we not just pass it, or is the error message not good in that case?

do you mean "can we just pass it"? If we don't pass it when it doesn't exist, it's a silent bug, and I think just passing it gives an error message which can be confusing to many people.
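The trade-off in this thread can be sketched as follows: inspect the signature of `transform` and raise an explicit error when `array_out` is requested but unsupported, instead of dropping it silently or relying on Python's generic unexpected-keyword `TypeError`. `call_transform`, `Legacy` and `Aware` are hypothetical names, and the error wording is an assumption.

```python
import inspect


def call_transform(estimator, X, array_out="default"):
    """Hypothetical helper: forward array_out only when transform accepts it."""
    if array_out == "default":
        return estimator.transform(X)
    params = inspect.signature(estimator.transform).parameters
    if "array_out" not in params:
        raise ValueError(
            f"{type(estimator).__name__}.transform does not support "
            f"array_out={array_out!r}"
        )
    return estimator.transform(X, array_out=array_out)


class Legacy:
    """Transformer whose transform predates array_out."""

    def transform(self, X):
        return X


class Aware:
    """Transformer whose transform accepts array_out."""

    def transform(self, X, array_out="default"):
        return (array_out, X)
```

Here `call_transform(Legacy(), X, array_out="pandas")` fails with a message naming the estimator, while the default path never inspects the signature.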

amueller · 2021-06-11T00:58:10Z

sklearn/utils/_array_out.py

+            if array_out == "default":
+                return X_out
+
+            estimator, X_orig = args[0], args[1]


tuple unpacking error? args is only est in that case, right?

thomasjpfan · 2021-06-13T00:20:34Z

should get_feature_names or get_feature_names_out be deprecated and become private?

There is still a use case for get_feature_names when the output is a sparse matrix, because the roundtrip between a scipy sparse matrix and a pandas DataFrame still has significant overhead.

get_feature_names means we do not need to introduce a new API to get the output feature names:

poly = PolynomialFeatures().fit(X_df)

# Uses `feature_names_in_` by default as input
poly.get_feature_names()

# without `get_feature_names` a user would need to pass in some dummy data
# to get the feature names:
poly.transform(X_df).columns

# Another API would be to store `feature_names_out_`, which can be a property
# so we do not need to store another array of strings:
poly.feature_names_out_

I prefer a feature_names_out_ property, similar to n_features_out_ in SLEP013.
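A minimal sketch of the property option, assuming a hypothetical transformer (`PairwiseProducts` is not a scikit-learn class, and its naming scheme is invented for illustration). The property delegates to `get_feature_names`, so no second list of strings is stored on the instance.

```python
class PairwiseProducts:
    """Hypothetical transformer, not a scikit-learn class."""

    def fit(self, feature_names_in):
        self.feature_names_in_ = list(feature_names_in)
        return self

    def get_feature_names(self):
        # One name per unordered pair of input features, including squares.
        names = self.feature_names_in_
        return [f"{a} {b}" for i, a in enumerate(names) for b in names[i:]]

    @property
    def feature_names_out_(self):
        # Computed on access rather than stored during fit.
        return self.get_feature_names()


est = PairwiseProducts().fit(["a", "b"])
names_out = est.feature_names_out_
```

Either spelling avoids passing dummy data through `transform` just to read the output columns.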

amueller · 2021-06-26T18:01:58Z

I don't have a strong opinion between poly.get_feature_names() and poly.feature_names_out_, any way forward would be good ;)

misclick lol

lorentzenchr · 2022-09-16T07:55:26Z

This is superseded by #23734, right? Can we close then?

thomasjpfan · 2022-09-16T14:38:27Z

Yes this has been superseded by #23734 and we can close.

ENH Adds array_out=pandas to transformers in preprocessing module

b9281b7

github-actions bot added module:preprocessing module:utils labels May 16, 2021

thomasjpfan mentioned this pull request May 16, 2021

[WIP] Feature names with pandas or xarray data structures #16772

Closed

thomasjpfan added 2 commits May 18, 2021 23:05

Merge remote-tracking branch 'upstream/main' into array_out_transform

e6e8007

DOC Adds whats new

737e149

adrinjalali reviewed May 31, 2021

amueller reviewed Jun 11, 2021

thomasjpfan mentioned this pull request Jun 13, 2021

API options for Pandas output #20258

Closed

amueller previously approved these changes Jul 9, 2021

thomasjpfan mentioned this pull request Oct 31, 2021

ColumnTransformers should use get_feature_names_out() when columns attribute is not available #21452

Open

thomasjpfan closed this Sep 16, 2022

lorentzenchr added the Superseded (PR has been replaced by a newer PR) label Sep 16, 2022
