ColumnTransformer get_feature_names on more transformers #18993

swight-prc · 2020-12-11T17:03:22Z

Describe the workflow you want to enable

It would be nice if ColumnTransformer.get_feature_names also worked for transformers that don't implement get_feature_names, and used the full get_feature_names API (including input_features) for transformers that do implement it.

Describe your proposed solution

Current:

    def get_feature_names(self):
        """Get feature names from all transformers.
        Returns
        -------
        feature_names : list of strings
            Names of the features produced by transform.
        """
        check_is_fitted(self)
        feature_names = []
        for name, trans, column, _ in self._iter(fitted=True):
            if trans == 'drop' or (
                    hasattr(column, '__len__') and not len(column)):
                continue
            if trans == 'passthrough':
                if hasattr(self, '_df_columns'):
                    if ((not isinstance(column, slice))
                            and all(isinstance(col, str) for col in column)):
                        feature_names.extend(column)
                    else:
                        feature_names.extend(self._df_columns[column])
                else:
                    indices = np.arange(self._n_features)
                    feature_names.extend(['x%d' % i for i in indices[column]])
                continue
            if not hasattr(trans, 'get_feature_names'):
                raise AttributeError("Transformer %s (type %s) does not "
                                     "provide get_feature_names."
                                     % (str(name), type(trans).__name__))
            feature_names.extend([name + "__" + f for f in
                                  trans.get_feature_names()])
        return feature_names

If a transformer does not implement get_feature_names, it simply raises an error.

If a transformer DOES implement get_feature_names, the ColumnTransformer calls it without arguments and so ignores part of that API: the fitted column names are never passed in, and the transformer falls back to integer column indices instead.
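To make the second point concrete, here is a minimal sketch. ToyEncoder is a hypothetical stand-in, not a scikit-learn class; it only assumes a get_feature_names method with an optional input_features argument, and the column names are made up for illustration.

```python
class ToyEncoder:
    """Hypothetical stand-in for an encoder with an optional input_features argument."""

    def __init__(self, categories):
        # categories: one list of category values per input column
        self.categories = categories

    def get_feature_names(self, input_features=None):
        if input_features is None:
            # No names supplied: fall back to positional placeholders x0, x1, ...
            input_features = ['x%d' % i for i in range(len(self.categories))]
        return [
            '%s_%s' % (col, cat)
            for col, cats in zip(input_features, self.categories)
            for cat in cats
        ]


enc = ToyEncoder([['red', 'blue'], ['S', 'M']])

# What the current code does: no arguments, so the names are index-based.
without_names = ['cat__' + f for f in enc.get_feature_names()]
print(without_names)  # ['cat__x0_red', 'cat__x0_blue', 'cat__x1_S', 'cat__x1_M']

# What the proposal aims for: pass the columns the transformer was fitted on.
with_names = ['cat__' + f for f in enc.get_feature_names(['color', 'size'])]
print(with_names)  # ['cat__color_red', 'cat__color_blue', 'cat__size_S', 'cat__size_M']
```

The second list is the kind of output the proposal below is trying to get from the ColumnTransformer.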

Proposed Solution:

    def get_feature_names(self):
        """Get feature names from all transformers.
        Returns
        -------
        feature_names : list of strings
            Names of the features produced by transform.
        """
        # Local imports so the method can be tried out standalone;
        # inspect is needed by Section B below.
        import inspect
        from sklearn.utils.validation import check_is_fitted
        from numpy import arange
        check_is_fitted(self)
        feature_names = []
        for name, trans, column, _ in self._iter(fitted=True):
            if trans == 'drop' or (
                    hasattr(column, '__len__') and not len(column)):
                continue
            if trans == 'passthrough':
                if hasattr(self, '_df_columns'):
                    if ((not isinstance(column, slice))
                            and all(isinstance(col, str) for col in column)):
                        feature_names.extend(column)
                    else:
                        feature_names.extend(self._df_columns[column])
                else:
                    indices = arange(self._n_features)
                    feature_names.extend(['x%d' % i for i in indices[column]])
                continue
            if not hasattr(trans, 'get_feature_names'):
                # ADDED SECTION A
                if hasattr(self, '_df_columns'):
                    if ((not isinstance(column, slice))
                            and all(isinstance(col, str) for col in column)):
                        feature_names.extend(f'{name}_{col}' for col in column)
                    else:
                        feature_names.extend(
                            f'{name}_{col}'
                            for col in self._df_columns[column]
                            )
                else:
                    indices = arange(self._n_features)
                    feature_names.extend(['x%d' % i for i in indices[column]])
                continue
                # END SECTION A
            # ADDED SECTION B
            gfn_args = inspect.getfullargspec(trans.get_feature_names).args
            args_to_send = []
            if ('input_features' in gfn_args) and \
                    not isinstance(column, slice):
                args_to_send = [column]
            feature_names.extend([name + "__" + f for f in
                                  trans.get_feature_names(*args_to_send)])
            # END SECTION B
        return feature_names

Section A adds:

<<transformer name>>_<<column>> for each transformer that doesn't implement get_feature_names

Section A removes:

Raising an error

Section B adds:

<<transformer name>>__<<output of get_feature_names>> for the transformer by sending in the column names it received at fit.

So, if the transformer doesn't implement get_feature_names, we return either the prefixed column names (assuming a 1:1 transformation) or index-based placeholders (x0, x1, ...) when no DataFrame column names are available.
If the transformer DOES implement get_feature_names, we try to pass in the original feature names it was fitted on, and use them to get more descriptive feature names from each transformer.
If that isn't possible (the method does not accept input_features, or the columns were given as a slice), we fall back to the original behavior.
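The decision logic above can be sketched outside of ColumnTransformer. This is only an illustration under stated assumptions: names_for is a hypothetical helper, the three toy classes are not scikit-learn transformers, and the columns are assumed to be a plain list of string names (no slices, no integer indices).

```python
import inspect


def names_for(name, trans, columns):
    """Hypothetical helper mirroring Sections A and B for one transformer.

    `columns` is assumed to be a list of string column names.
    """
    if not hasattr(trans, 'get_feature_names'):
        # Section A: assume a 1:1 transformation and reuse the input names.
        return ['%s_%s' % (name, col) for col in columns]
    # Section B: pass the column names only if the method accepts them.
    accepted = inspect.getfullargspec(trans.get_feature_names).args
    args_to_send = [columns] if 'input_features' in accepted else []
    return [name + '__' + f for f in trans.get_feature_names(*args_to_send)]


class Scaler:
    """Toy transformer without get_feature_names."""


class WordCounter:
    """Toy transformer whose get_feature_names takes no arguments."""

    def get_feature_names(self):
        return ['n_words']


class Squarer:
    """Toy transformer that accepts input_features."""

    def get_feature_names(self, input_features=None):
        return ['%s^2' % c for c in input_features]


cols = ['age', 'income']
results = {
    'scale': names_for('scale', Scaler(), cols),
    'count': names_for('count', WordCounter(), cols),
    'square': names_for('square', Squarer(), cols),
}
print(results)
```

Note that the Section A branch uses a single underscore while the Section B branch uses a double underscore, exactly as in the proposed code; whether that difference is desirable is open for discussion.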

Describe alternatives you've considered, if relevant

The alternative is to keep the current behavior. But I think this is a valuable addition.

Additional context

I know I haven't considered every eventuality, which is why there is not a pull request associated with this feature request. But I do think I'm close, and I would welcome any input.


glemaitre · 2020-12-14T09:07:37Z

This feature is going to be discussed and implemented within a SLEP: scikit-learn/enhancement_proposals#48
We want to have a consistent API for that matter. I am closing this issue since this was already discussed and this is a duplicate.

swight-prc · 2020-12-14T13:55:15Z

Oh, good. I searched but didn't find it.

Glad to hear it's under consideration!

glemaitre · 2020-12-14T13:57:49Z

It takes some time, but it should hopefully get there in the next release :)

swight-prc added the New Feature label Dec 11, 2020

glemaitre closed this as completed Dec 14, 2020
