ColumnTransformer get_feature_names on more transformers #18993

Closed
swight-prc opened this issue Dec 11, 2020 · 3 comments

swight-prc commented Dec 11, 2020

Describe the workflow you want to enable

It would be nice if ColumnTransformer.get_feature_names also worked for transformers that don't implement get_feature_names, and used the full get_feature_names API of the transformers that do implement it.

Describe your proposed solution

Current:
    def get_feature_names(self):
        """Get feature names from all transformers.
        Returns
        -------
        feature_names : list of strings
            Names of the features produced by transform.
        """
        check_is_fitted(self)
        feature_names = []
        for name, trans, column, _ in self._iter(fitted=True):
            if trans == 'drop' or (
                    hasattr(column, '__len__') and not len(column)):
                continue
            if trans == 'passthrough':
                if hasattr(self, '_df_columns'):
                    if ((not isinstance(column, slice))
                            and all(isinstance(col, str) for col in column)):
                        feature_names.extend(column)
                    else:
                        feature_names.extend(self._df_columns[column])
                else:
                    indices = np.arange(self._n_features)
                    feature_names.extend(['x%d' % i for i in indices[column]])
                continue
            if not hasattr(trans, 'get_feature_names'):
                raise AttributeError("Transformer %s (type %s) does not "
                                     "provide get_feature_names."
                                     % (str(name), type(trans).__name__))
            feature_names.extend([name + "__" + f for f in
                                  trans.get_feature_names()])
        return feature_names

If a transformer does not implement get_feature_names, the ColumnTransformer simply raises an AttributeError.

If a transformer DOES implement get_feature_names, the ColumnTransformer calls it with no arguments, so a transformer that accepts input_features falls back to integer-based placeholder names instead of the column names it was fitted on.
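To make the second point concrete, here is a minimal sketch of the naming difference. `ToyOneHot` is a hypothetical stand-in written for this illustration (it is not a scikit-learn class); it only mimics an encoder whose get_feature_names takes an optional input_features argument and falls back to x0, x1, ... when none is given.

```python
class ToyOneHot:
    """Hypothetical stand-in for an encoder whose get_feature_names
    accepts optional input_features (not a scikit-learn class)."""

    def __init__(self, categories):
        # categories: one list of category values per input column
        self.categories = categories

    def get_feature_names(self, input_features=None):
        if input_features is None:
            # No names supplied: fall back to positional placeholders.
            input_features = [f"x{i}" for i in range(len(self.categories))]
        return [
            f"{col}_{cat}"
            for col, cats in zip(input_features, self.categories)
            for cat in cats
        ]


enc = ToyOneHot([["red", "blue"], ["S", "M"]])

# Current behavior: get_feature_names is called with no arguments.
default_names = ["onehot__" + f for f in enc.get_feature_names()]

# Proposed behavior: forward the column names seen at fit time.
named = ["onehot__" + f for f in enc.get_feature_names(["color", "size"])]

print(default_names)
print(named)
```

The first list contains names built from x0 and x1, while the second is built from color and size, which is the information the current implementation drops.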

Proposed Solution:
    def get_feature_names(self):
        """Get feature names from all transformers.
        Returns
        -------
        feature_names : list of strings
            Names of the features produced by transform.
        """
        # Local imports keep this snippet self-contained; inspect is
        # needed by Section B below.
        import inspect
        from numpy import arange
        from sklearn.utils.validation import check_is_fitted
        check_is_fitted(self)
        feature_names = []
        for name, trans, column, _ in self._iter(fitted=True):
            if trans == 'drop' or (
                    hasattr(column, '__len__') and not len(column)):
                continue
            if trans == 'passthrough':
                if hasattr(self, '_df_columns'):
                    if ((not isinstance(column, slice))
                            and all(isinstance(col, str) for col in column)):
                        feature_names.extend(column)
                    else:
                        feature_names.extend(self._df_columns[column])
                else:
                    indices = arange(self._n_features)
                    feature_names.extend(['x%d' % i for i in indices[column]])
                continue
            if not hasattr(trans, 'get_feature_names'):
                # ADDED SECTION A
                if hasattr(self, '_df_columns'):
                    if ((not isinstance(column, slice))
                            and all(isinstance(col, str) for col in column)):
                        feature_names.extend(f'{name}_{col}' for col in column)
                    else:
                        feature_names.extend(
                            f'{name}_{col}'
                            for col in self._df_columns[column]
                            )
                else:
                    indices = arange(self._n_features)
                    feature_names.extend(['x%d' % i for i in indices[column]])
                continue
                # END SECTION A
            # ADDED SECTION B
            gfn_args = inspect.getfullargspec(trans.get_feature_names).args
            args_to_send = []
            if ('input_features' in gfn_args) and \
                    not isinstance(column, slice):
                args_to_send = [column]
            feature_names.extend([name + "__" + f for f in
                                  trans.get_feature_names(*args_to_send)])
            # END SECTION B
        return feature_names
Section A adds:

<<transformer name>>_<<column>> for each transformer that doesn't implement get_feature_names

Section A removes:

Raising an error

Section B adds:

<<transformer name>>__<<output of get_feature_names>> for the transformer by sending in the column names it received at fit.

So, if the transformer doesn't implement get_feature_names, we return either the transformer name joined to each input column name (accurate only for a 1:1 transformation) or, when no column names are available, an integer-based name.
If the transformer DOES implement get_feature_names, we try to pass in the original feature names it was fitted on, and use them to get more descriptive feature names from each transformer.
If that isn't possible, we fall back to the original behavior.
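The decision logic of Sections A and B can also be sketched outside of ColumnTransformer. This is only an illustration under stated assumptions: `names_for_transformer`, `accepts_input_features`, and the three toy classes are hypothetical names introduced here, `columns` is assumed to be a list of column-name strings, and `inspect.signature` is used in place of `inspect.getfullargspec` (both are in the standard library).

```python
import inspect


def accepts_input_features(trans):
    """Return True if trans.get_feature_names takes an input_features argument."""
    params = inspect.signature(trans.get_feature_names).parameters
    return "input_features" in params


def names_for_transformer(name, trans, columns):
    """Sketch of Sections A and B for one (name, transformer, columns) triple.

    `columns` is assumed to be a list of column-name strings.
    """
    if not hasattr(trans, "get_feature_names"):
        # Section A: assume a 1:1 transformation and reuse the input names.
        return [f"{name}_{col}" for col in columns]
    # Section B: forward the column names only when the method accepts them.
    args = [columns] if accepts_input_features(trans) else []
    return [f"{name}__{f}" for f in trans.get_feature_names(*args)]


class NoNames:
    """Hypothetical transformer without get_feature_names."""


class FixedNames:
    """Hypothetical transformer whose get_feature_names takes no arguments."""

    def get_feature_names(self):
        return ["a", "b"]


class PrefixedNames:
    """Hypothetical transformer whose get_feature_names accepts input_features."""

    def get_feature_names(self, input_features=None):
        return [f"{c}_scaled" for c in (input_features or ["x0"])]


print(names_for_transformer("scale", NoNames(), ["age", "income"]))
print(names_for_transformer("vec", FixedNames(), ["text"]))
print(names_for_transformer("p", PrefixedNames(), ["age"]))
```

Note that the Section A branch is only correct when the transformer emits exactly one output column per input column; a transformer that changes the number of columns would get misleading names.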

Describe alternatives you've considered, if relevant

The alternative is to keep the current behavior. But I think this is a valuable addition.

Additional context

I know I haven't considered every eventuality, which is why there is not a pull request associated with this feature request. But I do think I'm close, and I would welcome any input.

@glemaitre
Member

This feature is going to be discussed and implemented within a SLEP: scikit-learn/enhancement_proposals#48
We want to have a consistent API for this. I am closing this issue since it was already discussed and is a duplicate.

@swight-prc
Author

Oh, good. I searched but didn't find it.

Glad to hear it's under consideration!

@glemaitre
Member

glemaitre commented Dec 14, 2020 via email
