SLEP015: Feature Names Propagation #48

Merged
1 change: 1 addition & 0 deletions index.rst
@@ -12,6 +12,7 @@
slep007/proposal
slep012/proposal
slep013/proposal
slep015/proposal

.. toctree::
:maxdepth: 1
179 changes: 179 additions & 0 deletions slep015/proposal.rst
@@ -0,0 +1,179 @@
.. _slep_015:

==================================
SLEP015: Feature Names Propagation
==================================

:Author: Thomas J Fan
:Status: Draft
:Type: Standards Track
:Created: 2020-10-03

Abstract
########

This SLEP proposes adding the ``feature_names_in_`` attribute for all estimators
and the ``get_feature_names_out`` method to all transformers.

Motivation
##########

``scikit-learn`` is commonly used as a part of a larger data processing
pipeline. When this pipeline is used to transform data, the result is a
NumPy array, discarding column names. The current workflow for
extracting the feature names requires calling ``get_feature_names`` on the
transformer that created the feature. This interface can be cumbersome when used
together with a pipeline with multiple column names::
Member

Suggested change
- together with a pipeline with multiple column names::
+ together with a Pipeline with multiple column names::


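    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    X = pd.DataFrame({'letter': ['a', 'b', 'c'],
                      'pet': ['dog', 'snake', 'dog'],
                      'distance': [1, 2, 3]})
    y = [0, 0, 1]
    orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['distance']

    ct = ColumnTransformer(
        [('cat', OneHotEncoder(), orig_cat_cols),
         ('num', StandardScaler(), orig_num_cols)])
    pipe = make_pipeline(ct, LogisticRegression()).fit(X, y)

    # Fitted transformers are keyed by the names given to ColumnTransformer.
    # ``get_feature_names`` is the method available at the time of writing.
    cat_names = (pipe['columntransformer']
                 .named_transformers_['cat']
                 .get_feature_names(orig_cat_cols))

    feature_names = np.r_[cat_names, orig_num_cols]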

The ``feature_names`` extracted above correspond to the features directly
passed into ``LogisticRegression``. As demonstrated above, the process of
extracting ``feature_names`` requires knowing the order of the selected
categories in the ``ColumnTransformer``. Furthermore, if there is feature
selection in the pipeline, such as ``SelectKBest``, the ``get_support`` method
would need to be used to select column names that were selected.
Member

Suggested change
- would need to be used to select column names that were selected.
+ would need to be used to infer the column names that were selected.


Solution
########

This SLEP proposes adding the ``feature_names_in_`` attribute to all estimators
that will extract the feature names of ``X`` during ``fit``. This will also
be used for validation during non-``fit`` methods such as ``transform`` or
``predict``. If the ``X`` is not a recognized container, then
Member

Suggested change
- ``predict``. If the ``X`` is not a recognized container, then
+ ``predict``. If the ``X`` is not a recognized container with columns, then

``feature_names_in_`` would be set to ``None``.
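
The behaviour described above can be illustrated with a minimal, standard-library-only sketch. ``NamedTable``, ``get_names`` and ``ToyEstimator`` are hypothetical names invented for this illustration and are not part of scikit-learn. The sketch also assumes that validation only happens when both ``fit`` and ``predict`` received named input, which the proposal text does not spell out.

```python
class NamedTable:
    """Hypothetical stand-in for a container that carries column names."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(row) for row in rows]


def get_names(X):
    """Return the column names of X, or None if X does not expose any."""
    columns = getattr(X, "columns", None)
    return None if columns is None else list(columns)


class ToyEstimator:
    """Hypothetical estimator illustrating the proposed attribute."""

    def fit(self, X, y=None):
        # Record the names seen during fit (None for unnamed containers).
        self.feature_names_in_ = get_names(X)
        return self

    def predict(self, X):
        # Assumption: validate only when both fit and predict saw names.
        names = get_names(X)
        if self.feature_names_in_ is not None and names is not None:
            if names != self.feature_names_in_:
                raise ValueError(
                    f"Feature names {names} do not match those seen in "
                    f"fit: {self.feature_names_in_}"
                )
        rows = X.rows if isinstance(X, NamedTable) else X
        return [0 for _ in rows]  # placeholder prediction


named = NamedTable(["letter", "pet"], [["a", "dog"], ["b", "snake"]])
est = ToyEstimator().fit(named)
print(est.feature_names_in_)  # ['letter', 'pet']

unnamed = ToyEstimator().fit([[1, 2], [3, 4]])
print(unnamed.feature_names_in_)  # None
```

Calling ``est.predict`` with a ``NamedTable`` whose columns are reordered raises ``ValueError`` in this sketch, which is the kind of consistency check the attribute is meant to enable.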

Secondly, this SLEP proposes adding ``get_feature_names_out(input_names=None)``
to all transformers. By default, the input features will be determined by the
``feature_names_in_`` attribute. The feature names of a pipeline can then be
easily extracted as follows::

    pipe[:-1].get_feature_names_out()
Member

and maybe mention?

pipe[-1].feature_names_in_

ogrisel marked this conversation as resolved.
    # ['cat__letter_a', 'cat__letter_b', 'cat__letter_c',
    #  'cat__pet_dog', 'cat__pet_snake', 'num__distance']

Note that ``get_feature_names_out`` does not require ``input_names``
because the feature names were stored in the pipeline itself. These
names are passed to each step's ``get_feature_names_out`` method in turn to
obtain the output feature names of the ``Pipeline`` itself.
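
The chaining described above can be sketched with toy classes that track names only and never touch data. ``AddPrefix``, ``KeepFirst`` and ``ToyPipeline`` are hypothetical and exist only for this illustration. The ``input_names`` parameter follows the signature proposed in this SLEP, and the pipeline here contains transformer steps only, which corresponds to ``pipe[:-1]`` in the example above.

```python
class AddPrefix:
    """Hypothetical transformer: one output per input, with a prefix."""

    def __init__(self, prefix):
        self.prefix = prefix

    def get_feature_names_out(self, input_names):
        return [f"{self.prefix}__{name}" for name in input_names]


class KeepFirst:
    """Hypothetical selector that keeps only the first k features."""

    def __init__(self, k):
        self.k = k

    def get_feature_names_out(self, input_names):
        return list(input_names)[: self.k]


class ToyPipeline:
    """Hypothetical pipeline that tracks feature names only, not data."""

    def __init__(self, steps):
        self.steps = list(steps)

    def fit(self, feature_names):
        # A real pipeline would read the names from X during fit.
        self.feature_names_in_ = list(feature_names)
        return self

    def get_feature_names_out(self, input_names=None):
        # Default to the names stored on the pipeline during fit.
        if input_names is None:
            names = self.feature_names_in_
        else:
            names = list(input_names)
        # Each step maps the names it receives to the names it emits.
        for step in self.steps:
            names = step.get_feature_names_out(names)
        return names


pipe = ToyPipeline([AddPrefix("num"), KeepFirst(2)]).fit(["a", "b", "c"])
print(pipe.get_feature_names_out())            # ['num__a', 'num__b']
print(pipe.get_feature_names_out(["x", "y"]))  # ['num__x', 'num__y']
```

Only the pipeline stores the input names in this sketch; the individual steps receive them as arguments, which mirrors the point that names are not stored on every step.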

Enabling Functionality
######################

The following enhancements are **not** a part of this SLEP. These features are
made possible if this SLEP gets accepted.

1. As an alternative to slicing, we can add a
``Pipeline.get_feature_names_in_at`` method to get the names at a specific
Member

I find this name unpleasant, and don't see what's so much better than Pipeline[-1].feature_names_in_

Member Author

This will not work by default because the final step would not have the feature names if there is more than one step:

pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
# pipe1[-1].feature_names_in_ does not exist

pipe2 = make_pipeline(LogisticRegression())
# pipe2[-1].feature_names_in_ does exist

This proposal does not actually pass through the names at each step. Only the pipeline and the first step will have access to the input names.

I'll remove this point to make this SLEP shorter.

step. This can be a simple alternative to slicing::

    pipe.get_feature_names_in_at(-1)

2. This SLEP enables us to implement an ``array_out`` keyword argument to
all ``transform`` methods to specify the array container outputted by
``transform``. An implementation of ``array_out`` requires
``feature_names_in_`` to validate that the names in ``fit`` and
``transform`` are consistent. With the implementation of ``array_out`` needs
Member

Suggested change
- ``transform`` are consistent. With the implementation of ``array_out`` needs
+ ``transform`` are consistent. An implementation of ``array_out`` needs

a way to map from the input feature names to output feature names, which is
provided by ``get_feature_names_out``.

3. An alternative to ``array_out``: Transformers in a pipeline may wish to
receive ``X`` in a container that carries feature names. This can be enabled
by adding an ``array_input`` parameter to ``Pipeline``::

    pipe = make_pipeline(ct, MyTransformer(), LogisticRegression(),
                         array_input='pandas')

In this case, the pipeline will construct a pandas DataFrame to pass
to ``MyTransformer`` and ``LogisticRegression``. The feature names
will be constructed by calling ``get_feature_names_out`` as data is passed
through the ``Pipeline``. This feature implies that ``Pipeline`` is
responsible for constructing the DataFrame.

Considerations
##############

1. The names returned by ``get_feature_names_out`` will be constructed using
the name generation specification from :ref:`slep_007`.

2. For a ``Pipeline`` with only one estimator, slicing will not work and one
Member

I find this confusing. You're saying slicing will not work, but then showing an example with slicing? Or are you distinguishing slicing from indexing. What does slicing have to do with anything anyway??

Member Author

I wanted to distinguish between the following two pipelines:

pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
pipe1[:-1].get_feature_names_out()  # this works

pipe2 = make_pipeline(LogisticRegression())
pipe2[:-1].get_feature_names_out()  # does not work

pipe2[:-1] fails because the slicing will produce a pipeline with no steps. Although, we can allow pipelines with no steps to get pipe2[:-1].get_feature_names_out() to work.

Member

Okay, I agree this is a strange corner case, since it is the only way to construct a fitted empty Pipeline...

would need to access the feature names directly::

    pipe = make_pipeline(LogisticRegression())
    pipe[-1].feature_names_in_

Backward compatibility
######################

1. This SLEP is fully backward compatible with previous versions. With the
introduction of ``get_feature_names_out``, ``get_feature_names`` will
be deprecated.

2. The inclusion of a ``get_feature_names_out`` method will not introduce any
overhead to estimators.

3. The inclusion of a ``feature_names_in_`` attribute will increase the size of
estimators because they would store the feature names.

Community Adoption
##################

We can enforce the ``feature_names_in_`` attribute and
``get_feature_names_out`` method with additional tests to
``check_estimator``.
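
As a rough idea of what such a test could verify, here is a minimal sketch using only the standard library. ``check_feature_names_contract``, ``NamedTable`` and ``ToyTransformer`` are hypothetical names made up for this illustration. This is not an existing check in scikit-learn, and the real checks may be structured quite differently.

```python
class NamedTable:
    """Hypothetical stand-in for a container that carries column names."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(row) for row in rows]


class ToyTransformer:
    """Hypothetical transformer that follows the proposed contract."""

    def fit(self, X, y=None):
        self.feature_names_in_ = list(X.columns)
        return self

    def get_feature_names_out(self, input_names=None):
        if input_names is None:
            input_names = self.feature_names_in_
        return [f"toy__{name}" for name in input_names]


def check_feature_names_contract(transformer, X):
    """Hypothetical check for the attribute and method proposed here."""
    transformer.fit(X)

    # The attribute must exist after fit and match the input columns.
    assert hasattr(transformer, "feature_names_in_"), (
        "feature_names_in_ is missing after fit"
    )
    assert list(transformer.feature_names_in_) == list(X.columns)

    # The method must exist and return a sequence of strings.
    names_out = list(transformer.get_feature_names_out())
    assert all(isinstance(name, str) for name in names_out)
    return names_out


table = NamedTable(["a", "b"], [[1, 2], [3, 4]])
print(check_feature_names_contract(ToyTransformer(), table))
# ['toy__a', 'toy__b']
```

A transformer that does not set ``feature_names_in_`` during ``fit`` fails the first assertion in this sketch.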

Alternatives
############

There have been many attempts to address this issue:

1. An ``array_out`` keyword parameter in ``transform``: This approach requires
third party estimators to unwrap and wrap array containers in transform,
which introduces more burden for third party estimator maintainers.
Furthermore, ``array_out`` with sparse data will introduce an overhead when
being passed along in a ``Pipeline``. This overhead comes from the
construction of the sparse data container that has the feature names.

2. :ref:`slep_007` : ``SLEP007`` introduces a ``feature_names_out_`` attribute
while this SLEP proposes a ``get_feature_names_out`` method to accomplish
the same task. The benefit of the ``get_feature_names_out`` method is that
it can be used even if the feature names were not passed in ``fit`` with a
dataframe. For example, in a ``Pipeline`` the feature names are not passed
through to each step and a ``get_feature_names_out`` method can be used to
get the names of each step with slicing.

3. :ref:`slep_012` : The ``InputArray`` was developed to work around the
overhead of using a pandas ``DataFrame`` or an xarray ``DataArray``. The
introduction of another data structure into the Python data ecosystem would
place more burden on third party estimator maintainers.


References and Footnotes
########################

.. [1] Each SLEP must either be explicitly labeled as placed in the public
domain (see this SLEP as an example) or licensed under the `Open
Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
#########

This document has been placed in the public domain. [1]_