SLEP015: Feature Names Propagation #48

Merged
1 change: 1 addition & 0 deletions index.rst
@@ -40,6 +40,7 @@
   :caption: Rejected

   slep014/proposal
   slep015/proposal

.. toctree::
   :maxdepth: 1
191 changes: 191 additions & 0 deletions slep015/proposal.rst
@@ -0,0 +1,191 @@
.. _slep_015:

==================================
SLEP015: Feature Names Propagation
==================================

:Author: Thomas J Fan
:Status: Rejected
:Type: Standards Track
:Created: 2020-10-03

Abstract
########

This SLEP proposes adding the ``get_feature_names_out`` method to all
transformers and the ``feature_names_in_`` attribute for all estimators.
The ``feature_names_in_`` attribute is set during ``fit`` if the input, ``X``,
contains the feature names.

Motivation
##########

``scikit-learn`` is commonly used as a part of a larger data processing
pipeline. When this pipeline is used to transform data, the result is a
NumPy array, discarding column names. The current workflow for
extracting the feature names requires calling ``get_feature_names`` on the
transformer that created the feature. This interface can be cumbersome when used
together with a ``Pipeline`` with multiple column names::


  import numpy as np
  import pandas as pd
  from sklearn.compose import ColumnTransformer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import OneHotEncoder, StandardScaler

  X = pd.DataFrame({'letter': ['a', 'b', 'c'],
                    'pet': ['dog', 'snake', 'dog'],
                    'distance': [1, 2, 3]})
  y = [0, 0, 1]
  orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['distance']

  ct = ColumnTransformer(
      [('cat', OneHotEncoder(), orig_cat_cols),
       ('num', StandardScaler(), orig_num_cols)])
  pipe = make_pipeline(ct, LogisticRegression()).fit(X, y)

  # The encoder must be looked up by name to recover the generated columns.
  cat_names = (pipe['columntransformer']
               .named_transformers_['onehotencoder']
               .get_feature_names(orig_cat_cols))

  feature_names = np.r_[cat_names, orig_num_cols]

The ``feature_names`` extracted above correspond to the features passed
directly into ``LogisticRegression``. As demonstrated above, extracting
``feature_names`` requires knowing the order of the selected categories in the
``ColumnTransformer``. Furthermore, if there is feature selection in the
pipeline, such as ``SelectKBest``, the ``get_support`` method would need to be
used to infer the column names that were selected, as sketched below.
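
For illustration, a hedged sketch of that extra bookkeeping, extending the
example above with an assumed ``SelectKBest`` step (the selector and ``k=3``
are not part of the original example)::

  from sklearn.feature_selection import SelectKBest

  pipe_fs = make_pipeline(ct, SelectKBest(k=3), LogisticRegression()).fit(X, y)

  cat_names = (pipe_fs['columntransformer']
               .named_transformers_['onehotencoder']
               .get_feature_names(orig_cat_cols))
  all_names = np.r_[cat_names, orig_num_cols]

  # get_support() returns the boolean mask of the columns kept by SelectKBest,
  # which must be applied manually to recover the surviving feature names.
  feature_names = all_names[pipe_fs['selectkbest'].get_support()]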

Solution
########

This SLEP proposes adding a ``feature_names_in_`` attribute to all estimators.
It will be set during ``fit`` to the feature names of ``X`` and used for
validation in non-``fit`` methods such as ``transform`` or ``predict``. If
``X`` is not a recognized container with column names, ``feature_names_in_``
may be left undefined, in which case it will not be validated.
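
A minimal sketch of the proposed behaviour, assuming ``StandardScaler`` adopts
the attribute (illustrative only, not a description of any released
implementation)::

  import pandas as pd
  from sklearn.preprocessing import StandardScaler

  X = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]})
  scaler = StandardScaler().fit(X)

  scaler.feature_names_in_   # e.g. array(['a', 'b'], dtype=object)

  # Renaming a column between fit and transform would fail validation under
  # this proposal, because the names no longer match ``feature_names_in_``.
  scaler.transform(X.rename(columns={'b': 'c'}))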

Secondly, this SLEP proposes adding ``get_feature_names_out(input_features=None)``
to all transformers. By default, the input features will be determined by the
``feature_names_in_`` attribute. The feature names of a pipeline can then be
easily extracted as follows::

  pipe[:-1].get_feature_names_out()
  # ['cat__letter_a', 'cat__letter_b', 'cat__letter_c',
  #  'cat__pet_dog', 'cat__pet_snake', 'num__distance']

Note that ``get_feature_names_out`` does not require ``input_features`` here
because the feature names are stored in the pipeline itself. These names will
be passed to each step's ``get_feature_names_out`` method to obtain the output
feature names of the ``Pipeline`` as a whole.
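
As a rough sketch of that chaining, under the assumption that ``Pipeline``
implements it roughly as follows (not a definitive implementation)::

  # Propagate the names recorded during ``fit`` through each transformer.
  names = pipe.feature_names_in_
  for _, step in pipe.steps[:-1]:
      names = step.get_feature_names_out(names)

  # ``names`` now equals pipe[:-1].get_feature_names_out() and also matches
  # pipe[-1].feature_names_in_, the input names recorded by the final estimator.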

Enabling Functionality
######################

The following enhancements are **not** a part of this SLEP. They would become
possible if this SLEP were accepted.

1. This SLEP enables us to implement an ``array_out`` keyword argument for
all ``transform`` methods to specify the array container returned by
``transform`` (see the sketch after this list). An implementation of
``array_out`` requires ``feature_names_in_`` to validate that the names seen
in ``fit`` and ``transform`` are consistent, and it needs a way to map from
the input feature names to the output feature names, which is provided by
``get_feature_names_out``.

2. An alternative to ``array_out``: transformers in a pipeline may wish to have
feature names passed in as ``X``. This can be enabled by adding an
``array_input`` parameter to ``Pipeline``::

  pipe = make_pipeline(ct, MyTransformer(), LogisticRegression(),
                       array_input='pandas')

In this case, the pipeline will construct a pandas DataFrame to be passed
into ``MyTransformer`` and ``LogisticRegression``. The feature names
will be constructed by calling ``get_feature_names_out`` as data is passed
through the ``Pipeline``. This implies that ``Pipeline`` itself is responsible
for constructing the DataFrame.
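
As a usage sketch of the hypothetical ``array_out`` keyword from item 1 above
(this keyword does not exist; it is shown only to illustrate the idea, reusing
``pipe`` and ``X`` from the Motivation example)::

  # Hypothetical: ask the transformer steps to return a labeled container.
  X_trans = pipe[:-1].transform(X, array_out='pandas')
  X_trans.columns
  # Index(['cat__letter_a', 'cat__letter_b', 'cat__letter_c',
  #        'cat__pet_dog', 'cat__pet_snake', 'num__distance'], dtype='object')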

Considerations and Limitations
##############################

1. The output of ``get_feature_names_out`` will be constructed using the name
generation specification from :ref:`slep_007`.

2. For a ``Pipeline`` with only one estimator, the slicing pattern shown above
will not work, and one would need to access the feature names directly::

  pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
  pipe1[:-1].feature_names_in_  # Works

  pipe2 = make_pipeline(LogisticRegression())
  pipe2[:-1].feature_names_in_  # Does not work

This is because ``pipe2[:-1]`` raises an error: it would result in a pipeline
with no steps. We can work around this by allowing pipelines with no steps.

3. ``feature_names_in_`` can be any 1d array-like of strings, such as a list
or an ndarray. (An ``ndarray`` or a ``pd.Index`` is not a ``Sequence``, which
is why "array-like" is used here rather than ``Sequence``.)

4. Meta-estimators will delegate the setting and validation of
``feature_names_in_`` to their inner estimators. A meta-estimator will
define ``feature_names_in_`` by referencing its inner estimators: for
example, ``Pipeline`` can use ``steps[0].feature_names_in_`` as
its input feature names. If the inner estimators do not define
``feature_names_in_``, then the meta-estimator will not define
``feature_names_in_`` either (see the sketch after this list).
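
A minimal sketch of that delegation (an assumption for illustration, not
scikit-learn's actual implementation)::

  class MetaEstimator:
      """Toy meta-estimator wrapping a list of (name, estimator) steps."""

      def __init__(self, steps):
          self.steps = steps

      @property
      def feature_names_in_(self):
          # Delegates to the first inner estimator; raises AttributeError if
          # that estimator never saw feature names, so the meta-estimator is
          # "undefined" in exactly the same cases.
          return self.steps[0][1].feature_names_in_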

Backward compatibility
######################

1. This SLEP is fully backward compatible with previous versions. With the
introduction of ``get_feature_names_out``, ``get_feature_names`` will
be deprecated. Note that the signature of ``get_feature_names_out`` will
always contain ``input_features``, which can be used or ignored (see the
sketch after this list). This helps standardize the interface for getting
feature names.

2. The inclusion of a ``get_feature_names_out`` method will not introduce any
overhead to estimators.

3. The inclusion of a ``feature_names_in_`` attribute will increase the size of
estimators because they would store the feature names. Users can delete the
attribute with ``del est.feature_names_in_`` if they want to drop the stored
names and disable validation.
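
A minimal sketch of what a third-party transformer might add to conform to the
proposed interface (illustrative names; only the method signature with
``input_features`` is prescribed by this SLEP)::

  import numpy as np
  from sklearn.base import BaseEstimator, TransformerMixin

  class PassthroughTransformer(TransformerMixin, BaseEstimator):
      def fit(self, X, y=None):
          # Under this proposal, ``feature_names_in_`` would be set here
          # whenever X carries column names.
          return self

      def transform(self, X):
          return np.asarray(X)

      def get_feature_names_out(self, input_features=None):
          # ``input_features`` may be used or ignored; a real implementation
          # would fall back to ``self.feature_names_in_`` when it is None.
          return np.asarray(input_features, dtype=object)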

Alternatives
############

There have been many attempts to address this issue:

1. An ``array_out`` keyword parameter in ``transform``: this approach requires
third-party estimators to unwrap and wrap array containers in ``transform``,
which places more burden on third-party estimator maintainers.
Furthermore, ``array_out`` with sparse data introduces overhead when data is
passed along a ``Pipeline``. This overhead comes from constructing the sparse
data container that carries the feature names.

2. :ref:`slep_007` : ``SLEP007`` introduces a ``feature_names_out_`` attribute,
while this SLEP proposes a ``get_feature_names_out`` method to accomplish
the same task. The benefit of the ``get_feature_names_out`` method is that
it can be used even if the feature names were not passed in ``fit`` with a
dataframe. For example, in a ``Pipeline`` the feature names are not passed
through to each step, and a ``get_feature_names_out`` method can be used
with slicing to get the names at each step.

3. :ref:`slep_012` : The ``InputArray`` was developed to work around the
overhead of using a pandas ``DataFrame`` or an xarray ``DataArray``. The
introduction of another data structure into the Python data ecosystem would
place more burden on third-party estimator maintainers.


References and Footnotes
########################

.. [1] Each SLEP must either be explicitly labeled as placed in the public
domain (see this SLEP as an example) or licensed under the `Open
Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
#########

This document has been placed in the public domain. [1]_