SLEP015: Feature Names Propagation #48
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,6 +12,7 @@ | |
slep007/proposal | ||
slep012/proposal | ||
slep013/proposal | ||
slep015/proposal | ||
|
||
.. toctree:: | ||
:maxdepth: 1 | ||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,179 @@ | ||||||
.. _slep_015: | ||||||
|
||||||
================================== | ||||||
SLEP015: Feature Names Propagation | ||||||
================================== | ||||||
|
||||||
:Author: Thomas J Fan | ||||||
:Status: Draft | ||||||
:Type: Standards Track | ||||||
:Created: 2020-10-03 | ||||||
|
||||||
Abstract | ||||||
######## | ||||||
|
||||||
This SLEP proposes adding the ``feature_names_in_`` attribute to all estimators | ||||||
and the ``get_feature_names_out`` method to all transformers. | ||||||
|
||||||
Motivation | ||||||
########## | ||||||
|
||||||
``scikit-learn`` is commonly used as a part of a larger data processing | ||||||
pipeline. When this pipeline is used to transform data, the result is a | ||||||
NumPy array, discarding column names. The current workflow for | ||||||
extracting the feature names requires calling ``get_feature_names`` on the | ||||||
transformer that created the feature. This interface can be cumbersome when a | ||||||
pipeline processes several groups of columns:: | ||||||
|
||||||
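import numpy as np | ||||||
import pandas as pd | ||||||
from sklearn.compose import ColumnTransformer | ||||||
from sklearn.linear_model import LogisticRegression | ||||||
from sklearn.pipeline import make_pipeline | ||||||
from sklearn.preprocessing import OneHotEncoder, StandardScaler | ||||||
|
||||||
X = pd.DataFrame({'letter': ['a', 'b', 'c'], | ||||||
'pet': ['dog', 'snake', 'dog'], | ||||||
'distance': [1, 2, 3]}) | ||||||
y = [0, 0, 1] | ||||||
orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['distance'] | ||||||
|
||||||
ct = ColumnTransformer( | ||||||
[('cat', OneHotEncoder(), orig_cat_cols), | ||||||
('num', StandardScaler(), orig_num_cols)]) | ||||||
pipe = make_pipeline(ct, LogisticRegression()).fit(X, y) | ||||||
|
||||||
# The encoder is registered under the name 'cat' in the ColumnTransformer. | ||||||
cat_names = (pipe['columntransformer'] | ||||||
.named_transformers_['cat'] | ||||||
.get_feature_names(orig_cat_cols)) | ||||||
|
||||||
feature_names = np.r_[cat_names, orig_num_cols] | ||||||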
|
||||||
The ``feature_names`` extracted above correspond to the features directly | ||||||
passed into ``LogisticRegression``. As demonstrated above, the process of | ||||||
extracting ``feature_names`` requires knowing the order in which the | ||||||
``ColumnTransformer`` concatenates the output of its transformers. Furthermore, | ||||||
if there is feature selection in the pipeline, such as ``SelectKBest``, the | ||||||
``get_support`` method would be needed to keep only the selected column names. | ||||||
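
The extra bookkeeping that feature selection adds can be sketched as follows. This is a minimal illustration in plain Python, not scikit-learn code: ``select_names`` is a hypothetical helper, and ``support_mask`` is a made-up stand-in for the boolean mask that a fitted selector's ``get_support()`` would return.

```python
def select_names(feature_names, support_mask):
    """Keep the names whose mask entry is True.

    Mirrors what a user has to do by hand after calling ``get_support()``
    on a fitted selector (hypothetical helper, not part of scikit-learn).
    """
    if len(feature_names) != len(support_mask):
        raise ValueError("feature_names and support_mask must have equal length")
    return [name for name, keep in zip(feature_names, support_mask) if keep]


# Names as produced by the manual extraction in the example above.
feature_names = ['letter_a', 'letter_b', 'letter_c',
                 'pet_dog', 'pet_snake', 'distance']
# Assumed mask; in practice this would come from selector.get_support().
support_mask = [True, False, False, True, False, True]

selected = select_names(feature_names, support_mask)
print(selected)  # ['letter_a', 'pet_dog', 'distance']
```

Every additional step in the pipeline adds another manual mapping of this kind, which is the burden this SLEP aims to remove.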
|
||||||
|
||||||
Solution | ||||||
######## | ||||||
|
||||||
This SLEP proposes adding the ``feature_names_in_`` attribute to all estimators. | ||||||
The attribute stores the feature names of ``X`` seen during ``fit``. It will | ||||||
also be used for validation in non-``fit`` methods such as ``transform`` or | ||||||
``predict``. If ``X`` is not a recognized container, then | ||||||
``feature_names_in_`` would be set to ``None``. | ||||||
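
The behaviour described above might look roughly like the following sketch. It uses only the standard library: ``NamedFrame`` is a hypothetical stand-in for a DataFrame-like container, "recognized container" is assumed to mean "has a ``columns`` attribute", and the choice to validate only when both ``fit`` and ``predict`` received names is an assumption, not something this SLEP specifies.

```python
class NamedFrame:
    """Tiny stand-in for a DataFrame-like container (hypothetical)."""

    def __init__(self, rows, columns):
        self.rows = [list(row) for row in rows]
        self.columns = list(columns)


def extract_feature_names(X):
    """Return the column names of X, or None if X carries no names."""
    columns = getattr(X, "columns", None)
    if columns is None:
        return None
    return [str(c) for c in columns]


def n_rows(X):
    """Number of rows for either a NamedFrame or a plain list of rows."""
    return len(getattr(X, "rows", X))


class ToyEstimator:
    """Illustrates storing names in fit and validating them in predict."""

    def fit(self, X, y=None):
        self.feature_names_in_ = extract_feature_names(X)
        return self

    def predict(self, X):
        names = extract_feature_names(X)
        # Assumption: only compare when both sides actually have names.
        if (self.feature_names_in_ is not None and names is not None
                and names != self.feature_names_in_):
            raise ValueError("feature names differ from those seen in fit")
        return [0] * n_rows(X)


train = NamedFrame([[1, 2], [3, 4]], columns=["a", "b"])
est = ToyEstimator().fit(train)
print(est.feature_names_in_)                          # ['a', 'b']
print(ToyEstimator().fit([[1, 2]]).feature_names_in_)  # None
```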
|
||||||
Secondly, this SLEP proposes adding ``get_feature_names_out(input_names=None)`` | ||||||
to all transformers. By default, the input features will be determined by the | ||||||
``feature_names_in_`` attribute. The feature names of a pipeline can then be | ||||||
easily extracted as follows:: | ||||||
|
||||||
pipe[:-1].get_feature_names_out() | ||||||
|
||||||
# ['cat__letter_a', 'cat__letter_b', 'cat__letter_c', | ||||||
#  'cat__pet_dog', 'cat__pet_snake', 'num__distance'] | ||||||
|
||||||
Note that ``get_feature_names_out`` does not require ``input_names`` | ||||||
because the feature names were stored in the pipeline itself. These | ||||||
names are passed to each step's ``get_feature_names_out`` method in turn to | ||||||
obtain the output feature names of the ``Pipeline`` itself. | ||||||
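
The chaining just described can be sketched with toy classes. Nothing here is scikit-learn code: ``ToyOneHot``, ``ToyScaler`` and ``ToyPipeline`` are hypothetical, and the ``name_category`` naming scheme is an illustrative assumption rather than the SLEP007 specification.

```python
class ToyOneHot:
    """Expands each input column into one output name per category."""

    def __init__(self, categories):
        # Mapping from column name to its categories (normally learned in fit).
        self.categories = categories

    def get_feature_names_out(self, input_names=None):
        return [f"{name}_{category}"
                for name in input_names
                for category in self.categories[name]]


class ToyScaler:
    """One-to-one transformer: output names equal input names."""

    def get_feature_names_out(self, input_names=None):
        return list(input_names)


class ToyPipeline:
    """Stores the input names and threads them through every step."""

    def __init__(self, steps, feature_names_in):
        self.steps = steps
        # In a real pipeline this would be recorded during fit.
        self.feature_names_in_ = feature_names_in

    def get_feature_names_out(self, input_names=None):
        names = self.feature_names_in_ if input_names is None else input_names
        for step in self.steps:
            names = step.get_feature_names_out(names)
        return names


encoder = ToyOneHot({'letter': ['a', 'b', 'c'], 'pet': ['dog', 'snake']})
toy_pipe = ToyPipeline([encoder, ToyScaler()], ['letter', 'pet'])
names_out = toy_pipe.get_feature_names_out()
print(names_out)
# ['letter_a', 'letter_b', 'letter_c', 'pet_dog', 'pet_snake']
```

Only the pipeline needs to know the input names; each step receives the names produced by the step before it.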
|
||||||
Enabling Functionality | ||||||
###################### | ||||||
|
||||||
The following enhancements are **not** a part of this SLEP. These features are | ||||||
made possible if this SLEP gets accepted. | ||||||
|
||||||
1. As an alternative to slicing, we can add a | ||||||
``Pipeline.get_feature_names_in_at`` method that returns the feature names | ||||||
entering a specific step:: | ||||||
|
||||||
pipe.get_feature_names_in_at(-1) | ||||||
|
||||||
2. This SLEP enables us to implement an ``array_out`` keyword argument for | ||||||
all ``transform`` methods to specify the array container returned by | ||||||
``transform``. An implementation of ``array_out`` requires | ||||||
``feature_names_in_`` to validate that the names in ``fit`` and | ||||||
``transform`` are consistent. It also needs | ||||||
a way to map from the input feature names to output feature names, which is | ||||||
provided by ``get_feature_names_out``. | ||||||
|
||||||
3. An alternative to ``array_out``: Transformers in a pipeline may wish to | ||||||
receive the feature names together with ``X``. This can be enabled by adding an | ||||||
``array_input`` parameter to ``Pipeline``:: | ||||||
|
||||||
pipe = make_pipeline(ct, MyTransformer(), LogisticRegression(), | ||||||
array_input='pandas') | ||||||
|
||||||
In this case, the pipeline will construct a pandas DataFrame to pass | ||||||
into ``MyTransformer`` and ``LogisticRegression``. The feature names | ||||||
will be constructed by calling ``get_feature_names_out`` as data is passed | ||||||
through the ``Pipeline``. This feature implies that the ``Pipeline`` itself | ||||||
is responsible for constructing the DataFrame. | ||||||
|
||||||
Considerations | ||||||
############## | ||||||
|
||||||
1. The names returned by ``get_feature_names_out`` will be constructed using | ||||||
the name generation specification from :ref:`slep_007`. | ||||||
|
||||||
2. For a ``Pipeline`` with only one estimator, slicing with ``pipe[:-1]`` | ||||||
leaves an empty pipeline and will not work, so one | ||||||
would need to access the feature names directly:: | ||||||
|
||||||
pipe = make_pipeline(LogisticRegression()) | ||||||
pipe[-1].feature_names_in_ | ||||||
|
||||||
Backward compatibility | ||||||
###################### | ||||||
|
||||||
1. This SLEP is fully backward compatible with previous versions. With the | ||||||
introduction of ``get_feature_names_out``, ``get_feature_names`` will | ||||||
be deprecated. | ||||||
|
||||||
2. The inclusion of a ``get_feature_names_out`` method will not introduce any | ||||||
overhead to estimators. | ||||||
|
||||||
3. The inclusion of a ``feature_names_in_`` attribute will increase the size of | ||||||
estimators because they would store the feature names. | ||||||
|
||||||
Community Adoption | ||||||
################## | ||||||
|
||||||
We can enforce the ``feature_names_in_`` attribute and the | ||||||
``get_feature_names_out`` method by adding tests to | ||||||
``check_estimator``. | ||||||
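
A check of this kind could look roughly like the sketch below. It is an assumption-laden illustration, not an existing ``check_estimator`` test: ``check_feature_names`` and ``PassThrough`` are hypothetical names, and the input container is mocked with ``types.SimpleNamespace`` so that the example needs only the standard library.

```python
from types import SimpleNamespace


class PassThrough:
    """Toy transformer that follows the proposed API (hypothetical)."""

    def fit(self, X, y=None):
        columns = getattr(X, "columns", None)
        self.feature_names_in_ = None if columns is None else list(columns)
        return self

    def get_feature_names_out(self, input_names=None):
        names = self.feature_names_in_ if input_names is None else input_names
        return list(names)


def check_feature_names(transformer, X):
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    transformer.fit(X)
    if not hasattr(transformer, "feature_names_in_"):
        problems.append("feature_names_in_ is not set after fit")
    elif transformer.feature_names_in_ != list(X.columns):
        problems.append("feature_names_in_ does not match the input columns")
    if not callable(getattr(transformer, "get_feature_names_out", None)):
        problems.append("get_feature_names_out is missing")
    return problems


# Stand-in for a DataFrame: only the ``columns`` attribute is needed here.
X = SimpleNamespace(columns=["letter", "pet", "distance"])
problems = check_feature_names(PassThrough(), X)
print(problems)  # []
```

Returning a list of problems rather than raising on the first one is a design choice of this sketch; it lets a third-party maintainer see every missing piece at once.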
|
||||||
Alternatives | ||||||
############ | ||||||
|
||||||
There have been many attempts to address this issue: | ||||||
|
||||||
1. ``array_out`` keyword parameter in ``transform`` : This approach requires | ||||||
third party estimators to unwrap and wrap array containers in transform, | ||||||
which introduces more burden for third party estimator maintainers. | ||||||
Furthermore, ``array_out`` with sparse data will introduce an overhead when | ||||||
being passed along in a ``Pipeline``. This overhead comes from the | ||||||
construction of the sparse data container that has the feature names. | ||||||
|
||||||
2. :ref:`slep_007` : ``SLEP007`` introduces a ``feature_names_out_`` attribute | ||||||
while this SLEP proposes a ``get_feature_names_out`` method to accomplish | ||||||
the same task. The benefit of the ``get_feature_names_out`` method is that | ||||||
it can be used even if the feature names were not passed in ``fit`` with a | ||||||
dataframe. For example, in a ``Pipeline`` the feature names are not passed | ||||||
through to each step and a ``get_feature_names_out`` method can be used to | ||||||
get the names of each step with slicing. | ||||||
|
||||||
3. :ref:`slep_012` : The ``InputArray`` was developed to work around the | ||||||
overhead of using a pandas ``DataFrame`` or an xarray ``DataArray``. The | ||||||
introduction of another data structure into the Python data ecosystem would | ||||||
place more burden on third party estimator maintainers. | ||||||
|
||||||
|
||||||
References and Footnotes | ||||||
######################## | ||||||
|
||||||
.. [1] Each SLEP must either be explicitly labeled as placed in the public | ||||||
domain (see this SLEP as an example) or licensed under the `Open | ||||||
Publication License`_. | ||||||
|
||||||
.. _Open Publication License: https://www.opencontent.org/openpub/ | ||||||
|
||||||
|
||||||
Copyright | ||||||
######### | ||||||
|
||||||
This document has been placed in the public domain. [1]_ |