scikit-learn · adrinjalali · Nov 30, 2022 · Oct 3, 2020 · Oct 5, 2020 · Oct 6, 2020
diff --git a/index.rst b/index.rst
@@ -12,6 +12,7 @@
     slep007/proposal
     slep012/proposal
     slep013/proposal
+    slep015/proposal
 
 .. toctree::
     :maxdepth: 1

diff --git a/slep015/proposal.rst b/slep015/proposal.rst
@@ -0,0 +1,179 @@
+.. _slep_015:
+
+==================================
+SLEP015: Feature Names Propagation
+==================================
+
+:Author: Thomas J Fan
+:Status: Draft
+:Type: Standards Track
+:Created: 2020-10-03
+
+Abstract
+########
+
+This SLEP proposes adding the ``feature_names_in_`` attribute for all estimators
+and the ``get_feature_names_out`` method to all transformers.
+
+Motivation
+##########
+
+``scikit-learn`` is commonly used as a part of a larger data processing
+pipeline. When this pipeline is used to transform data, the result is a
+NumPy array, discarding column names. The current workflow for
+extracting the feature names requires calling ``get_feature_names`` on the
+transformer that created the feature. This interface can be cumbersome when used
+together with a ``Pipeline`` with multiple column names::
+
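+    import numpy as np
+    import pandas as pd
+    from sklearn.compose import ColumnTransformer
+    from sklearn.linear_model import LogisticRegression
+    from sklearn.pipeline import make_pipeline
+    from sklearn.preprocessing import OneHotEncoder, StandardScaler
+
+    X = pd.DataFrame({'letter': ['a', 'b', 'c'],
+                      'pet': ['dog', 'snake', 'dog'],
+                      'distance': [1, 2, 3]})
+    y = [0, 0, 1]
+    orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['distance']
+
+    ct = ColumnTransformer(
+        [('cat', OneHotEncoder(), orig_cat_cols),
+         ('num', StandardScaler(), orig_num_cols)])
+    pipe = make_pipeline(ct, LogisticRegression()).fit(X, y)
+
+    # The encoder is registered under the name 'cat' in the ColumnTransformer.
+    cat_names = (pipe['columntransformer']
+                 .named_transformers_['cat']
+                 .get_feature_names(orig_cat_cols))
+
+    feature_names = np.r_[cat_names, orig_num_cols]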
+
+The ``feature_names`` extracted above correspond to the features directly
+passed into ``LogisticRegression``. As demonstrated above, the process of
+extracting ``feature_names`` requires knowing the order of the selected
+categories in the ``ColumnTransformer``. Furthermore, if there is feature
+selection in the pipeline, such as ``SelectKBest``, the ``get_support`` method
+would need to be used to infer the column names that were selected.
+
+Solution
+########
+
+This SLEP proposes adding the ``feature_names_in_`` attribute to all estimators
+that will extract the feature names of ``X`` during ``fit``. This will also
+be used for validation during non-``fit`` methods such as ``transform`` or
+``predict``. If ``X`` is not a recognized container with columns, then
+``feature_names_in_`` would be set to ``None``.
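The behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions, not scikit-learn code: ``ToyTransformer`` and ``column_names`` are hypothetical names, and a ``dict`` of columns stands in for a container with column names, while a list of rows stands in for a plain array.

```python
def column_names(X):
    """Hypothetical helper: return column names for a dict of columns, else None."""
    return list(X) if isinstance(X, dict) else None


class ToyTransformer:
    """Hypothetical transformer that records and validates feature names."""

    def fit(self, X):
        # Names are extracted during fit; None if X carries no column names.
        self.feature_names_in_ = column_names(X)
        return self

    def transform(self, X):
        names = column_names(X)
        # Validate only when both fit and transform saw named columns.
        if (self.feature_names_in_ is not None and names is not None
                and names != self.feature_names_in_):
            raise ValueError(
                f"Feature names {names} do not match those seen in fit: "
                f"{self.feature_names_in_}"
            )
        return X


named = ToyTransformer().fit({"letter": ["a"], "pet": ["dog"]})
unnamed = ToyTransformer().fit([["a", "dog"]])
```

Here ``named.feature_names_in_`` holds the column names in their original order, ``unnamed.feature_names_in_`` is ``None``, and calling ``named.transform`` with the columns reordered raises a ``ValueError``.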
+
+Secondly, this SLEP proposes adding ``get_feature_names_out(input_names=None)``
+to all transformers. By default, the input features will be determined by the
+``feature_names_in_`` attribute. The feature names of a pipeline can then be
+easily extracted as follows::
+
+    pipe[:-1].get_feature_names_out()
+    # ['cat__letter_a', 'cat__letter_b', 'cat__letter_c',
+    #  'cat__pet_dog', 'cat__pet_snake', 'num__distance']
+
+Note that ``get_feature_names_out`` does not require ``input_names``
+because the feature names were stored in the pipeline itself. These
+names will be passed to each step's ``get_feature_names_out`` method in turn to
+obtain the output feature names of the ``Pipeline`` itself.
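The chaining of ``get_feature_names_out`` through the steps can be sketched as follows. This is a minimal dependency-free illustration, assuming each step exposes ``get_feature_names_out(input_names)``; ``ToyOneHot``, ``ToySelector`` and ``pipeline_feature_names_out`` are hypothetical names, and the naming scheme is simplified rather than the one specified in SLEP007.

```python
class ToyOneHot:
    """Hypothetical encoder: one output name per (column, category) pair."""

    def __init__(self, categories):
        self.categories = categories  # dict: column name -> list of categories

    def get_feature_names_out(self, input_names=None):
        names = list(self.categories) if input_names is None else input_names
        return [f"{name}_{cat}" for name in names for cat in self.categories[name]]


class ToySelector:
    """Hypothetical feature selector driven by a boolean support mask."""

    def __init__(self, support):
        self.support = support

    def get_feature_names_out(self, input_names):
        return [name for name, keep in zip(input_names, self.support) if keep]


def pipeline_feature_names_out(steps, input_names):
    """Feed each step's output names into the next step."""
    names = input_names
    for step in steps:
        names = step.get_feature_names_out(names)
    return names


steps = [
    ToyOneHot({"letter": ["a", "b", "c"], "pet": ["dog", "snake"]}),
    ToySelector([True, False, True, True, False]),
]
names = pipeline_feature_names_out(steps, ["letter", "pet"])
# names == ['letter_a', 'letter_c', 'pet_dog']
```

The selector step shows why no manual ``get_support`` bookkeeping is needed: each step only has to know how to map its own input names to its own output names.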
+
+Enabling Functionality
+######################
+
+The following enhancements are **not** a part of this SLEP. These features are
+made possible if this SLEP gets accepted.
+
+1. We can add a ``Pipeline.get_feature_names_in_at`` method to get the names
+   at a specific step. This can be a simple alternative to slicing::
+
+      pipe.get_feature_names_in_at(-1)
+
+2. This SLEP enables us to implement an ``array_out`` keyword argument to
+   all ``transform`` methods to specify the array container returned by
+   ``transform``. An implementation of ``array_out`` requires
+   ``feature_names_in_`` to validate that the names in ``fit`` and
+   ``transform`` are consistent. It also needs
+   a way to map from the input feature names to output feature names, which is
+   provided by ``get_feature_names_out``.
+
+3. An alternative to ``array_out``: Transformers in a pipeline may wish to have
+   feature names passed in with ``X``. This can be enabled by adding an
+   ``array_input`` parameter to ``Pipeline``::
+
+        pipe = make_pipeline(ct, MyTransformer(), LogisticRegression(),
+                             array_input='pandas')
+
+   In this case, the pipeline will construct a pandas DataFrame to be passed
+   into ``MyTransformer`` and ``LogisticRegression``. The feature names
+   will be constructed by calling ``get_feature_names_out`` as data is passed
+   through the ``Pipeline``. This feature implies that ``Pipeline`` is
+   doing the construction of the DataFrame.
+
+Considerations
+##############
+
+1. The names returned by ``get_feature_names_out`` will be constructed using
+   the name generation specification from :ref:`slep_007`.
+
+2. For a ``Pipeline`` with only one estimator, slicing will not work and one
+   would need to access the feature names directly::
+
+      pipe = make_pipeline(LogisticRegression())
+      pipe[-1].feature_names_in_
+
+Backward compatibility
+######################
+
+1. This SLEP is fully backward compatible with previous versions. With the
+   introduction of ``get_feature_names_out``, ``get_feature_names`` will
+   be deprecated.
+
+2. The inclusion of a ``get_feature_names_out`` method will not introduce any
+   overhead to estimators.
+
+3. The inclusion of a ``feature_names_in_`` attribute will increase the size of
+   estimators because they would store the feature names.
+
+Community Adoption
+##################
+
+We can enforce the ``feature_names_in_`` attribute and
+``get_feature_names_out`` method with additional tests to
+``check_estimator``.
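As a rough idea of what such a test might verify, the sketch below checks that a fitted transformer records its input names and reports one output name per output column. Both ``check_feature_names_api`` and ``ToyIdentity`` are hypothetical and are not existing scikit-learn functions; a ``dict`` of columns again stands in for a container with column names.

```python
class ToyIdentity:
    """Hypothetical pass-through transformer used only to exercise the check."""

    def fit(self, X):
        self.feature_names_in_ = list(X) if isinstance(X, dict) else None
        return self

    def transform(self, X):
        return X

    def get_feature_names_out(self, input_names=None):
        names = self.feature_names_in_ if input_names is None else input_names
        return list(names)


def check_feature_names_api(transformer, X_named):
    """Hypothetical check; X_named is a dict mapping column name -> values."""
    transformer.fit(X_named)
    if transformer.feature_names_in_ != list(X_named):
        raise AssertionError("feature_names_in_ does not match the input columns")
    names_out = transformer.get_feature_names_out()
    n_columns_out = len(transformer.transform(X_named))
    if len(names_out) != n_columns_out:
        raise AssertionError("expected one output name per output column")
    return names_out


checked = check_feature_names_api(ToyIdentity(), {"letter": ["a"], "pet": ["dog"]})
```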
+
+Alternatives
+############
+
+There have been many attempts to address this issue:
+
+1. ``array_out`` keyword parameter in ``transform``: This approach requires
+   third party estimators to unwrap and wrap array containers in transform,
+   which introduces more burden for third party estimator maintainers.
+   Furthermore, ``array_out`` with sparse data will introduce an overhead when
+   being passed along in a ``Pipeline``. This overhead comes from the
+   construction of the sparse data container that has the feature names.
+
+2. :ref:`slep_007` : ``SLEP007`` introduces a ``feature_names_out_`` attribute
+   while this SLEP proposes a ``get_feature_names_out`` method to accomplish
+   the same task. The benefit of the ``get_feature_names_out`` method is that
+   it can be used even if the feature names were not passed in ``fit`` with a
+   dataframe. For example, in a ``Pipeline`` the feature names are not passed
+   through to each step and a ``get_feature_names_out`` method can be used to
+   get the names of each step with slicing.
+
+3. :ref:`slep_012` : The ``InputArray`` was developed to work around the
+   overhead of using a pandas ``DataFrame`` or an xarray ``DataArray``. The
+   introduction of another data structure into the Python data ecosystem would
+   lead to more burden for third party estimator maintainers.
+
+
+References and Footnotes
+########################
+
+.. [1] Each SLEP must either be explicitly labeled as placed in the public
+   domain (see this SLEP as an example) or licensed under the `Open
+   Publication License`_.
+
+.. _Open Publication License: https://www.opencontent.org/openpub/
+
+
+Copyright
+#########
+
+This document has been placed in the public domain. [1]_