
SLEP015: Feature Names Propagation #48

Merged

Conversation

thomasjpfan (Member)

This SLEP details how a feature_names_in_ attribute and a get_feature_names_out method can be used together to obtain feature name propagation. I see two main goals for having this feature:

  1. Inspect the feature names passed into the last step of a pipeline. This is addressed as the motivating problem in the SLEP.
  2. Be able to use pandas DataFrames or xarray DataArrays as input for estimators in a pipeline. This allows estimators to do something special based on the name of a feature. This is addressed in the "Enabling Functionality" section with a new parameter in Pipeline.

Related to scikit-learn/scikit-learn#18444 - get_feature_names_out PR
Related to scikit-learn/scikit-learn#16772 - array_out PR
Related to scikit-learn/scikit-learn#18010 - feature_names_out_ PR

CC @scikit-learn/core-devs
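To make the interplay concrete, here is a toy sketch in pure Python. This is not the scikit-learn API — the class, the feature_names keyword, and the selection logic are all invented for illustration — but it shows the intended relationship: names stored at fit time feed the computation of output names.

```python
class SelectFirstTwo:
    """Mock transformer that keeps only the first two input features."""

    def fit(self, X, feature_names=None):
        if feature_names is not None:
            # Store the input names seen during fit.
            self.feature_names_in_ = list(feature_names)
        return self

    def get_feature_names_out(self):
        # Output names are derived from the stored input names.
        return self.feature_names_in_[:2]


est = SelectFirstTwo().fit(None, feature_names=["age", "height", "weight"])
print(est.get_feature_names_out())  # ['age', 'height']
```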

@adrinjalali (Member) left a comment

I'm a bit confused as to what your proposal really is. This sounds to me much closer to Andy's approach, but that doesn't pass the feature names during fit down the pipeline. Maybe you could expand on those bits a bit?

attribute. The feature names of a pipeline can then be easily extracted as
follows::

pipe[:-1].get_feature_names_out()
Member

and maybe mention?

pipe[-1].feature_names_in_

Comment on lines 90 to 93
In this case, the pipeline will construct a pandas DataFrame to be inputted
into ``MyTransformer`` and ``LogisticRegression``. The feature names
will be constructed by calling ``get_feature_names_out`` as data is passed
through the ``Pipeline``.
Member

This implies it's the pipeline doing it. Or do you mean a pandas DF is passed around as your PR?

Member Author

I am implying that Pipeline is doing this. I updated the doc to make this point explicit. I also added more details on how this SLEP relates to array_out.

Comment on lines 98 to 103
This SLEP requires all estimators to store ``feature_names_in_``, which
will increase the size of the estimators. By default, a
``Pipeline`` will only store ``feature_names_in_`` in the first step and
the rest can be computed by slicing the pipeline at different steps. In other
words, the additional space used will be minimal because only the
input feature names from the first step are stored.
Member

Then are we storing feature names in each step or not? This paragraph is a bit confusing to me.

Member

Also, slicing from the middle of the pipeline would lose the feature names.

Member

Or we need to implement a __getitem__ that recreates feature_names_in_ when the start of the slice is not 0?

Member Author

Also, slicing from the middle of the pipeline would lose the feature names.

We can make this work by constructing the feature_names_in_ and attaching it to the new pipeline.

@thomasjpfan (Member Author)

@adrinjalali I'm a bit confused as to what your proposal really is. This sounds to me much closer to Andy's approach, but that doesn't pass the feature names during fit down the pipeline. Maybe you could expand on those bits a bit?

One of the reasons I wrote this SLEP was scikit-learn/scikit-learn#16772 (comment). It highlighted the performance issues for sparse data which is related to the InputArray idea.

One can consider this SLEP a stepping stone toward passing feature names down the pipeline. Furthermore this SLEP is a prerequisite for the array_out PR:

  1. feature_names_in_ is required to make sure fit and transform have the same feature names.
  2. get_feature_names_out needs to be implemented everywhere in the array_out PR, but privately.

This SLEP extends Andy's idea by including the interaction between feature_names_in_ and get_feature_names_out. Since every estimator has the ability to store feature_names_in_, users would not need to pass the input names to get_feature_names_out in a pipeline or with a single estimator to get the output names.

Yes, not every step in a pipeline would have the names, but these names can be obtained through slicing, and if an estimator needs the names in fit, we can add a parameter in Pipeline to tell it to construct the dataframe which it will pass to one of its steps. (This is an idea proposed in this SLEP but not a part of this SLEP.)

TLDR: I think this SLEP is simpler and resolves some of the pain points we have with feature names.

@thomasjpfan (Member Author)

Updated the SLEP to state that the Pipeline will have feature_names_in_.

@lorentzenchr (Member) left a comment

@thomasjpfan Thanks for working on this! I've mainly nitpicks.

be deprecated.

The inclusion of ``get_feature_names_out`` and ``feature_names_in_`` will
not introduce any overhead to ``Pipeline``.
Member

Is it possible to have a slight overhead for very wide data with a lot of columns?

@ogrisel ogrisel self-requested a review October 9, 2020 14:47
@adrinjalali (Member) left a comment

I'm happy to have this merged. And looks good to me for a vote from my side :)

@jnothman (Member) left a comment

I think this PR is good overall, but it needs to:

  • specify the data type of feature_names_in_ and the return type of get_feature_names_out: are they necessarily lists? any Sequence? are arrays permitted/required?
  • address concerns about memory

If we are passing around a wide sparse matrix, and generating feature_names_in_ at each step in a pipeline, this could be consuming a lot of memory unnecessarily. Should we be introducing something like the following to avoid unnecessary memory usage with self.feature_names_in_ = DefaultFeatureNames(X.shape[1])?

import collections.abc

class DefaultFeatureNames(collections.abc.Sequence):
    """Lazily generate default names ("x0", "x1", ...) without storing them."""

    def __init__(self, n_features):
        self.n_features = n_features

    def __len__(self):
        return self.n_features

    def __getitem__(self, sl):
        if isinstance(sl, slice):
            # Respect the requested slice rather than materializing all names.
            return [f"x{i}" for i in range(*sl.indices(self.n_features))]
        elif isinstance(sl, tuple):
            raise NotImplementedError
        if sl < 0:  # support negative indexing like a normal sequence
            sl += self.n_features
        if not 0 <= sl < self.n_features:
            raise IndexError(sl)
        return f"x{sl}"

    def __iter__(self):
        return (f"x{i}" for i in range(self.n_features))

extracting ``feature_names`` requires knowing the order of the selected
categories in the ``ColumnTransformer``. Furthermore, if there is feature
selection in the pipeline, such as ``SelectKBest``, the ``get_support`` method
would need to be used to select column names that were selected.
Member

Suggested change
would need to be used to select column names that were selected.
would need to be used to infer the column names that were selected.

made possible if this SLEP gets accepted.

1. As an alternative to slicing, we can add a
``Pipeline.get_feature_names_in_at`` method to get the names at a specific
Member

I find this name unpleasant, and don't see what's so much better than Pipeline[-1].feature_names_in_

Member Author

This will not work by default because the final step would not have the feature names if there is more than one step:

pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
# pipe1[-1].feature_names_in_ does not exist

pipe2 = make_pipeline(LogisticRegression())
# pipe2[-1].feature_names_in_ does exist

This proposal does not actually pass through the names at each step. Only the pipeline and the first step will have access to the input names.

I'll remove this point to make this SLEP shorter.

all ``transform`` methods to specify the array container outputted by
``transform``. An implementation of ``array_out`` requires
``feature_names_in_`` to validate that the names in ``fit`` and
``transform`` are consistent. With the implementation of ``array_out`` needs
Member

Suggested change
``transform`` are consistent. With the implementation of ``array_out`` needs
``transform`` are consistent. An implementation of ``array_out`` needs

This SLEP proposes adding the ``feature_names_in_`` attribute to all estimators
that will extract the feature names of ``X`` during ``fit``. This will also
be used for validation during non-``fit`` methods such as ``transform`` or
``predict``. If the ``X`` is not a recognized container, then
Member

Suggested change
``predict``. If the ``X`` is not a recognized container, then
``predict``. If the ``X`` is not a recognized container with columns, then

1. The ``get_feature_names_out`` will be constructed using the name generation
specification from :ref:`slep_007`.

2. For a ``Pipeline`` with only one estimator, slicing will not work and one
Member

I find this confusing. You're saying slicing will not work, but then showing an example with slicing? Or are you distinguishing slicing from indexing. What does slicing have to do with anything anyway??

Member Author

I wanted to distinguish between the following two pipelines:

pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
pipe1[:-1].get_feature_names_out()  # this works

pipe2 = make_pipeline(LogisticRegression())
pipe2[:-1].get_feature_names_out()  # does not work

pipe2[:-1] fails because the slicing produces a pipeline with no steps. That said, we could allow pipelines with no steps so that pipe2[:-1].get_feature_names_out() works.

Member

Okay, I agree this is a strange corner case, since it is the only way to construct a fitted empty Pipeline...

@jnothman (Member) left a comment

This proposal does not actually pass through the names at each step. Only the pipeline and the first step will have access to the input names.

Oh! I don't think I understood this at all from the SLEP. Isn't that contradicted by "This SLEP proposes adding the feature_names_in_ attribute to all estimators that will extract the feature names of X during fit."?

@thomasjpfan (Member Author)

This SLEP proposes adding the feature_names_in_ attribute to all estimators that will extract the feature names of X during fit.

I was thinking of the "outer"-most layer of the API. For non-meta estimators, feature_names_in_ would be defined if the feature names can be extracted from X.

For metaestimators, such as pipeline, they would have pipe.feature_names_in_ defined but we would not guarantee that inner estimators would have feature_names_in_ defined.

@jnothman (Member)

jnothman commented Oct 14, 2020 via email

@thomasjpfan (Member Author)

thomasjpfan commented Oct 14, 2020

Neither of these limitations are discussed in the SLEP afaict.

Thank you for your thoughts, I'll update the SLEP accordingly.

I don't really get how you define "outer"

I was using "outer" and "inner" in the context of a metaestimator. The metaestimator would be the "outer" estimator, while all estimators inside the metaestimator are "inner" estimators.

if feature_names_in_ was only set when X was in a format that had names attached.

This is the form I was considering.

unless either feature_names_in_ was being set not by an estimator in its fit method, but by the caller of that estimator's fit method

A metaestimator could have all sorts of estimators that it uses internally. I think having the metaestimator be responsible for setting feature_names_in_ for all its estimators would be too much of a requirement for metaestimator developers.

This SLEP is trying to propose the bare minimum: if X has names and is passed to fit, then the estimator stores them as feature_names_in_.

Metaestimators are a case where I would want to delegate this responsibility to the inner estimators. This way the metaestimator can construct metaestimator.feature_names_in_ for itself. In the case of Pipeline, it would be the feature_names_in_ of the first step.
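The delegation described here can be sketched with toy classes (illustrative names only, not scikit-learn internals): the meta-estimator does not set names on its inner steps, it merely exposes the first step's names as its own.

```python
class MockStep:
    """Mock estimator that stores input names when they are available."""

    def fit(self, X, feature_names=None):
        if feature_names is not None:
            self.feature_names_in_ = list(feature_names)
        return self


class MockPipeline:
    """Mock meta-estimator delegating feature_names_in_ to its first step."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, feature_names=None):
        # Only the first step sees the input names; later steps do not.
        self.steps[0].fit(X, feature_names=feature_names)
        for step in self.steps[1:]:
            step.fit(X)
        return self

    @property
    def feature_names_in_(self):
        # The pipeline's own names are those of its first step.
        return self.steps[0].feature_names_in_


pipe = MockPipeline([MockStep(), MockStep()]).fit(None, feature_names=["a", "b"])
print(pipe.feature_names_in_)                        # ['a', 'b']
print(hasattr(pipe.steps[1], "feature_names_in_"))   # False
```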

@thomasjpfan (Member Author)

specify the data type of feature_names_in_ and the return type of get_feature_names_out: are they necessarily lists? any Sequence? are arrays permitted/required?

I was considering any Sequence: ffd0954 (#48).

@jnothman I have updated the SLEP to address your concerns:

  1. Regarding meta-estimators: 7537b15 (#48)
  2. Abstract now explicitly states that feature_names_in_ are set in fit.
  3. An example of how slicing can fail with a one-step pipeline was added.

@@ -121,7 +121,10 @@ Considerations and Limitations
a pipeline with no steps. We can work around this by allowing pipelines
with no steps.

3. Meta-estimators will delegate the setting and validation of
3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as an list or
Member

But ndarray is not a sequence: numpy/numpy#2776

Member Author

Maybe "Iterable that returns a string" would be enough.

In our discussions, I think we want to make sure the feature names are strings.

@jnothman (Member) Oct 14, 2020

Hmm. We'd better accept Sequences and 1d array-likes whose elements are strings: pd.Index is not a Sequence.

Member

Suggested change
3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as an list or
3. ``feature_names_in_`` can be any 1d array-like of strings, such as a list or

@jnothman (Member)

jnothman commented Oct 14, 2020

So now my understanding (from conversation, not reviewing the SLEP) is that feature_names_in_ should be set directly by every estimator that cannot delegate it.

Let's clarify:

  • When input is a list of strings, feature_names_in_ = None.
  • When input is a pd.DataFrame, feature_names_in_ = X.columns. Or is it X.columns.copy()?
  • Do we provide ducktyped support for anything with .columns?
  • When input is a 2d array or sparse matrix, feature_names_in_ is what?
  • In the absence of a cross-library API for frame-like implementations, what happens when we don't support extracting column names from some data type in one version, but then add the capability in the next release. How does that affect other behaviours?
  • How does a third party library know which data types are supported for storing column names? Do we give them a helper function to store/generate names in all supported cases?

@thomasjpfan (Member Author)

When input is a list of strings, feature_names_in_ = None.

If we want to be consistent with n_feature_names this would be the case. I think I would prefer it to not be defined, which means there is no validation in non-fit methods.

When input is a pd.DataFrame, feature_names_in_ = X.columns. Or is it X.columns.copy()?

It would be safer to copy, but I would prefer not to.

When input is a 2d array or sparse matrix, feature_names_in_ is what?

Not defined and will not validate.

Do we provide ducktyped support for anything with .columns?

In the absence of a cross-library API for frame-like implementations, what happens when we don't support extracting column names from some data type in one version, but then add the capability in the next release. How does that affect other behaviours?

I think the safest thing to do is to restrict this SLEP to pandas until we get a frame-like protocol. When this frame-like protocol is defined, then we can say we support frame-like objects.

How does a third party library know which data types are supported for storing column names? Do we give them a helper function to store/generate names in all supported cases?

In scikit-learn/scikit-learn#18010, I am pushing for using _validate_data to handle this. Internally, a third party estimator can also use _check_feature_names.

I think we should adjust the SLEP to make feature_names_in_ optional. We can increase its adoption by demonstrating how useful it is with pipelines and other meta-estimators.

@jnothman (Member)

jnothman commented Oct 17, 2020 via email

@jnothman (Member) left a comment

Regarding the discussion above, we need to make it clear that support for this is optional outside of the core library... Or we need a new discussion about how to make _validate_data's capabilities publicly available (either by making it a public function in sklearn.utils, or by defining a "protected" estimator class API).

NumPy array, discarding column names. The current workflow for
extracting the feature names requires calling ``get_feature_names`` on the
transformer that created the feature. This interface can be cumbersome when used
together with a pipeline with multiple column names::
Member

Suggested change
together with a pipeline with multiple column names::
together with a Pipeline with multiple column names::

@@ -121,7 +121,10 @@ Considerations and Limitations
a pipeline with no steps. We can work around this by allowing pipelines
with no steps.

3. Meta-estimators will delegate the setting and validation of
3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as an list or
Member

Suggested change
3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as an list or
3. ``feature_names_in_`` can be any 1d array-like of strings, such as a list or

with no steps.

3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as an list or
an ndarray.
Member

It might be worth noting that this allowance can avoid unnecessary memory consumption/copies, with reduced implementation complexity, although it may reduce usability a bit.
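The memory point can be demonstrated with a lazy sequence in the spirit of the DefaultFeatureNames idea discussed earlier in this thread (LazyNames is an illustrative name): generating default names on demand keeps the attribute O(1) in size, while a materialized list does not.

```python
import sys


class LazyNames:
    """Generate default names on demand instead of storing a million strings."""

    def __init__(self, n_features):
        self.n_features = n_features

    def __len__(self):
        return self.n_features

    def __getitem__(self, i):
        return f"x{i}"


n = 1_000_000
lazy = LazyNames(n)
materialized = [f"x{i}" for i in range(n)]

print(lazy[5])  # x5
print(sys.getsizeof(lazy) < sys.getsizeof(materialized))  # True
```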

@lorentzenchr (Member)

Is this PR superseded by acceptance of SLEP007 in #59?

@adrinjalali (Member)

Is this PR superseded by acceptance of SLEP007 in #59?

No, that one is about how the feature names are generated, this one is about how they're propagated in a pipeline.

@lorentzenchr (Member)

Does this supersede #18, meaning it's an alternative?

@adrinjalali (Member)

I'd say this is yet another effort to do the same thing #18 wanted to do, but much more up to date, after having gone down a few paths and not gotten anywhere. I would need to read both again to say whether this one is an alternative or just supersedes the other one.

@adrinjalali (Member)

Also, I'd be happy to merge both of them and continue discussing on a separate issue.

@thomasjpfan (Member Author)

thomasjpfan commented Dec 6, 2021

A bunch of SLEP015 is already included in SLEP007, which has already been accepted. What SLEP015 adds is an API for actually outputting pandas DataFrames.

I think #18 (SLEP008) has more or less been superseded by SLEP007.

@adrinjalali (Member)

Back when we were writing SLEPs 7 and 8, I wrote 7 to talk only about how we create the feature names, and 8 was more about the options we have for propagating them. Since then, I worked on a sklearn dataframe kind of object which we decided not to do, then worked on xarray, then Thomas worked on pandas, and the API also kind of evolved over time. I think we could at least focus on this SLEP for now, and then figure out what to do with other containers. We've gone back and forth on which container to use, or whether to use one at all and instead propagate feature names alongside the data rather than with the data in a container.

@amueller (Member)

This has been superseded by SLEP 18 #72, right?

@thomasjpfan (Member Author)

I updated this PR so that this SLEP is now rejected.

@adrinjalali (Member)

Thanks @thomasjpfan . Then I think we can merge.

@adrinjalali adrinjalali merged commit 221362b into scikit-learn:master Nov 30, 2022