Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Adds feature_names_out to impute module #21078

Merged
merged 35 commits into from
Oct 21, 2021

Conversation

thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented Sep 17, 2021

Reference Issues/PRs

Continues #18444
Fixes #21200

What does this implement/fix? Explain your changes.

Adds feature_names_out to the impute module.

Any other comments?

There is an edge case I am concerned about:

from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np

marker = np.nan
X = np.array(
    [
        [marker, 2],
        [2, marker],
        [6, 3],
        [1, 2],
    ]
)
X_df = pd.DataFrame(X, columns=["a", "indicator_a"])

imputer = SimpleImputer(add_indicator=True)
imputer.fit_transform(X_df)
imputer.get_feature_names_out()
# array(['a', 'indicator_a', 'indicator_a', 'indicator_indicator_a'], dtype=object)

For this edge case, we can add sklearn to the prefix resulting in sklearn_indicator_a?

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thomasjpfan Thanks for your dedication for feature names!

Transformed feature names.
"""
input_features = _check_feature_names_in(self, input_features)
return input_features[self.features_]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is our rule for when to prefix and when not? The indicator output column of SimpleImputer is prefixed with "indicator_".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we have a rule. I went with SimpleImputer prefixing "indicator_" because it needed to distinguish between the imputed features and the indicator.

Looking at this again, I think we can go with MissingIndcator to add the prefix. This way SimpleImputer only needs to combine them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think this is fine to preserve the original names for the imputed features.

sklearn/impute/tests/test_common.py Outdated Show resolved Hide resolved
sklearn/impute/tests/test_common.py Outdated Show resolved Hide resolved
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Apart from the previous review comments, I would like to use a more explicit prefix:

sklearn/impute/_base.py Outdated Show resolved Hide resolved
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Remark: The convention introduced here is tor prefix with prefix = self.__class__.__name__.lower().

@lorentzenchr
Copy link
Member

Before merging, should we wait for #21334 and then also use _ClassNamePrefixFeaturesOutMixin or _generate_get_feature_names_out?

@thomasjpfan
Copy link
Member Author

Before merging, should we wait for #21334 and then also use _ClassNamePrefixFeaturesOutMixin or _generate_get_feature_names_out?

I do not think we need to wait. This PR is slightly different. MissingIndicator prefixes the input feature names, i.e. missingindicator_myfeature, while #21334 and _generate_get_feature_names_out generates all new names, i.e. pca0, pca1, etc.

@lorentzenchr lorentzenchr merged commit 8f621ad into scikit-learn:main Oct 21, 2021
@glemaitre glemaitre mentioned this pull request Oct 23, 2021
10 tasks
@timotk
Copy link

timotk commented Oct 28, 2021

FYI: This PR has been merged into the main branch, but is NOT included in the 1.0.1 release. I thought it was included because it was linked in the 1.0.1 release, but the specific commit is actually dropped and not picked:

drop 8f621ad ENH Adds feature_names_out to impute module (#21078)

Took me quite a while to figure that out.

samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
@ademyanchuk
Copy link

ademyanchuk commented Jan 20, 2022

FYI: This PR has been merged into the main branch, but is NOT included in the 1.0.1 release

It is still not included 1.0.2. Should it be this way?

@timotk
Copy link

timotk commented Jan 20, 2022

@ademyanchuk Yes, it should be this way. I believe scikit-learn follows the process of semantic versioning:

  1. The first number increases with major new functionality/changes
  2. the second number for minor new features/changes.
  3. third number is reserved for fixes/patches, so no new features.

@ademyanchuk
Copy link

Thank you, @timotk
Make sense, I have already switched to nightly for now 😊

@lorentzenchr
Copy link
Member

As info: Scikit-learn does not follow SemVer. But new features as this one, are usually rolled out in a minor version like 1.1.0 or 1.2.0. Bugfix versions like 1.0.1 are for bugfixes only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement get_feature_names_out for SimpleImputer
5 participants