API Implements get_feature_names_out for transformers that support get_feature_names #18444

Merged (121 commits) on Sep 7, 2021
Changes from all commits (121 commits)
ab2acbd
work on get_feature_names for pipeline
amueller Nov 20, 2018
3bc674b
fix SimpleImputer get_feature_names
amueller Nov 20, 2018
1c4a78f
use hasattr(transform) to check whether to use final estimator in get…
amueller Nov 20, 2018
7881930
add some docstrings
amueller Nov 20, 2018
de63353
fix docstring
amueller Nov 27, 2018
8835f3b
Merge branch 'master' into pipeline_get_feature_names
amueller Feb 27, 2019
2eba5de
fix merge issues with master
amueller May 30, 2019
449ed23
fix merge issue
amueller May 31, 2019
a1fcf67
Merge branch 'master' into pipeline_get_feature_names
amueller May 21, 2020
b929341
don't do magic slicing in pipeline.get_feature_names
amueller May 21, 2020
2b613e5
fix merge issue
amueller May 21, 2020
ad66b86
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
amueller May 24, 2020
5eb7603
trying to merge with input feature pr
amueller Jun 2, 2020
f4f832a
Merge branch 'master' into pipeline_get_feature_names
amueller Jun 2, 2020
3a9054c
remove tests taht don't apply
amueller Jun 2, 2020
9c4420d
Merge branch 'pipeline_get_feature_names' of github.com:amueller/scik…
amueller Jun 2, 2020
76f5b54
fix onetoone mixing feature names
amueller Jun 2, 2020
52f38e1
remove more tests
amueller Jun 2, 2020
cdda1fb
fix test for better expected outputs
amueller Jun 2, 2020
5f4abbc
fix priorities in catch-all get_feature_names
amueller Jun 2, 2020
4305a28
flake8
amueller Jun 2, 2020
c387b5b
remove redundant code
amueller Jun 2, 2020
2fefb67
fix error message
amueller Jun 2, 2020
a6832c3
fix mixin order
amueller Jun 2, 2020
0f45b22
small refactor with helper function
amueller Jun 2, 2020
4717a73
linting for new options
amueller Jun 3, 2020
a658ba7
add feature names to lineardiscriminantanalysis and birch
amueller Jun 3, 2020
e9e45af
add get_feature_names in a couple more places
amueller Jun 3, 2020
5acaced
fix up docs
amueller Jun 3, 2020
0353f69
make example actually work
amueller Jun 3, 2020
17a5016
Merge remote-tracking branch 'upstream/master' into pr/12627
thomasjpfan Sep 22, 2020
bb07886
ENH Converts to get_output_names
thomasjpfan Sep 23, 2020
4e0968c
CLN Move deprecations
thomasjpfan Sep 23, 2020
95046a0
WIP Deprecates dictvect get_feature_names
thomasjpfan Sep 23, 2020
f7aa3fd
WIP Deprecates text get_feature_names
thomasjpfan Sep 23, 2020
f4a9882
WIP Deprecates polyfeature.get_feature_names
thomasjpfan Sep 23, 2020
fa4b318
WIP Deprecates one hot encoder get_feature_names
thomasjpfan Sep 23, 2020
640ad76
ENH Adds get_output_names to all transformers
thomasjpfan Sep 23, 2020
d9d2d95
ENH Add get_output_names everywhere
thomasjpfan Sep 23, 2020
f6075ca
STY Lint fixes
thomasjpfan Sep 23, 2020
922748f
Merge remote-tracking branch 'upstream/master' into get_output_names
thomasjpfan Sep 23, 2020
1af211c
TST Adds test for missing indicator
thomasjpfan Sep 23, 2020
9ab0cf9
REV Revert changes
thomasjpfan Sep 23, 2020
2926492
TST Fixes
thomasjpfan Sep 23, 2020
c1a1778
BUG Fixes missing indicator
thomasjpfan Sep 23, 2020
37101b0
TST Fixes test
thomasjpfan Sep 23, 2020
82d0a60
Merge remote-tracking branch 'upstream/master' into get_output_names
thomasjpfan Sep 28, 2020
8833b5b
Merge remote-tracking branch 'upstream/master' into get_output_names
thomasjpfan Sep 28, 2020
adcc1c1
TST Adds test filtering
thomasjpfan Sep 30, 2020
9a07816
CLN Change to get_feature_names_out
thomasjpfan Sep 30, 2020
0d3bc4e
CLN Reduces the number of diffs
thomasjpfan Sep 30, 2020
b922fa4
Merge remote-tracking branch 'upstream/master' into get_output_names
thomasjpfan Sep 30, 2020
21cbfe6
CLN Reduces the number of diffs
thomasjpfan Sep 30, 2020
86887ae
CLN Less diffs
thomasjpfan Sep 30, 2020
8ecb38f
CLN Refactor into _get_feature_names_out
thomasjpfan Sep 30, 2020
8b3c856
STY Lint fixes
thomasjpfan Sep 30, 2020
5260d7d
Merge remote-tracking branch 'upstream/master' into get_output_names
thomasjpfan Oct 1, 2020
cf1ec1e
CLN Remove example since get_names is not implemented everywhere
thomasjpfan Oct 1, 2020
a63cd14
ENH Adds feature_selection for the example
thomasjpfan Oct 1, 2020
dddb4a8
TST Fixes KBins
thomasjpfan Oct 2, 2020
a87866b
Merge remote-tracking branch 'upstream/master' into get_output_names
thomasjpfan Oct 5, 2020
6f35c0c
DOC Update glossary
thomasjpfan Oct 5, 2020
526db41
Merge remote-tracking branch 'upstream/main' into get_output_names
thomasjpfan Jun 30, 2021
f7c0062
STY Runs black
thomasjpfan Jun 30, 2021
9722c08
CLN Adjust diff
thomasjpfan Jun 30, 2021
ba3aca2
CLN Stricter capturing
thomasjpfan Jun 30, 2021
c78967a
DOC Adds whats new
thomasjpfan Jun 30, 2021
f022a1b
TST Fixes errosr
thomasjpfan Jun 30, 2021
f10da10
CLN Address comments
thomasjpfan Jul 1, 2021
d8bafb3
TST Increases test coverage
thomasjpfan Jul 1, 2021
8751296
DOC More docstrings
thomasjpfan Jul 1, 2021
be3f0b1
Merge remote-tracking branch 'upstream/main' into get_output_names
thomasjpfan Jul 9, 2021
84dc208
TST Fixes error message
thomasjpfan Jul 9, 2021
f41a40e
Merge remote-tracking branch 'upstream/main' into get_output_names
thomasjpfan Aug 17, 2021
149c4e3
CLN Improves test
thomasjpfan Aug 18, 2021
41d0bb1
Merge remote-tracking branch 'upstream/main' into get_output_names
thomasjpfan Aug 24, 2021
628a2b3
TST Fix exception type
thomasjpfan Aug 24, 2021
d178069
Merge main
ogrisel Aug 28, 2021
faae557
Fix remaining occurrence of _feature_names_in
ogrisel Aug 28, 2021
02a25be
cosmit
ogrisel Aug 28, 2021
20ecd70
Attempt to fix numpydoc failure
ogrisel Aug 28, 2021
c6bc0ce
DOC Use ndarray of string
thomasjpfan Aug 28, 2021
b76fd41
DOC Update doc to use string
thomasjpfan Aug 28, 2021
d60c4fa
DOC More docstring fixes
thomasjpfan Aug 28, 2021
4a00562
TST Adds failing test
thomasjpfan Aug 28, 2021
d5b72de
ENH Restrict to str and ndarrays
thomasjpfan Aug 28, 2021
a0b7446
ENH Convert ints to strs in dictvectorizer
thomasjpfan Aug 28, 2021
d8f84b3
ENH Uses feature_names_in_ in get_feature_names_out
thomasjpfan Aug 28, 2021
f1090df
TST Typo
thomasjpfan Aug 28, 2021
ae46466
TST Include transformers that define get_feature_names_out
thomasjpfan Aug 28, 2021
2e2bdd8
BUG Fixes test for all array outputs
thomasjpfan Aug 29, 2021
5575857
ENH Adds prefix_feature_names_out='when_colliding'
thomasjpfan Aug 29, 2021
3d1546b
CLN Cleaner code
thomasjpfan Aug 29, 2021
6e44a52
ENH Validates prefix_feature_names_out
thomasjpfan Aug 29, 2021
b07a3bc
ENH convert to ndarray for vectorizers
thomasjpfan Aug 29, 2021
fffabf0
ENH Less restrictive ndarray dtype
thomasjpfan Aug 29, 2021
ecec556
Merge remote-tracking branch 'upstream/main' into get_output_names
thomasjpfan Aug 31, 2021
5def4ce
ENH Adds prefix_feature_names_out as a bool
thomasjpfan Aug 31, 2021
1fda1c1
DOC Remove use of deprecated api
thomasjpfan Aug 31, 2021
9b8834b
DOC Update example with new api
thomasjpfan Aug 31, 2021
9034a53
ENH More consistent input_features checking
thomasjpfan Aug 31, 2021
9081ebd
WIP Better
thomasjpfan Aug 31, 2021
a4ce567
ENH Add prefix_features_names_out to make_column_transformer
thomasjpfan Aug 31, 2021
12a2052
ENH Use in one example
thomasjpfan Aug 31, 2021
aece402
REV Remove
thomasjpfan Aug 31, 2021
ea31c18
CLN Adjust name
thomasjpfan Aug 31, 2021
13d406b
DOC Adjust docstring
thomasjpfan Aug 31, 2021
d04ecec
CLN Remove unneeded code
thomasjpfan Aug 31, 2021
d3cc5b6
DOC Better docstring
thomasjpfan Aug 31, 2021
76be321
TST Fix
thomasjpfan Aug 31, 2021
caff15b
FIX test_docstring for deprecated get_feature_names
ogrisel Sep 1, 2021
dcd685f
Merge branch 'main' into get_output_names
ogrisel Sep 1, 2021
d930b1b
ENH Error when n_features_in_ is not defined
thomasjpfan Sep 1, 2021
b379cd3
DOC Update docstring
thomasjpfan Sep 1, 2021
83d12ec
CLN Address comments
thomasjpfan Sep 1, 2021
0841049
Merge remote-tracking branch 'upstream/main' into get_output_names
thomasjpfan Sep 2, 2021
8d0b3cf
Update sklearn/pipeline.py
lorentzenchr Sep 6, 2021
c35f7aa
Update doc/glossary.rst
lorentzenchr Sep 6, 2021
ec8b825
Update doc/glossary.rst
lorentzenchr Sep 6, 2021
560c0d0
ENH Adds one-to-one transformers
thomasjpfan Sep 6, 2021
043540b
Add one more test for one-to-one feature transformers with pandas
ogrisel Sep 7, 2021
12 changes: 12 additions & 0 deletions doc/glossary.rst
@@ -894,6 +894,7 @@ Class APIs and Estimator Types
* :term:`fit`
* :term:`transform`
* :term:`get_feature_names`
* :term:`get_feature_names_out`

meta-estimator
meta-estimators
@@ -1262,6 +1263,17 @@ Methods
to the names of input columns from which output column names can
be generated. By default input features are named x0, x1, ....

``get_feature_names_out``
Primarily for :term:`feature extractors`, but also used for other
transformers to provide string names for each column in the output of
the estimator's :term:`transform` method. It outputs an array of
strings and may take an array-like of strings as input, corresponding
to the names of input columns from which output column names can
be generated. If `input_features` is not passed in, then the
`feature_names_in_` attribute will be used. If the
`feature_names_in_` attribute is not defined, then the
input names are named `[x0, x1, ..., x(n_features_in_)]`.
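
A minimal sketch of this lookup order, assuming a scikit-learn build that includes this PR (the column names below are arbitrary):

>>> import pandas as pd
>>> from sklearn.preprocessing import StandardScaler
>>> X_df = pd.DataFrame({"age": [1.0, 2.0], "height": [3.0, 4.0]})
>>> StandardScaler().fit(X_df).get_feature_names_out()
array(['age', 'height'], ...)
>>> StandardScaler().fit(X_df.to_numpy()).get_feature_names_out()
array(['x0', 'x1'], ...)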

``get_n_splits``
On a :term:`CV splitter` (not an estimator), returns the number of
elements one would get if iterating through the return value of
38 changes: 29 additions & 9 deletions doc/modules/compose.rst
@@ -139,6 +139,27 @@ or by name::
>>> pipe['reduce_dim']
PCA()

To enable model inspection, :class:`~sklearn.pipeline.Pipeline` has a
``get_feature_names_out()`` method, just like all transformers. You can use
pipeline slicing to get the feature names going into each step::

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> iris = load_iris()
>>> pipe = Pipeline(steps=[
... ('select', SelectKBest(k=2)),
... ('clf', LogisticRegression())])
>>> pipe.fit(iris.data, iris.target)
Pipeline(steps=[('select', SelectKBest(...)), ('clf', LogisticRegression(...))])
>>> pipe[:-1].get_feature_names_out()
array(['x2', 'x3'], ...)

You can also provide custom feature names for the input data using
``get_feature_names_out``::

>>> pipe[:-1].get_feature_names_out(iris.feature_names)
array(['petal length (cm)', 'petal width (cm)'], ...)
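
If the pipeline is fitted on a pandas DataFrame, the stored `feature_names_in_` is used automatically, so the names do not have to be passed explicitly. A minimal sketch, assuming the same pipeline as above refitted on a DataFrame:

>>> import pandas as pd
>>> X_df = pd.DataFrame(iris.data, columns=iris.feature_names)
>>> pipe.fit(X_df, iris.target)
Pipeline(steps=[('select', SelectKBest(...)), ('clf', LogisticRegression(...))])
>>> pipe[:-1].get_feature_names_out()
array(['petal length (cm)', 'petal width (cm)'], ...)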

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_feature_selection_plot_feature_selection_pipeline.py`
@@ -426,21 +447,20 @@ By default, the remaining rating columns are ignored (``remainder='drop'``)::
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.preprocessing import OneHotEncoder
>>> column_trans = ColumnTransformer(
... [('city_category', OneHotEncoder(dtype='int'),['city']),
... [('categories', OneHotEncoder(dtype='int'), ['city']),
... ('title_bow', CountVectorizer(), 'title')],
... remainder='drop')
... remainder='drop', prefix_feature_names_out=False)

>>> column_trans.fit(X)
ColumnTransformer(transformers=[('city_category', OneHotEncoder(dtype='int'),
ColumnTransformer(prefix_feature_names_out=False,
transformers=[('categories', OneHotEncoder(dtype='int'),
['city']),
('title_bow', CountVectorizer(), 'title')])

>>> column_trans.get_feature_names()
['city_category__x0_London', 'city_category__x0_Paris', 'city_category__x0_Sallisaw',
'title_bow__bow', 'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
'title_bow__wrath']
>>> column_trans.get_feature_names_out()
array(['city_London', 'city_Paris', 'city_Sallisaw', 'bow', 'feast',
'grapes', 'his', 'how', 'last', 'learned', 'moveable', 'of', 'the',
'trick', 'watson', 'wrath'], ...)
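
For comparison, leaving ``prefix_feature_names_out`` at its default (``True``) prefixes each output name with the name of the transformer that produced it (a sketch; the exact strings below are illustrative):

>>> prefixed_trans = ColumnTransformer(
...     [('categories', OneHotEncoder(dtype='int'), ['city']),
...      ('title_bow', CountVectorizer(), 'title')],
...     remainder='drop')
>>> prefixed_trans.fit(X)
ColumnTransformer(transformers=[('categories', OneHotEncoder(dtype='int'),
                                 ['city']),
                                ('title_bow', CountVectorizer(), 'title')])
>>> prefixed_trans.get_feature_names_out()
array(['categories__city_London', 'categories__city_Paris',
       'categories__city_Sallisaw', 'title_bow__bow', ...,
       'title_bow__wrath'], ...)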

>>> column_trans.transform(X).toarray()
array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
44 changes: 20 additions & 24 deletions doc/modules/feature_extraction.rst
@@ -1,4 +1,4 @@
.. _feature_extraction:
.. _feature_extraction:

==================
Feature extraction
@@ -53,8 +53,8 @@ is a traditional numerical feature::
[ 0., 1., 0., 12.],
[ 0., 0., 1., 18.]])

>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
>>> vec.get_feature_names_out()
array(['city=Dubai', 'city=London', 'city=San Francisco', 'temperature'], ...)

:class:`DictVectorizer` accepts multiple string values for one
feature, like, e.g., multiple categories for a movie.
@@ -69,10 +69,9 @@ and its year of release.
array([[0.000e+00, 1.000e+00, 0.000e+00, 1.000e+00, 2.003e+03],
[1.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 2.011e+03],
[0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.974e+03]])
>>> vec.get_feature_names() == ['category=animation', 'category=drama',
... 'category=family', 'category=thriller',
... 'year']
True
>>> vec.get_feature_names_out()
array(['category=animation', 'category=drama', 'category=family',
'category=thriller', 'year'], ...)
>>> vec.transform({'category': ['thriller'],
... 'unseen_feature': '3'}).toarray()
array([[0., 0., 0., 1., 0.]])
@@ -111,8 +110,9 @@ suitable for feeding into a classifier (maybe after being piped into a
with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
>>> vec.get_feature_names_out()
array(['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat',
'word-2=the'], ...)

As you can imagine, if one extracts such a context around each individual
word of a corpus of documents the resulting matrix will be very wide
@@ -340,10 +340,9 @@ Each term found by the analyzer during the fit is assigned a unique
integer index corresponding to a column in the resulting matrix. This
interpretation of the columns can be retrieved as follows::

>>> vectorizer.get_feature_names() == (
... ['and', 'document', 'first', 'is', 'one',
... 'second', 'the', 'third', 'this'])
True
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the',
'third', 'this'], ...)

>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
@@ -406,8 +405,8 @@ however, similar words are useful for prediction, such as in classifying
writing style or personality.

There are several known issues in our provided 'english' stop word list. It
does not aim to be a general, 'one-size-fits-all' solution as some tasks
may require a more custom solution. See [NQY18]_ for more details.
does not aim to be a general, 'one-size-fits-all' solution as some tasks
may require a more custom solution. See [NQY18]_ for more details.

Please take care in choosing a stop word list.
Popular stop word lists may include words that are highly informative to
@@ -742,9 +741,8 @@ decide better::

>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
>>> counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
>>> ngram_vectorizer.get_feature_names() == (
... [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])
True
>>> ngram_vectorizer.get_feature_names_out()
array([' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'], ...)
>>> counts.toarray().astype(int)
array([[1, 1, 1, 0, 1, 1, 1, 0],
[1, 1, 0, 1, 1, 1, 0, 1]])
@@ -758,17 +756,15 @@ span across words::
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
<1x4 sparse matrix of type '<... 'numpy.int64'>'
with 4 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
... [' fox ', ' jump', 'jumpy', 'umpy '])
True
>>> ngram_vectorizer.get_feature_names_out()
array([' fox ', ' jump', 'jumpy', 'umpy '], ...)

>>> ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5))
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
<1x5 sparse matrix of type '<... 'numpy.int64'>'
with 5 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
... ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox'])
True
>>> ngram_vectorizer.get_feature_names_out()
array(['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox'], ...)

The word boundaries-aware variant ``char_wb`` is especially interesting
for languages that use white-spaces for word separation as it generates
7 changes: 7 additions & 0 deletions doc/whats_new/v1.0.rst
@@ -134,6 +134,9 @@ Changelog
- |API| `np.matrix` usage is deprecated in 1.0 and will raise a `TypeError` in
1.2. :pr:`20165` by `Thomas Fan`_.

- |API| :term:`get_feature_names_out` has been added to the transformer API
to get the names of the output features. :pr:`18444` by `Thomas Fan`_.

- |API| All estimators store `feature_names_in_` when fitted on pandas Dataframes.
These feature names are compared to names seen in `non-fit` methods,
`i.e.` `transform` and will raise a `FutureWarning` if they are not consistent.
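
For instance (a sketch of the behaviour described in this entry; the warning text is abbreviated):

>>> import pandas as pd
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(pd.DataFrame({"a": [1.0, 2.0]}))
>>> scaler.feature_names_in_
array(['a'], dtype=object)
>>> _ = scaler.transform(pd.DataFrame({"b": [1.0, 2.0]}))  # FutureWarning: feature names differ from those seen in fit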
@@ -221,6 +224,10 @@ Changelog
:mod:`sklearn.compose`
......................

- |API| Adds `prefix_feature_names_out` to :class:`compose.ColumnTransformer`.
This flag controls the prefixing of feature names out in
:term:`get_feature_names_out`. :pr:`18444` by `Thomas Fan`_.

- |Enhancement| :class:`compose.ColumnTransformer` now records the output
of each transformer in `output_indices_`. :pr:`18393` by
:user:`Luca Bittarello <lbittarello>`.
6 changes: 3 additions & 3 deletions examples/applications/plot_topics_extraction_with_nmf_lda.py
@@ -103,7 +103,7 @@ def plot_top_words(model, feature_names, n_top_words, title):
print("done in %0.3fs." % (time() - t0))


tfidf_feature_names = tfidf_vectorizer.get_feature_names()
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
plot_top_words(nmf, tfidf_feature_names, n_top_words,
'Topics in NMF model (Frobenius norm)')

@@ -117,7 +117,7 @@ def plot_top_words(model, feature_names, n_top_words, title):
l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

tfidf_feature_names = tfidf_vectorizer.get_feature_names()
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
plot_top_words(nmf, tfidf_feature_names, n_top_words,
'Topics in NMF model (generalized Kullback-Leibler divergence)')

@@ -132,5 +132,5 @@ def plot_top_words(model, feature_names, n_top_words, title):
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

tf_feature_names = tf_vectorizer.get_feature_names()
tf_feature_names = tf_vectorizer.get_feature_names_out()
plot_top_words(lda, tf_feature_names, n_top_words, 'Topics in LDA model')
2 changes: 1 addition & 1 deletion examples/bicluster/plot_bicluster_newsgroups.py
@@ -89,7 +89,7 @@ def build_tokenizer(self):
time() - start_time,
v_measure_score(y_kmeans, y_true)))

feature_names = vectorizer.get_feature_names()
feature_names = vectorizer.get_feature_names_out()
document_names = list(newsgroups.target_names[i] for i in newsgroups.target)


@@ -133,7 +133,9 @@
numerical_columns = ["EDUCATION", "EXPERIENCE", "AGE"]

preprocessor = make_column_transformer(
(OneHotEncoder(drop="if_binary"), categorical_columns), remainder="passthrough"
(OneHotEncoder(drop="if_binary"), categorical_columns),
remainder="passthrough",
prefix_feature_names_out=False,
)

# %%
@@ -199,13 +201,7 @@
#
# First of all, we can take a look to the values of the coefficients of the
# regressor we have fitted.

feature_names = (
model.named_steps["columntransformer"]
.named_transformers_["onehotencoder"]
.get_feature_names(input_features=categorical_columns)
)
feature_names = np.concatenate([feature_names, numerical_columns])
feature_names = model[:-1].get_feature_names_out()
Comment on lines -203 to +204 (PR author):
I think this is the primary use case for this feature.


coefs = pd.DataFrame(
model.named_steps["transformedtargetregressor"].regressor_.coef_,
2 changes: 1 addition & 1 deletion examples/inspection/plot_permutation_importance.py
@@ -120,7 +120,7 @@
# capacity).
ohe = (rf.named_steps['preprocess']
.named_transformers_['cat'])
feature_names = ohe.get_feature_names(input_features=categorical_columns)
feature_names = ohe.get_feature_names_out(categorical_columns)
feature_names = np.r_[feature_names, numerical_columns]

tree_feature_importances = (
10 changes: 3 additions & 7 deletions examples/text/plot_document_classification_20newsgroups.py
@@ -174,7 +174,7 @@ def size_mb(docs):
if opts.use_hashing:
feature_names = None
else:
feature_names = vectorizer.get_feature_names()
feature_names = vectorizer.get_feature_names_out()

if opts.select_chi2:
print("Extracting %d best features by a chi-squared test" %
@@ -183,16 +183,12 @@ def size_mb(docs):
ch2 = SelectKBest(chi2, k=opts.select_chi2)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)
if feature_names:
if feature_names is not None:
# keep selected feature names
feature_names = [feature_names[i] for i
in ch2.get_support(indices=True)]
feature_names = feature_names[ch2.get_support()]
print("done in %fs" % (time() - t0))
print()

if feature_names:
feature_names = np.asarray(feature_names)


def trim(s):
"""Trim string to fit on terminal (assuming 80-column display)"""
2 changes: 1 addition & 1 deletion examples/text/plot_document_clustering.py
@@ -217,7 +217,7 @@ def is_interactive():
else:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

terms = vectorizer.get_feature_names()
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
print("Cluster %d:" % i, end='')
for ind in order_centroids[i, :10]:
2 changes: 1 addition & 1 deletion examples/text/plot_hashing_vs_dict_vectorizer.py
@@ -89,7 +89,7 @@ def token_freqs(doc):
vectorizer.fit_transform(token_freqs(d) for d in raw_data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
print("Found %d unique terms" % len(vectorizer.get_feature_names()))
print("Found %d unique terms" % len(vectorizer.get_feature_names_out()))
print()

print("FeatureHasher on frequency dicts")
30 changes: 30 additions & 0 deletions sklearn/base.py
@@ -23,6 +23,7 @@
from .utils.validation import check_array
from .utils.validation import _check_y
from .utils.validation import _num_features
from .utils.validation import _check_feature_names_in
from .utils._estimator_html_repr import estimator_html_repr
from .utils.validation import _get_feature_names

@@ -846,6 +847,35 @@ def fit_transform(self, X, y=None, **fit_params):
return self.fit(X, y, **fit_params).transform(X)


class _OneToOneFeatureMixin:
"""Provides `get_feature_names_out` for simple transformers.

Assumes there's a 1-to-1 correspondence between input features
and output features.
"""

def get_feature_names_out(self, input_features=None):
"""Get output feature names for transformation.

Parameters
----------
input_features : array-like of str or None, default=None
Input features.

- If `input_features` is `None`, then `feature_names_in_` is
used as feature names in. If `feature_names_in_` is not defined,
then names are generated: `[x0, x1, ..., x(n_features_in_)]`.
- If `input_features` is an array-like, then `input_features` must
match `feature_names_in_` if `feature_names_in_` is defined.

Returns
-------
feature_names_out : ndarray of str objects
Same as input features.
"""
return _check_feature_names_in(self, input_features)


class DensityMixin:
"""Mixin class for all density estimators in scikit-learn."""

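
Usage note (illustrative, not part of this diff): a transformer whose output columns map one-to-one to its input columns only needs to inherit from the new `_OneToOneFeatureMixin` to gain `get_feature_names_out`. The `MeanCenterer` below is a hypothetical example written against the API added in this PR; listing the mixin first keeps its method ahead of the other base classes in the MRO.

# Hypothetical transformer, for illustration only.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, _OneToOneFeatureMixin


class MeanCenterer(_OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
    """Subtract the per-column mean; columns map one-to-one to the input."""

    def fit(self, X, y=None):
        # _validate_data sets n_features_in_ (and feature_names_in_ when X is a
        # DataFrame), which get_feature_names_out relies on.
        X = self._validate_data(X)
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        X = self._validate_data(X, reset=False)
        return X - self.mean_


centerer = MeanCenterer().fit(np.arange(6.0).reshape(3, 2))
print(centerer.get_feature_names_out())  # ['x0' 'x1'] when fitted on a plain ndarray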