
API Implements get_feature_names_out for transformers that support get_feature_names #18444

Merged · 121 commits · Sep 7, 2021

Commits
ab2acbd
work on get_feature_names for pipeline
amueller Nov 20, 2018
3bc674b
fix SimpleImputer get_feature_names
amueller Nov 20, 2018
1c4a78f
use hasattr(transform) to check whether to use final estimator in get…
amueller Nov 20, 2018
7881930
add some docstrings
amueller Nov 20, 2018
de63353
fix docstring
amueller Nov 27, 2018
8835f3b
Merge branch 'master' into pipeline_get_feature_names
amueller Feb 27, 2019
2eba5de
fix merge issues with master
amueller May 30, 2019
449ed23
fix merge issue
amueller May 31, 2019
a1fcf67
Merge branch 'master' into pipeline_get_feature_names
amueller May 21, 2020
b929341
don't do magic slicing in pipeline.get_feature_names
amueller May 21, 2020
2b613e5
fix merge issue
amueller May 21, 2020
ad66b86
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
amueller May 24, 2020
5eb7603
trying to merge with input feature pr
amueller Jun 2, 2020
f4f832a
Merge branch 'master' into pipeline_get_feature_names
amueller Jun 2, 2020
3a9054c
remove tests taht don't apply
amueller Jun 2, 2020
9c4420d
Merge branch 'pipeline_get_feature_names' of github.com:amueller/scik…
amueller Jun 2, 2020
76f5b54
fix onetoone mixing feature names
amueller Jun 2, 2020
52f38e1
remove more tests
amueller Jun 2, 2020
cdda1fb
fix test for better expected outputs
amueller Jun 2, 2020
5f4abbc
fix priorities in catch-all get_feature_names
amueller Jun 2, 2020
4305a28
flake8
amueller Jun 2, 2020
c387b5b
remove redundant code
amueller Jun 2, 2020
2fefb67
fix error message
amueller Jun 2, 2020
a6832c3
fix mixin order
amueller Jun 2, 2020
0f45b22
small refactor with helper function
amueller Jun 2, 2020
4717a73
linting for new options
amueller Jun 3, 2020
a658ba7
add feature names to lineardiscriminantanalysis and birch
amueller Jun 3, 2020
e9e45af
add get_feature_names in a couple more places
amueller Jun 3, 2020
5acaced
fix up docs
amueller Jun 3, 2020
0353f69
make example actually work
amueller Jun 3, 2020
17a5016
Merge remote-tracking branch 'upstream/master' into pr/12627
thomasjpfan Sep 22, 2020
bb07886
ENH Converts to get_output_names
thomasjpfan Sep 23, 2020
4e0968c
CLN Move deprecations
thomasjpfan Sep 23, 2020
95046a0
WIP Deprecates dictvect get_feature_names
thomasjpfan Sep 23, 2020
f7aa3fd
WIP Deprecates text get_feature_names
thomasjpfan Sep 23, 2020
f4a9882
WIP Deprecates polyfeature.get_feature_names
thomasjpfan Sep 23, 2020
fa4b318
WIP Deprecates one hot encoder get_feature_names
thomasjpfan Sep 23, 2020
640ad76
ENH Adds get_output_names to all transformers
thomasjpfan Sep 23, 2020
d9d2d95
ENH Add get_output_names everywhere
thomasjpfan Sep 23, 2020
f6075ca
STY Lint fixes
thomasjpfan Sep 23, 2020
922748f
Merge remote-tracking branch 'upstream/master' into get_output_names
thomasjpfan Sep 23, 2020
1af211c
TST Adds test for missing indicator
thomasjpfan Sep 23, 2020
9ab0cf9
REV Revert changes
thomasjpfan Sep 23, 2020
2926492
TST Fixes
thomasjpfan Sep 23, 2020
c1a1778
BUG Fixes missing indicator
thomasjpfan Sep 23, 2020
37101b0
TST Fixes test
thomasjpfan Sep 23, 2020
82d0a60
Merge remote-tracking branch 'upstream/master' into get_output_names
thomasjpfan Sep 28, 2020
8833b5b
Merge remote-tracking branch 'upstream/master' into get_output_names
thomasjpfan Sep 28, 2020
adcc1c1
TST Adds test filtering
thomasjpfan Sep 30, 2020
9a07816
CLN Change to get_feature_names_out
thomasjpfan Sep 30, 2020
0d3bc4e
CLN Reduces the number of diffs
thomasjpfan Sep 30, 2020
b922fa4
Merge remote-tracking branch 'upstream/master' into get_output_names
thomasjpfan Sep 30, 2020
21cbfe6
CLN Reduces the number of diffs
thomasjpfan Sep 30, 2020
86887ae
CLN Less diffs
thomasjpfan Sep 30, 2020
8ecb38f
CLN Refactor into _get_feature_names_out
thomasjpfan Sep 30, 2020
8b3c856
STY Lint fixes
thomasjpfan Sep 30, 2020
5260d7d
Merge remote-tracking branch 'upstream/master' into get_output_names
thomasjpfan Oct 1, 2020
cf1ec1e
CLN Remove example since get_names is not implemented everywhere
thomasjpfan Oct 1, 2020
a63cd14
ENH Adds feature_selection for the example
thomasjpfan Oct 1, 2020
dddb4a8
TST Fixes KBins
thomasjpfan Oct 2, 2020
a87866b
Merge remote-tracking branch 'upstream/master' into get_output_names
thomasjpfan Oct 5, 2020
6f35c0c
DOC Update glossary
thomasjpfan Oct 5, 2020
526db41
Merge remote-tracking branch 'upstream/main' into get_output_names
thomasjpfan Jun 30, 2021
f7c0062
STY Runs black
thomasjpfan Jun 30, 2021
9722c08
CLN Adjust diff
thomasjpfan Jun 30, 2021
ba3aca2
CLN Stricter capturing
thomasjpfan Jun 30, 2021
c78967a
DOC Adds whats new
thomasjpfan Jun 30, 2021
f022a1b
TST Fixes errosr
thomasjpfan Jun 30, 2021
f10da10
CLN Address comments
thomasjpfan Jul 1, 2021
d8bafb3
TST Increases test coverage
thomasjpfan Jul 1, 2021
8751296
DOC More docstrings
thomasjpfan Jul 1, 2021
be3f0b1
Merge remote-tracking branch 'upstream/main' into get_output_names
thomasjpfan Jul 9, 2021
84dc208
TST Fixes error message
thomasjpfan Jul 9, 2021
f41a40e
Merge remote-tracking branch 'upstream/main' into get_output_names
thomasjpfan Aug 17, 2021
149c4e3
CLN Improves test
thomasjpfan Aug 18, 2021
41d0bb1
Merge remote-tracking branch 'upstream/main' into get_output_names
thomasjpfan Aug 24, 2021
628a2b3
TST Fix exception type
thomasjpfan Aug 24, 2021
d178069
Merge main
ogrisel Aug 28, 2021
faae557
Fix remaining occurrence of _feature_names_in
ogrisel Aug 28, 2021
02a25be
cosmit
ogrisel Aug 28, 2021
20ecd70
Attempt to fix numpydoc failure
ogrisel Aug 28, 2021
c6bc0ce
DOC Use ndarray of string
thomasjpfan Aug 28, 2021
b76fd41
DOC Update doc to use string
thomasjpfan Aug 28, 2021
d60c4fa
DOC More docstring fixes
thomasjpfan Aug 28, 2021
4a00562
TST Adds failing test
thomasjpfan Aug 28, 2021
d5b72de
ENH Restrict to str and ndarrays
thomasjpfan Aug 28, 2021
a0b7446
ENH Convert ints to strs in dictvectorizer
thomasjpfan Aug 28, 2021
d8f84b3
ENH Uses feature_names_in_ in get_feature_names_out
thomasjpfan Aug 28, 2021
f1090df
TST Typo
thomasjpfan Aug 28, 2021
ae46466
TST Include transformers that define get_feature_names_out
thomasjpfan Aug 28, 2021
2e2bdd8
BUG Fixes test for all array outputs
thomasjpfan Aug 29, 2021
5575857
ENH Adds prefix_feature_names_out='when_colliding'
thomasjpfan Aug 29, 2021
3d1546b
CLN Cleaner code
thomasjpfan Aug 29, 2021
6e44a52
ENH Validates prefix_feature_names_out
thomasjpfan Aug 29, 2021
b07a3bc
ENH convert to ndarray for vectorizers
thomasjpfan Aug 29, 2021
fffabf0
ENH Less restrictive ndarray dtype
thomasjpfan Aug 29, 2021
ecec556
Merge remote-tracking branch 'upstream/main' into get_output_names
thomasjpfan Aug 31, 2021
5def4ce
ENH Adds prefix_feature_names_out as a bool
thomasjpfan Aug 31, 2021
1fda1c1
DOC Remove use of deprecated api
thomasjpfan Aug 31, 2021
9b8834b
DOC Update example with new api
thomasjpfan Aug 31, 2021
9034a53
ENH More consistent input_features checking
thomasjpfan Aug 31, 2021
9081ebd
WIP Better
thomasjpfan Aug 31, 2021
a4ce567
ENH Add prefix_features_names_out to make_column_transformer
thomasjpfan Aug 31, 2021
12a2052
ENH Use in one example
thomasjpfan Aug 31, 2021
aece402
REV Remove
thomasjpfan Aug 31, 2021
ea31c18
CLN Adjust name
thomasjpfan Aug 31, 2021
13d406b
DOC Adjust docstring
thomasjpfan Aug 31, 2021
d04ecec
CLN Remove unneeded code
thomasjpfan Aug 31, 2021
d3cc5b6
DOC Better docstring
thomasjpfan Aug 31, 2021
76be321
TST Fix
thomasjpfan Aug 31, 2021
caff15b
FIX test_docstring for deprecated get_feature_names
ogrisel Sep 1, 2021
dcd685f
Merge branch 'main' into get_output_names
ogrisel Sep 1, 2021
d930b1b
ENH Error when n_features_in_ is not defined
thomasjpfan Sep 1, 2021
b379cd3
DOC Update docstring
thomasjpfan Sep 1, 2021
83d12ec
CLN Address comments
thomasjpfan Sep 1, 2021
0841049
Merge remote-tracking branch 'upstream/main' into get_output_names
thomasjpfan Sep 2, 2021
8d0b3cf
Update sklearn/pipeline.py
lorentzenchr Sep 6, 2021
c35f7aa
Update doc/glossary.rst
lorentzenchr Sep 6, 2021
ec8b825
Update doc/glossary.rst
lorentzenchr Sep 6, 2021
560c0d0
ENH Adds one-to-one transformers
thomasjpfan Sep 6, 2021
043540b
Add one more test for one-to-one feature transformers with pandas
ogrisel Sep 7, 2021
9 changes: 9 additions & 0 deletions doc/glossary.rst
@@ -868,6 +868,7 @@ Class APIs and Estimator Types
* :term:`fit`
* :term:`transform`
* :term:`get_feature_names`
* :term:`get_output_names`

meta-estimator
meta-estimators
@@ -1236,6 +1237,14 @@ Methods
to the names of input columns from which output column names can
be generated. By default input features are named x0, x1, ....

``get_output_names``
Primarily for :term:`feature extractors`, but also used for other
transformers to provide string names for each column in the output of
the estimator's :term:`transform` method. It outputs a list of
strings and may take a list of strings as input, corresponding
to the names of input columns from which output column names can
be generated. By default input features are named x0, x1, ....

``get_n_splits``
On a :term:`CV splitter` (not an estimator), returns the number of
elements one would get if iterating through the return value of
37 changes: 29 additions & 8 deletions doc/modules/compose.rst
@@ -139,6 +139,27 @@ or by name::
>>> pipe['reduce_dim']
PCA()

To enable model inspection, `Pipeline` has a ``get_output_names()`` method,
just like all transformers. You can use pipeline slicing to get the feature names
going into each step::

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> iris = load_iris()
>>> pipe = Pipeline(steps=[
... ('select', SelectKBest(k=2)),
... ('clf', LogisticRegression())])
>>> pipe.fit(iris.data, iris.target)
Pipeline(steps=[('select', SelectKBest(...)), ('clf', LogisticRegression(...))])
>>> pipe[:-1].get_output_names()
array(['x2', 'x3'], dtype='<U2')

You can also provide custom feature names for a more human-readable format using
``get_output_names``::

>>> pipe[:-1].get_output_names(iris.feature_names)
array(['petal length (cm)', 'petal width (cm)'], dtype='<U17')

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_feature_selection_plot_feature_selection_pipeline.py`
@@ -426,21 +447,21 @@ By default, the remaining rating columns are ignored (``remainder='drop'``)::
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.preprocessing import OneHotEncoder
>>> column_trans = ColumnTransformer(
... [('city_category', OneHotEncoder(dtype='int'),['city']),
... [('categories', OneHotEncoder(dtype='int'),['city']),
Member:

nitpick:

Suggested change
... [('categories', OneHotEncoder(dtype='int'),['city']),
... [('categories', OneHotEncoder(dtype='int'), ['city']),

Also I think I would like "categorical" best instead of "categories". But not strong opinion.

Member:

Or could we have an option in ColumnTransformer to not prefix the output feature names when there are no colliding names?

I think 99% of the time those prefixes would add unnecessary verbosity. The default could be to only use the prefix for colliding feature names with an option

The extra parameter could be feature_name_prefix taking values in {"only_when_colliding", "always"} and "only_when_colliding" would be the default.

WDYT?

Member (author):

I agree having no prefixes would work for most use cases and would be okay with the extra parameter.

We would need to update SLEP007 about prefixes in ColumnTransformer.

Member:

It says

ColumnTransformer by default adds a prefix to the output feature names, indicating the name of the transformer applied to them.

So we could add it without changing the SLEP, but if we want when_colliding as the default, then I guess we have to change the SLEP. I think we discussed this at some point. @adrinjalali @jnothman might remember?

Member:

Honestly I don't mind the way it is right now, it's such a huge improvement. Are you concerned that changing this in the future will be an incompatible change? Feature names are probably tricky to change, but we could do a deprecation cycle for when_colliding

Member:

I would rather do the change now if we all agree it is a usability (readability) improvement. Having to deal with backward compat to change that later would be a mess.

... ('title_bow', CountVectorizer(), 'title')],
... remainder='drop')

>>> column_trans.fit(X)
ColumnTransformer(transformers=[('city_category', OneHotEncoder(dtype='int'),
ColumnTransformer(transformers=[('categories', OneHotEncoder(dtype='int'),
['city']),
('title_bow', CountVectorizer(), 'title')])

>>> column_trans.get_feature_names()
['city_category__x0_London', 'city_category__x0_Paris', 'city_category__x0_Sallisaw',
'title_bow__bow', 'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
'title_bow__wrath']
>>> column_trans.get_output_names()
['categories__city_London', 'categories__city_Paris',
'categories__city_Sallisaw', 'title_bow__bow', 'title_bow__feast',
'title_bow__grapes', 'title_bow__his', 'title_bow__how', 'title_bow__last',
'title_bow__learned', 'title_bow__moveable', 'title_bow__of', 'title_bow__the',
'title_bow__trick', 'title_bow__watson', 'title_bow__wrath']
Member:

I changed a line above for a cosmit and as a result it hides an important discussion that was happening here. Sorry for that.

The discussion was about: shall we add a new constructor parameter to skip the verbose categories__ prefix in this doctest when there is no risk of colliding feature names? It could be named:

  • prefix_feature_names="when_colliding" (default)
  • prefix_feature_names="always"

If we do so, we need to amend the following paragraph of SLEP7:

ColumnTransformer by default adds a prefix to the output feature names, indicating the name of the transformer applied to them. If a column is in the output as a part of passthrough, it won't be prefixed since no operation has been applied on it.

I am in favor of making the change now to avoid problems with backward compat if we want to do this change later.

Member:

+1 on adding it now, I'm not entirely sure about the default +0.5 on that, I guess? In particular it means that adding transformers can change the names of existing features. The typical use-case where you wouldn't want prefixes is where you have separate feature types like categorical, continuous and bow, like here?
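The two prefixing policies debated above can be sketched in a few lines. This is a hypothetical illustration of the proposed behavior, not the final scikit-learn API: the function name `combine_feature_names` and the `prefix` values `"always"` / `"when_colliding"` are placeholders taken from the discussion.

```python
def combine_feature_names(per_transformer, prefix="always"):
    """Combine output names from several (transformer_name, feature_names) pairs.

    prefix="always" prefixes every name with "<transformer>__";
    prefix="when_colliding" prefixes only names emitted by more than
    one transformer, as suggested in the review thread.
    """
    if prefix == "always":
        return [f"{t}__{f}" for t, feats in per_transformer for f in feats]
    # Count how many transformers emit each raw feature name.
    counts = {}
    for _, feats in per_transformer:
        for f in feats:
            counts[f] = counts.get(f, 0) + 1
    return [
        f"{t}__{f}" if counts[f] > 1 else f
        for t, feats in per_transformer
        for f in feats
    ]


parts = [("categories", ["city_London", "city_Paris"]),
         ("title_bow", ["bow", "feast"])]
print(combine_feature_names(parts, prefix="always"))
# -> ['categories__city_London', 'categories__city_Paris',
#     'title_bow__bow', 'title_bow__feast']
print(combine_feature_names(parts, prefix="when_colliding"))
# -> ['city_London', 'city_Paris', 'bow', 'feast']
```

With `when_colliding`, adding a second transformer that outputs an already-used name (e.g. two transformers both producing `"age"`) would flip those names to the prefixed form, which is exactly the stability concern raised in the comment above.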


>>> column_trans.transform(X).toarray()
array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
18 changes: 9 additions & 9 deletions doc/modules/feature_extraction.rst
@@ -53,7 +53,7 @@ is a traditional numerical feature::
[ 0., 1., 0., 12.],
[ 0., 0., 1., 18.]])

>>> vec.get_feature_names()
>>> vec.get_output_names()
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']

:class:`DictVectorizer` accepts multiple string values for one
@@ -69,7 +69,7 @@ and its year of release.
array([[0.000e+00, 1.000e+00, 0.000e+00, 1.000e+00, 2.003e+03],
[1.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 2.011e+03],
[0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.974e+03]])
>>> vec.get_feature_names() == ['category=animation', 'category=drama',
>>> vec.get_output_names() == ['category=animation', 'category=drama',
... 'category=family', 'category=thriller',
... 'year']
True
@@ -111,7 +111,7 @@ suitable for feeding into a classifier (maybe after being piped into a
with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names()
>>> vec.get_output_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']

As you can imagine, if one extracts such a context around each individual
@@ -340,7 +340,7 @@ Each term found by the analyzer during the fit is assigned a unique
integer index corresponding to a column in the resulting matrix. This
interpretation of the columns can be retrieved as follows::

>>> vectorizer.get_feature_names() == (
>>> vectorizer.get_output_names() == (
... ['and', 'document', 'first', 'is', 'one',
... 'second', 'the', 'third', 'this'])
True
@@ -406,8 +406,8 @@ however, similar words are useful for prediction, such as in classifying
writing style or personality.

There are several known issues in our provided 'english' stop word list. It
does not aim to be a general, 'one-size-fits-all' solution as some tasks
may require a more custom solution. See [NQY18]_ for more details.
does not aim to be a general, 'one-size-fits-all' solution as some tasks
may require a more custom solution. See [NQY18]_ for more details.

Please take care in choosing a stop word list.
Popular stop word lists may include words that are highly informative to
@@ -742,7 +742,7 @@ decide better::

>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
>>> counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
>>> ngram_vectorizer.get_feature_names() == (
>>> ngram_vectorizer.get_output_names() == (
... [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])
True
>>> counts.toarray().astype(int)
@@ -758,15 +758,15 @@ span across words::
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
<1x4 sparse matrix of type '<... 'numpy.int64'>'
with 4 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
>>> ngram_vectorizer.get_output_names() == (
... [' fox ', ' jump', 'jumpy', 'umpy '])
True

>>> ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5))
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
<1x5 sparse matrix of type '<... 'numpy.int64'>'
with 5 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
>>> ngram_vectorizer.get_output_names() == (
... ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox'])
True

6 changes: 3 additions & 3 deletions examples/applications/plot_topics_extraction_with_nmf_lda.py
@@ -103,7 +103,7 @@ def plot_top_words(model, feature_names, n_top_words, title):
print("done in %0.3fs." % (time() - t0))


tfidf_feature_names = tfidf_vectorizer.get_feature_names()
tfidf_feature_names = tfidf_vectorizer.get_output_names()
plot_top_words(nmf, tfidf_feature_names, n_top_words,
'Topics in NMF model (Frobenius norm)')

@@ -117,7 +117,7 @@ def plot_top_words(model, feature_names, n_top_words, title):
l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

tfidf_feature_names = tfidf_vectorizer.get_feature_names()
tfidf_feature_names = tfidf_vectorizer.get_output_names()
plot_top_words(nmf, tfidf_feature_names, n_top_words,
'Topics in NMF model (generalized Kullback-Leibler divergence)')

@@ -132,5 +132,5 @@ def plot_top_words(model, feature_names, n_top_words, title):
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

tf_feature_names = tf_vectorizer.get_feature_names()
tf_feature_names = tf_vectorizer.get_output_names()
plot_top_words(lda, tf_feature_names, n_top_words, 'Topics in LDA model')
2 changes: 1 addition & 1 deletion examples/bicluster/plot_bicluster_newsgroups.py
@@ -89,7 +89,7 @@ def build_tokenizer(self):
time() - start_time,
v_measure_score(y_kmeans, y_true)))

feature_names = vectorizer.get_feature_names()
feature_names = vectorizer.get_output_names()
document_names = list(newsgroups.target_names[i] for i in newsgroups.target)


44 changes: 44 additions & 0 deletions examples/compose/plot_column_transformer_mixed_types.py
@@ -147,6 +147,50 @@
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))


# %%
# Inspecting the coefficient values of the classifier
###############################################################################
# The coefficients of the final classification step of the pipeline give an
# idea of how each feature impacts the likelihood of survival assuming that the
# usual linear model assumptions hold (uncorrelated features, linear
# separability, homoscedastic errors...) which we do not verify in this
# example.
#
# To get error bars we perform cross-validation and compute the mean and
# standard deviation for each coefficient across CV splits. Because we use a
# standard scaler on the numerical features, the coefficient weights give us
# an idea on how much the log odds of surviving are impacted by a change in
# this dimension contrasted to the mean. Note that the categorical features
# here are overspecified which makes it slightly harder to interpret because of
# the information redundancy.
#
# We can see that the linear model coefficients are in agreement with the
# historical reports: people in higher classes and therefore in the upper decks
# were the first to reach the lifeboats, and often, priority was given to women
# and children.
#
# Note that conditioned on the "pclass_x" one-hot features, the "fare"
# numerical feature does not seem to be significantly predictive. If we drop
# the "pclass" feature, then higher "fare" values would appear significantly
# correlated with a higher likelihood of survival as the "fare" and "pclass"
# features have a strong statistical dependency.

import matplotlib.pyplot as plt
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedShuffleSplit

cv = StratifiedShuffleSplit(n_splits=20, test_size=0.25, random_state=42)
cv_results = cross_validate(clf, X_train, y_train, cv=cv,
return_estimator=True)
cv_coefs = np.concatenate([cv_pipeline[-1].coef_
for cv_pipeline in cv_results["estimator"]])
fig, ax = plt.subplots()
ax.barh(clf[:-1].get_output_names(),
cv_coefs.mean(axis=0), xerr=cv_coefs.std(axis=0))
plt.tight_layout()
plt.show()

# %%
# The resulting score is not exactly the same as the one from the previous
# pipeline because the dtype-based selector treats the ``pclass`` columns as
9 changes: 6 additions & 3 deletions examples/feature_selection/plot_feature_selection_pipeline.py
@@ -9,6 +9,7 @@
Using a sub-pipeline, the fitted coefficients can be mapped back into
the original feature space.
"""
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
@@ -20,7 +20,7 @@

# import some data to play with
X, y = make_classification(
n_features=20, n_informative=3, n_redundant=0, n_classes=4,
n_features=20, n_informative=3, n_redundant=0, n_classes=2,
n_clusters_per_class=2)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
@@ -36,5 +37,7 @@
y_pred = anova_svm.predict(X_test)
print(classification_report(y_test, y_pred))

coef = anova_svm[:-1].inverse_transform(anova_svm['linearsvc'].coef_)
print(coef)
# access and plot the coefficients of the fitted model
plt.barh((0, 1, 2), anova_svm[-1].coef_.ravel())
plt.yticks((0, 1, 2), anova_svm[:-1].get_output_names())
plt.show()
@@ -208,7 +208,7 @@

feature_names = (model.named_steps['columntransformer']
.named_transformers_['onehotencoder']
.get_feature_names(input_features=categorical_columns))
.get_output_names(input_features=categorical_columns))
feature_names = np.concatenate(
[feature_names, numerical_columns])

2 changes: 1 addition & 1 deletion examples/inspection/plot_permutation_importance.py
@@ -124,7 +124,7 @@
ohe = (rf.named_steps['preprocess']
.named_transformers_['cat']
.named_steps['onehot'])
feature_names = ohe.get_feature_names(input_features=categorical_columns)
feature_names = ohe.get_output_names(input_features=categorical_columns)
feature_names = np.r_[feature_names, numerical_columns]

tree_feature_importances = (
2 changes: 1 addition & 1 deletion examples/text/plot_document_classification_20newsgroups.py
@@ -174,7 +174,7 @@ def size_mb(docs):
if opts.use_hashing:
feature_names = None
else:
feature_names = vectorizer.get_feature_names()
feature_names = vectorizer.get_output_names()

if opts.select_chi2:
print("Extracting %d best features by a chi-squared test" %
2 changes: 1 addition & 1 deletion examples/text/plot_document_clustering.py
@@ -217,7 +217,7 @@ def is_interactive():
else:
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

terms = vectorizer.get_feature_names()
terms = vectorizer.get_output_names()
for i in range(true_k):
print("Cluster %d:" % i, end='')
for ind in order_centroids[i, :10]:
2 changes: 1 addition & 1 deletion examples/text/plot_hashing_vs_dict_vectorizer.py
@@ -89,7 +89,7 @@ def token_freqs(doc):
vectorizer.fit_transform(token_freqs(d) for d in raw_data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
print("Found %d unique terms" % len(vectorizer.get_feature_names()))
print("Found %d unique terms" % len(vectorizer.get_output_names()))
print()

print("FeatureHasher on frequency dicts")
29 changes: 29 additions & 0 deletions sklearn/base.py
@@ -17,6 +17,7 @@
from .utils import _IS_32BIT
from .utils.validation import check_X_y
from .utils.validation import check_array
from .utils._feature_names import _make_feature_names
from .utils._estimator_html_repr import estimator_html_repr
from .utils.validation import _deprecate_positional_args

@@ -747,6 +748,34 @@ def fit_predict(self, X, y=None):
return self.fit(X).predict(X)


class OneToOneMixin:
"""Provides get_feature_names for simple transformers

Assumes there's a 1-to-1 correspondence between input features
and output features.
"""

def get_output_names(self, input_features=None):
"""Get output feature names for transformation.

Returns input_features as this transformation
doesn't add or drop features.

Parameters
----------
input_features : array-like of str or None, default=None
Input features. If None, they are generated as
x0, x1, ..., xn_features.
Member:

Suggested change
x0, x1, ..., xn_features.
`[x0, x1, ..., xn_features]`.

Member:

Either always with or always without those ticks.


Returns
-------
feature_names : array-like of str
Transformed feature names.
"""
return _make_feature_names(self.n_features_in_,
input_features=input_features)
Member (author):

This is an array-like because when input_features is not None then this returns input_features without any processing.
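The `_make_feature_names` helper called above is private and its exact signature is not shown in this diff; the following is a minimal sketch of plausible behavior, inferred from its call sites (the `OneToOneMixin` and agglomerative-clustering hunks) and from the documented `x0, x1, ...` default. All names here are illustrative assumptions.

```python
def make_feature_names(n_features, input_features=None, prefix="x"):
    """Sketch of a feature-name helper.

    If input_features is given, it is returned untouched (hence the
    array-like return type discussed above); otherwise names are
    generated as "<prefix>0" ... "<prefix>{n_features - 1}".
    """
    if input_features is not None:
        return input_features  # passed through without any processing
    return [f"{prefix}{i}" for i in range(n_features)]


print(make_feature_names(3))                     # -> ['x0', 'x1', 'x2']
print(make_feature_names(2, ["age", "fare"]))    # -> ['age', 'fare']
```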



class MetaEstimatorMixin:
_required_parameters = ["estimator"]
"""Mixin class for all meta estimators in scikit-learn."""
17 changes: 17 additions & 0 deletions sklearn/cluster/_agglomerative.py
@@ -20,6 +20,7 @@
from ..neighbors._dist_metrics import METRIC_MAPPING
from ..utils import check_array
from ..utils._fast_dict import IntFloatDict
from ..utils._feature_names import _make_feature_names
from ..utils.fixes import _astype_copy_false
from ..utils.validation import _deprecate_positional_args, check_memory
# mypy error: Module 'sklearn.cluster' has no attribute '_hierarchical_fast'
@@ -945,6 +946,22 @@ def fit_predict(self, X, y=None):
"""
return super().fit_predict(X, y)

def get_output_names(self, input_features=None):
"""Get output feature names.

Parameters
----------
input_features : array-like of str or None, default=None
Not used, present here for API consistency by convention.

Returns
-------
output_feature_names : list of str
Feature names for transformer output.
"""
return _make_feature_names(n_features=self.n_clusters,
prefix=type(self).__name__.lower())


class FeatureAgglomeration(AgglomerativeClustering, AgglomerationTransform):
"""Agglomerate features.
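The clustering hunk above passes `prefix=type(self).__name__.lower()` to the helper, so output columns presumably come out as `<classname>0 ... <classname>{n_clusters-1}` (assuming the helper follows the same `x0, x1, ...` pattern as its default). A self-contained stand-in class, hypothetical and independent of scikit-learn, makes the scheme concrete:

```python
class FakeClusterer:
    """Stand-in mimicking the get_output_names method added above."""

    def __init__(self, n_clusters=3):
        self.n_clusters = n_clusters

    def get_output_names(self, input_features=None):
        # input_features is accepted only for API consistency; the
        # output names depend solely on the class name and n_clusters.
        prefix = type(self).__name__.lower()
        return [f"{prefix}{i}" for i in range(self.n_clusters)]


print(FakeClusterer(2).get_output_names())
# -> ['fakeclusterer0', 'fakeclusterer1']
```

Deriving the prefix from `type(self).__name__.lower()` means subclasses automatically get distinct output names without overriding the method.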