
how to know which feature is selected by FeatureUnion? #6122

Closed
genliu777 opened this issue Jan 6, 2016 · 6 comments

@genliu777

I ran the code from
http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py
which contains the following:

# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

With the data put into the FeatureUnion, I want to know which features were selected. The FeatureUnion docs list a function get_feature_names() which gets all the names from all the transformers, but calling it raises an error:

AttributeError: Transformer pca does not provide get_feature_names.

Actually, I know that PCA has no such function. But then why does FeatureUnion provide it?!

@jnothman
Member

jnothman commented Jan 6, 2016

I agree that there should be a way to see which features belong to which components, and I proposed this long ago, but I don't think it's currently possible.
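In the meantime, the individual transformers inside the union can be inspected directly. A minimal sketch, reusing the pca/selection setup from the linked example: the selector's get_support() reports which input columns it kept, while PCA simply emits n_components new features.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

pca = PCA(n_components=2)
selection = SelectKBest(k=1)
combined = FeatureUnion([("pca", pca), ("univ_select", selection)])
combined.fit(X, y)

# After fitting, transformer_list holds the fitted transformers,
# so each one can be queried with its own introspection API.
fitted_select = combined.transformer_list[1][1]
print(fitted_select.get_support(indices=True))  # indices of the input columns kept
print(combined.transform(X).shape)              # 2 PCA components + 1 selected column
```

This doesn't give a single unified name list from the union itself, but it does answer "which feature was selected" for any transformer that exposes a get_support-style API.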

@genliu777
Author

Why is it currently impossible?! FeatureUnion provides the function get_feature_names(), so it should also work!

Since all models in sklearn have fit and transform, every model that can be put into a FeatureUnion and work there should also provide the attribute that the source code of get_feature_names() checks for (if not hasattr(trans, 'get_feature_names'):). Otherwise, FeatureUnion should not provide get_feature_names() at all!!

@joshhamanngaia

joshhamanngaia commented Jul 7, 2017

This may not address your particular issue with PCA directly, but if I read your question correctly, you are wondering, in general, how to percolate attributes through a custom pipeline.

Late to the party, but you can access elements within the pipeline, regardless of how complicated it is, by walking through the pipeline structure, finding the appropriate step (even within a FeatureUnion), and then using the appropriate attribute. Here is an example I just ran:

pipeline = Pipeline([
    ('union', FeatureUnion([
        ('categoric', Pipeline([
            ('f_cat', feature_type_split(type='categoric')),  # returns categoric features in an array for vect
            ('vect', vect),
        ])),
        ('numeric', Pipeline([
            ('f_num', feature_type_split(type='numeric')),
        ])),
    ])),
    ('select', ff),
    ('tree_clf', clf),
])

Showing the pipeline object itself via print(pipeline) gives me a point of reference:

Pipeline(steps=[('union', FeatureUnion(n_jobs=1, transformer_list=[('categoric', Pipeline(steps=[('f_cat', feature_type_split(type='categoric')), ('vect', DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sort=True, sparse=True))])), ('numeric', Pipeline(steps=[('f_num', feature_type...it=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best'))])

So I walk through to the union step via:

pipeline.named_steps['union']

Then walk to the next level, the transformer_list entry for the categoric pipeline, via:

pipeline.named_steps['union'].transformer_list[0]

Then take the transformer from that (name, transformer) tuple, which is the categoric pipeline itself:

pipeline.named_steps['union'].transformer_list[0][1]

The above outputs a typical pipeline structure, where we can now utilize named_steps:

print(pipeline.named_steps['union'].transformer_list[0][1].named_steps['vect'])

And therefore access the attribute we need via:

print(pipeline.named_steps['union'].transformer_list[0][1].named_steps['vect'].get_feature_names())

TL;DR: Walk through the pipeline structure piece by piece, then access the attribute as you would normally for that transformer/estimator.

@jnothman
Member

jnothman commented Jul 8, 2017

Please try eli5's transform_feature_names which can work in cases where scikit-learn's get_feature_names doesn't.

@markatango

@joshhamanngaia Awesome. Thank you for not just showing what but also showing how and why.

@thomasjpfan
Member

On main, it is now possible to call get_feature_names_out with a FeatureUnion. In the context of the original issue, one can call get_feature_names_out to get the feature names:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

pca = PCA(n_components=2)
selection = SelectKBest(k=1)
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

svm = SVC(kernel="linear")
pipeline = Pipeline([("features", combined_features), ("svm", svm)])
pipeline.fit(X, y)

# slice the pipeline to include all steps excluding the last one
pipeline[:-1].get_feature_names_out()
# array(['pca__pca0', 'pca__pca1', 'univ_select__petal length (cm)'], dtype=object)

In 1.1, all transformers will define get_feature_names_out, allowing this feature to work everywhere.
