
Allowing Feature Selection inside or before Column Transformer #2

Open
carvalhomb opened this issue Oct 12, 2020 · 4 comments

@carvalhomb

I came across another problem, possibly related to #1, but I'm not sure whether this is by design or not.

In a slightly different example from the one before, I created a toy dataframe with one column ("c") that contains only null values. I want this column to be dropped inside the ColumnTransformer pipeline before imputing, because SimpleImputer silently drops an all-NaN column, so in my opinion it is better to have a step that drops it explicitly. Here is the code:

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

import feature_importance as fi

data = {
        'a': [123, 145, 100, np.nan, np.nan, 150],
        'b': [10, np.nan, 30, np.nan, np.nan, 20],
        'c': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
        'd': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
        'e': [np.nan, 'GE', 'US', 'GE', np.nan, 'UK']
        }
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'])

numerical_features = ['a', 'b', 'c']
categorical_features = ['e']
drop_features = ['d']

# Deal with numerical columns
numerical_transformer = Pipeline(steps=[
    ('remnulls', VarianceThreshold(threshold=0.0)),  # col c has only NaNs and should be dropped
    ('imputer', SimpleImputer(strategy='mean')),
])

# Deal with categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Put together the column transformer
column_transformer = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),  # remove all-nan cols before imputing
        ('cat', categorical_transformer, categorical_features), # impute + one hot encoding
        ('dropme', 'drop', drop_features),  # drop column d
    ])

# Make a complete preprocessing pipeline
preproc = Pipeline(steps=[
    ('column_transformer', column_transformer),
])

# Fit the preprocessor
fitted_pp = preproc.fit(df)

# Transform the dataset
transf_df = fitted_pp.transform(df)
print('Shape old dataframe: {}'.format(str(df.shape)))
print(df)

# Use class FeatureImportance to get the names of the new features
feature_importance = fi.FeatureImportance(fitted_pp)
new_cols = feature_importance.get_selected_features()

print('Shape new dataframe: {}'.format(str(transf_df.shape)))
print(transf_df)
print('Cols in new df according to FeatureImportance: ({}) {}'.format(len(new_cols), new_cols))

returns:

Shape old dataframe: (6, 5)
       a     b   c        d    e
0  123.0  10.0 NaN  Michael  NaN
1  145.0   NaN NaN  Jessica   GE
2  100.0  30.0 NaN      Sue   US
3    NaN   NaN NaN     Jake   GE
4    NaN   NaN NaN      Amy  NaN
5  150.0  20.0 NaN      Tye   UK
Shape new dataframe: (6, 6)
[[123.   10.    0.    0.    0.    1. ]
 [145.   20.    1.    0.    0.    0. ]
 [100.   30.    0.    0.    1.    0. ]
 [129.5  20.    1.    0.    0.    0. ]
 [129.5  20.    0.    0.    0.    1. ]
 [150.   20.    0.    1.    0.    0. ]]
Cols in new df according to FeatureImportance: (7) ['a', 'b', 'c', 'e_GE', 'e_UK', 'e_US', 'e_missing']

So you can see that column c was dropped from the resulting dataframe, but it still appears in the list of feature names.

So my question is: is there a way to have a feature-selection step inside a pipeline within a ColumnTransformer, or at least as a step before the ColumnTransformer in the outer Pipeline, so that the imputer's silent dropping of all-NaN columns is avoided?
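For context, the kind of outer step I have in mind would look something like this (a minimal sketch using plain pandas, assuming the drop can happen before the ColumnTransformer sees the data; `drop_all_nan_columns` is a hypothetical helper, not part of this repo):

```python
import numpy as np
import pandas as pd

def drop_all_nan_columns(df):
    """Return a copy of df without columns that contain only NaN values."""
    return df.dropna(axis=1, how="all")

df = pd.DataFrame({
    "a": [123.0, 145.0, np.nan],
    "c": [np.nan, np.nan, np.nan],  # all-NaN column, like "c" above
    "e": [np.nan, "GE", "US"],
})

cleaned = drop_all_nan_columns(df)
print(list(cleaned.columns))  # ['a', 'e'] -- "c" is gone before any imputing
```

That way the column names fed into the ColumnTransformer already match what comes out the other side.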

Thanks!

@kylegilde
Owner

Hi Maira, this is a good suggestion, and it's something that this class ideally would support, but I don't think it would be an easy change to support this use case.

In my design, I was trying to avoid having to loop through the elements within a ColumnTransformer step because of the complications that could arise.

Something would have to change at this point:

https://github.com/kylegilde/Kaggle-Notebooks/blob/master/Extracting-and-Plotting-Scikit-Feature-Names-and-Importances/feature_importance.py#L99-L101

Let me know if you think of a method to handle this, and I will think about it as well.
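One direction that might work (a hypothetical sketch, not part of the current implementation): when a ColumnTransformer step is itself a Pipeline, walk its sub-steps and, wherever a step exposes `get_support()` (as the scikit-learn feature selectors do), apply that mask to the list of input column names:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline

def selected_columns(fitted_pipeline, input_cols):
    """Narrow a list of column names by each fitted selector's support mask.

    Hypothetical helper: any sub-step exposing get_support() filters the list.
    """
    cols = list(input_cols)
    for _, step in getattr(fitted_pipeline, "steps", []):
        if hasattr(step, "get_support"):
            mask = step.get_support()
            cols = [c for c, keep in zip(cols, mask) if keep]
    return cols

# Toy example: the second column is constant, so VarianceThreshold drops it.
pipe = Pipeline([("remnulls", VarianceThreshold(threshold=0.0))])
X = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
pipe.fit(X)
print(selected_columns(pipe, ["a", "b"]))  # ['a']
```

Something like this could run right before the names are collected for each ColumnTransformer step, so dropped columns never enter the final list.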

In the long run, I think that Scikit-Learn will develop some better ways of getting feature names out of pipelines & transformers.

@kylegilde
Owner

Here is some of the work being done: scikit-learn/enhancement_proposals#48

@carvalhomb
Author

Hi Kyle,
Great to know that the Scikit-Learn folks have this on their radar.

I did some work on this and got the class working for my current purposes, but since it was kind of a dirty hack, it certainly introduced other problems. I'll fork your repo and push my changes; then, if you want, you can have a look and see if it gives you any ideas. What do you say?

Cheers!

@carvalhomb
Author

Here's my first attempt. I'm sure it creates new problems, because I haven't tested it properly, but it's an idea :)
https://github.com/carvalhomb/Kaggle-Notebooks/blob/dev/Extracting-and-Plotting-Scikit-Feature-Names-and-Importances/feature_importance.py
