
Allowing Feature Selection inside or before Column Transformer #2

Open
carvalhomb opened this issue Oct 12, 2020 · 4 comments

@carvalhomb

I came across another problem, possibly related to #1, but I'm not sure whether this is by design or not.

In a slightly different example from the one before, I created a toy dataframe with one column ("c") that contains only null values. I want this column to be dropped inside the ColumnTransformer pipeline before imputing, because SimpleImputer silently drops an all-NaN column, so in my opinion it is better to have a step that drops it explicitly. Here is the code:

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

import feature_importance as fi

data = {
        'a': [123, 145, 100, np.nan, np.nan, 150],
        'b': [10, np.nan, 30, np.nan, np.nan, 20],
        'c': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
        'd': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
        'e': [np.nan, 'GE', 'US', 'GE', np.nan, 'UK']
        }
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'])

numerical_features = ['a', 'b', 'c']
categorical_features = ['e']
drop_features = ['d']

# Deal with numerical columns
numerical_transformer = Pipeline(steps=[
    ('remnulls', VarianceThreshold(threshold=0.0)),  # col c has only NaNs and should be dropped
    ('imputer', SimpleImputer(strategy='mean')),
])

# Deal with categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Put together the column transformer
column_transformer = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),  # remove all-nan cols before imputing
        ('cat', categorical_transformer, categorical_features), # impute + one hot encoding
        ('dropme', 'drop', drop_features),  # drop column d
    ])

# Make a complete preprocessing pipeline
preproc = Pipeline(steps=[
    ('column_transformer', column_transformer),
])

# Fit the preprocessor
fitted_pp = preproc.fit(df)

# Transform the dataset
transf_df = fitted_pp.transform(df)
print('Shape old dataframe: {}'.format(str(df.shape)))
print(df)

# Use class FeatureImportance to get the names of the new features
feature_importance = fi.FeatureImportance(fitted_pp)
new_cols = feature_importance.get_selected_features()

print('Shape new dataframe: {}'.format(str(transf_df.shape)))
print(transf_df)
print('Cols in new df according to FeatureImportance: ({}) {}'.format(len(new_cols), new_cols))

returns:

Shape old dataframe: (6, 5)
       a     b   c        d    e
0  123.0  10.0 NaN  Michael  NaN
1  145.0   NaN NaN  Jessica   GE
2  100.0  30.0 NaN      Sue   US
3    NaN   NaN NaN     Jake   GE
4    NaN   NaN NaN      Amy  NaN
5  150.0  20.0 NaN      Tye   UK
Shape new dataframe: (6, 6)
[[123.   10.    0.    0.    0.    1. ]
 [145.   20.    1.    0.    0.    0. ]
 [100.   30.    0.    0.    1.    0. ]
 [129.5  20.    1.    0.    0.    0. ]
 [129.5  20.    0.    0.    0.    1. ]
 [150.   20.    0.    1.    0.    0. ]]
Cols in new df according to FeatureImportance: (7) ['a', 'b', 'c', 'e_GE', 'e_UK', 'e_US', 'e_missing']

So you can see that column c was dropped from the resulting dataframe, but it still appears in the list of feature names.

So my question is: is there a way to have a feature-selection step inside a pipeline within a ColumnTransformer, or at least as a step before the ColumnTransformer in the outer Pipeline, so that the imputer's silent dropping of all-NaN columns is avoided?
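For context, the kind of outer step I have in mind would look something like this (a minimal sketch using plain pandas, assuming the drop can happen before the ColumnTransformer sees the data; `drop_all_nan_columns` is a hypothetical helper, not part of this repo):

```python
import numpy as np
import pandas as pd

def drop_all_nan_columns(df):
    """Return a copy of df without columns that contain only NaN values."""
    return df.dropna(axis=1, how="all")

df = pd.DataFrame({
    "a": [123.0, 145.0, np.nan],
    "c": [np.nan, np.nan, np.nan],  # all-NaN column, like "c" above
    "e": [np.nan, "GE", "US"],
})

cleaned = drop_all_nan_columns(df)
print(list(cleaned.columns))  # ['a', 'e'] -- "c" is gone before any imputing
```

That way the column names fed into the ColumnTransformer already match what comes out the other side.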

Thanks!

@kylegilde
Owner

Hi Maira, this is a good suggestion, and it's something that this class ideally would support, but I don't think it would be an easy change to support this use case.

In my design, I was trying to avoid having to loop through the elements within a ColumnTransformer step because of the complications that could arise.

Something would have to change at this point:

https://github.com/kylegilde/Kaggle-Notebooks/blob/master/Extracting-and-Plotting-Scikit-Feature-Names-and-Importances/feature_importance.py#L99-L101

Let me know if you think of a method to handle this, and I will think about it as well.
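One direction that might work (a hypothetical sketch, not part of the current implementation): when a ColumnTransformer step is itself a Pipeline, walk its sub-steps and, wherever a step exposes `get_support()` (as the scikit-learn feature selectors do), apply that mask to the list of input column names:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline

def selected_columns(fitted_pipeline, input_cols):
    """Narrow a list of column names by each fitted selector's support mask.

    Hypothetical helper: any sub-step exposing get_support() filters the list.
    """
    cols = list(input_cols)
    for _, step in getattr(fitted_pipeline, "steps", []):
        if hasattr(step, "get_support"):
            mask = step.get_support()
            cols = [c for c, keep in zip(cols, mask) if keep]
    return cols

# Toy example: the second column is constant, so VarianceThreshold drops it.
pipe = Pipeline([("remnulls", VarianceThreshold(threshold=0.0))])
X = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
pipe.fit(X)
print(selected_columns(pipe, ["a", "b"]))  # ['a']
```

Something like this could run right before the names are collected for each ColumnTransformer step, so dropped columns never enter the final list.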

In the long run, I think that Scikit-Learn will develop some better ways of getting feature names out of pipelines & transformers.

@kylegilde
Owner

Here is some of the work being done: scikit-learn/enhancement_proposals#48

@carvalhomb
Author

Hi Kyle,
Great to know that the Scikit-Learn folks have this on their radar.

I did some work on this and got the class working for my current purposes, but since it was kind of a dirty hack, it certainly introduced other problems. I'll fork your repo and push my changes; then, if you want, you can have a look and see if it gives you any ideas. What do you say?

Cheers!

@carvalhomb
Author

Here's my first attempt. I'm sure it creates new problems, because I haven't tested it properly, but it's an idea :)
https://github.com/carvalhomb/Kaggle-Notebooks/blob/dev/Extracting-and-Plotting-Scikit-Feature-Names-and-Importances/feature_importance.py
