
Dimension errors when using sklearn OneHotEncoder with min_frequency parameter #545

Open
dclaz opened this issue Nov 4, 2022 · 1 comment

dclaz commented Nov 4, 2022

The documentation suggests that the sklearn OneHotEncoder should be a viable transformation when using the MimicExplainer, but I'm getting errors if I use it and set the min_frequency parameter to remove category levels with low counts.

If I set up my data preprocessor like this:

(where I have ~7 categorical features, each with many levels)

# Define categorical transformer
categorical_transformer = Pipeline(
    steps=[
        ("cat_impute", SimpleImputer(strategy="constant", fill_value='missing')),
        ("onehot", OneHotEncoder(drop=None, handle_unknown="infrequent_if_exist",
                                 sparse=False, min_frequency=0.01)),
    ]
)
# Define numeric transformer
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

data_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)       
    ],
    remainder="drop",
)

I get the following error:

[screenshot of the dimension-mismatch traceback]

However, if I instead define a separate transformer for each categorical feature, the Explainer runs, albeit with a "Many to one/many maps found in input" warning, and it produces outputs that don't really make sense: half the features end up with very, very similar SHAP values.

# Define categorical transformer
categorical_transformer = Pipeline(
    steps=[
        ("cat_impute", SimpleImputer(strategy="constant", fill_value='missing')),
        ("onehot", OneHotEncoder(drop=None, handle_unknown="infrequent_if_exist",
                                 sparse=False, min_frequency=0.01)),
    ]
)
# Define numeric transformer
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# Construct a list of per-feature categorical transformers
categorical_treatments_list = [(feature, categorical_transformer, [feature]) for feature in categorical_features]

# Construct the data preprocessor
data_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        *categorical_treatments_list
    ],
    remainder="drop",
)
@paulbkoch

Hi @dclaz -- This appears to be a question for the interpret-community repo. Transferring your issue there.

@paulbkoch paulbkoch transferred this issue from interpretml/interpret Nov 24, 2022