
Issue when initializing explainer through TabularExplainer and KernelExplainer #463

Open
lucazav opened this issue Nov 18, 2021 · 1 comment

lucazav commented Nov 18, 2021

I have a trained regression model (a VotingEnsemble model obtained through training with Azure AutoML) and I'd like to generate an explainer using TabularExplainer.
My dataset has a column ('CALENDAR_DATE') of type datetime64[ns], which my model handles correctly (predict method works fine).
After importing the TabularExplainer class, I tried to initialize my explainer with:

features = X_train.columns

explainer = TabularExplainer(model,
                             X_train,
                             features=features,
                             model_task='regression')

but I get the following error:

RuntimeError: cuML is required to use GPU explainers.
Check https://rapids.ai/start.html for more
information on how to install it.
The above exception was the direct cause of the following exception:
[...]
ValueError: Could not find valid explainer to explain model

I get the same error message when I explicitly force use_gpu=False:

explainer = TabularExplainer(model,
                             X_train,
                             features=features,
                             model_task='regression',
                             use_gpu=False)

Thus, I proceeded to explicitly initialize a KernelExplainer with:

explainer = KernelExplainer(model,
                            X_train,
                            features=features,
                            model_task='regression')

but I received the error:

float() argument must be a string or a number, not 'Timestamp'

Therefore I changed the 'CALENDAR_DATE' column type to string, with:

X_train_copy = X_train.copy()
X_train_copy['CALENDAR_DATE'] = X_train_copy['CALENDAR_DATE'].astype(str)

After this, both TabularExplainer and KernelExplainer correctly work when initializing the explainers (with the modified dataset X_train_copy).

Why does this happen?

@imatiach-msft (Collaborator)

@lucazav I believe the issue with the bad cuML error message appeared in interpret-community 0.18.0 and was fixed in 0.21.0:

#450
See description:

Based on experience of debugging with customer, when TabularExplainer fails with default use_gpu=False on GPUKernelExplainer it prints the last warning, even though it will always fail. This PR separates it out so it only runs when use_gpu flag is on. The previous logic would skip every explainer if use_gpu=True other than GPUKernelExplainer, but still for some reason run it even if use_gpu=False. By separating it out, the customer will once again see the most useful error message from the last default catch-all KernelExplainer.

With the latest version you will instead see the error you encountered:
float() argument must be a string or a number, not 'Timestamp'

This seems to be due to the timestamp column. However, it seems like the explainers should be able to support this datatype, based on:

https://github.com/interpretml/interpret-community/blob/master/python/interpret_community/dataset/dataset_wrapper.py#L25

It should automatically featurize the timestamp column and explain numeric fields:

                tmp_dataset[time_col_name + '_year'] = tmp_dataset[time_col_name].map(lambda x: x.year)
                tmp_dataset[time_col_name + '_month'] = tmp_dataset[time_col_name].map(lambda x: x.month)
                tmp_dataset[time_col_name + '_day'] = tmp_dataset[time_col_name].map(lambda x: x.day)
                tmp_dataset[time_col_name + '_hour'] = tmp_dataset[time_col_name].map(lambda x: x.hour)
                tmp_dataset[time_col_name + '_minute'] = tmp_dataset[time_col_name].map(lambda x: x.minute)
                tmp_dataset[time_col_name + '_second'] = tmp_dataset[time_col_name].map(lambda x: x.second)

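For illustration, the same decomposition can be reproduced outside the library with pandas' .dt accessor. This is a minimal sketch; the DataFrame and its values are made up, not taken from the issue:

```python
import pandas as pd

# Hypothetical frame with a single datetime column, mirroring the
# featurization the dataset wrapper applies for MimicExplainer.
df = pd.DataFrame({"CALENDAR_DATE": pd.to_datetime(
    ["2021-11-18 09:30:15", "2021-12-01 17:05:00"])})

col = "CALENDAR_DATE"
for part in ("year", "month", "day", "hour", "minute", "second"):
    # .dt exposes the same components the wrapper extracts via map()
    df[col + "_" + part] = getattr(df[col].dt, part)
df = df.drop(columns=[col])  # only numeric fields remain for the explainer

print(df.loc[0].tolist())  # [2021, 11, 18, 9, 30, 15]
```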
I think I see the problem. This featurization only exists for the MimicExplainer, based on this search:
https://github.com/interpretml/interpret-community/search?q=apply_timestamp_featurizer

So basically all other explainers (except for MimicExplainer) can't handle a timestamp-typed column. You can convert the column to numeric (e.g. the float value in seconds), but for some explainers, like the LIME explainer, this won't work well: LIME won't be able to sample around the value correctly to get meaningful results.

For KernelExplainer it might work more sensibly, since it just replaces the value with the background data rather than perturbing it. The feature importance might be correct in the sense of how important the column is, but it will be difficult to interpret in the sense that increasing or decreasing the value results in a specific change to the output (which you can't assume anyway with SHAP values, but which will be especially hard to assume here), since there may be many complex cyclical/seasonal relationships in the time feature.

I think it's more useful to break the time feature into components like the above and view feature importances in terms of day/hour/month/etc. to get a better understanding of how it may influence the model's output.
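The numeric conversion mentioned above could look like the following sketch (a made-up two-row frame; a datetime64[ns] column cast to int64 yields nanoseconds since the Unix epoch):

```python
import pandas as pd

# Hypothetical example: convert a datetime64[ns] column to epoch seconds
df = pd.DataFrame({"CALENDAR_DATE": pd.to_datetime(
    ["1970-01-01", "1970-01-02"])})

# The int64 view is nanoseconds since the epoch; divide to get seconds
df["CALENDAR_DATE"] = df["CALENDAR_DATE"].astype("int64") / 1e9

print(df["CALENDAR_DATE"].tolist())  # [0.0, 86400.0]
```

This keeps the explainer's float() conversion happy, but as noted, the resulting importance is harder to read than the per-component decomposition.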
