Anchor Tabular - KeyError in get_features_index #924

Open
pocman opened this issue May 17, 2023 · 5 comments
Labels
Type: Question User questions

Comments

@pocman

pocman commented May 17, 2023

I get the following error when calling explain on an AnchorTabular explainer fitted on a mix of numerical and categorical data, using alibi 0.9.2.

Error
Traceback (most recent call last):
  File "test_explainer.py", line 21, in test_explainer
    explanation = explainer.explain(X.to_numpy()[1, :],
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_tabular.py", line 842, in explain
    result: Any = mab.anchor_beam(
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_base.py", line 718, in anchor_beam
    candidate_anchors = self.kllucb(
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_base.py", line 292, in kllucb
    pos, total = self.draw_samples(anchors_to_sample, 1)
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_base.py", line 357, in draw_samples
    samples_iter = [self.sample_fcn((i, tuple(self.state['t_order'][anchor])), num_samples=batch_size)
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_base.py", line 357, in <listcomp>
    samples_iter = [self.sample_fcn((i, tuple(self.state['t_order'][anchor])), num_samples=batch_size)
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_tabular.py", line 211, in __call__
    raw_data, d_raw_data, coverage = self.perturbation(anchor[1], num_samples)
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_tabular.py", line 284, in perturbation
    allowed_bins, allowed_rows, unk_feat_vals = self.get_features_index(anchor)
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_tabular.py", line 487, in get_features_index
    allowed_rows = {f_id: self.val2idx[f_id][f_val] for f_id, f_val in zip(cat_feat_ids, cat_feat_vals)}
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_tabular.py", line 487, in <dictcomp>
    allowed_rows = {f_id: self.val2idx[f_id][f_val] for f_id, f_val in zip(cat_feat_ids, cat_feat_vals)}
KeyError: 'value_A'

self.val2idx is declared as Dict[int, DefaultDict[int, Any]] = {}, but f_val is a value taken from self.cat_lookup.
self.cat_lookup maps each categorical feature to its value in X.
cat_lookup is initialized in one of two ways:

self.cat_lookup = dict(zip(self.categorical_features, X))

or

self.cat_lookup[cat_enc_idx] = X[cat_enc_idx]

So, from what I understand, we are using a categorical value (value_A in my error message) stored in f_val as a key into a dict that is supposed to have int keys.
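
The mismatch described above can be reproduced in isolation. This is a minimal sketch, not alibi's actual code: `val2idx` and `cat_lookup_*` are plain dictionaries standing in for the attributes named in the traceback, and the row indices are made up for illustration.

```python
# Simplified stand-ins for the attributes seen in the traceback (assumed shapes).
# val2idx: feature id -> {label-encoded category (int) -> row indices with that value}
val2idx = {0: {0: [0, 2], 1: [1]}}

# cat_lookup is filled from the instance being explained, so it holds
# whatever the raw row contains for feature 0.
cat_lookup_raw = {0: "value_A"}   # raw string category
cat_lookup_encoded = {0: 0}       # the same category, label-encoded


def allowed_rows(cat_lookup):
    """Mimic the failing dict comprehension: look up each feature value in val2idx."""
    return {f_id: val2idx[f_id][f_val] for f_id, f_val in cat_lookup.items()}


try:
    allowed_rows(cat_lookup_raw)
    raised = None
except KeyError as err:
    raised = err.args[0]          # the string key that was not found

rows_ok = allowed_rows(cat_lookup_encoded)
```

With the raw string the lookup fails on the same key as in the traceback, while the label-encoded value resolves normally.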

I'm going to try to understand why anchor_tabular_adult.ipynb is working.

@jklaise
Member

jklaise commented May 18, 2023

Hey, issues with mixed data usually arise from a mis-specification of the categorical_names parameter. I would double-check this first and make sure that all categorical columns are keyed and that the values of each key cover all categories.

Edit: I see the call to to_numpy(), which suggests that X is a pd.DataFrame. For categorical variables we use the convention that they need to be label-encoded (i.e. values per category should be [0, 1, ..., n-1], where n is the number of categories for that variable). I would also check whether that's the case.
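
As an illustration of that convention, here is a minimal, dependency-free sketch of label-encoding the categorical columns of a small dataset and building a matching column-index-to-names dictionary. The helper `label_encode_columns` and the sample rows are hypothetical and not part of alibi; plain Python lists are used only to keep the example self-contained.

```python
def label_encode_columns(rows, categorical_columns):
    """Label-encode the given columns of a list-of-lists dataset.

    Returns (encoded_rows, categorical_names), where categorical_names maps
    a column index to its list of category names, ordered so that the name
    at position k is encoded as integer k.
    """
    categorical_names = {}
    encoded = [list(row) for row in rows]  # copy so the input is left untouched
    for col in categorical_columns:
        # Sorted unique values give a deterministic 0..n-1 assignment.
        names = sorted({row[col] for row in rows})
        code_of = {name: code for code, name in enumerate(names)}
        for row in encoded:
            row[col] = code_of[row[col]]
        categorical_names[col] = names
    return encoded, categorical_names


# Hypothetical mixed data: column 0 is numerical, columns 1 and 2 are categorical.
rows = [
    [39, "value_B", "yes"],
    [50, "value_A", "no"],
    [38, "value_A", "yes"],
]
encoded_rows, categorical_names = label_encode_columns(rows, categorical_columns=[1, 2])
```

Here every categorical column is keyed in `categorical_names` and each list covers all categories of its column, which is the shape the two checks above ask for.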

@jklaise jklaise added the Type: Question User questions label May 22, 2023
@pocman
Author

pocman commented May 23, 2023

> For categorical variables we use the convention that they need to be label-encoded (i.e. values per category should be [0, 1, ..., n-1], where n is the number of categories for that variable). I would also check whether that's the case.

This is it, that's why my example is failing.
It would be great to add some assert check to enforce that convention in anchor_tabular init.

@jklaise
Member

jklaise commented May 23, 2023

> > For categorical variables we use the convention that they need to be label-encoded (i.e. values per category should be [0, 1, ..., n-1], where n is the number of categories for that variable). I would also check whether that's the case.
>
> This is it, that's why my example is failing. It would be great to add some assert check to enforce that convention in anchor_tabular init.

There are multiple ways to go about validation, and validating custom user data is usually fairly tricky. I would be keen to hear if you have more specific suggestions, e.g.:

  • validate categorical_names - this doesn't give us much as it wouldn't confirm whether the actual data is label-encoded or not
  • validate X_train during fit - here we could cross-reference with categorical_names and check that the categorical columns are as expected
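
The second option could look roughly like the sketch below. It is only an illustration under assumptions: `check_label_encoded` is a hypothetical helper, not an alibi function, and it treats `categorical_names` as a mapping from column index to the list of category names.

```python
import numbers


def check_label_encoded(X_train, categorical_names):
    """Return the first offending (row, column, value) per categorical column.

    An empty list means every categorical column only contains integer
    codes in the range 0..n-1, where n = len(categorical_names[column]).
    """
    problems = []
    for col, names in sorted(categorical_names.items()):
        n = len(names)
        for row_idx, row in enumerate(X_train):
            value = row[col]
            is_code = (
                isinstance(value, numbers.Real)
                and value == value                          # rules out NaN
                and value not in (float("inf"), float("-inf"))
                and value == int(value)                     # integral, e.g. 2 or 2.0
                and 0 <= value < n
            )
            if not is_code:
                problems.append((row_idx, col, value))
                break  # one example per column is enough for an error message
    return problems


names = {1: ["value_A", "value_B"], 2: ["no", "yes"]}
ok = check_label_encoded([[39, 1, 1], [50, 0, 0]], names)
bad = check_label_encoded([[39, "value_A", 1], [50, 0, -1]], names)
```

A check of this kind would flag both raw string categories and a -1 sentinel before explain is ever called.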

@pocman
Author

pocman commented May 23, 2023

I would suggest doing it in gen_category_map and maybe updating the description of the method.


On another note, AnchorTabular supporting only label-encoded values is a major blocker for my use case, since in my stack I represent missing data as -1. I believe minor changes in the explainer would allow it to support both encoded and raw label values.

@jklaise
Member

jklaise commented May 23, 2023

I understand that some of our conventions make it more difficult to cater for all use cases, but this is a trade-off we've had to make, at least for the time being. The alternative here would be having to consider any custom encoding scheme. For example, even allowing label-encoding with arbitrary user-supplied integers for categories would be infeasible without every user providing even more metadata about their specific encoding.

As for your case, there are a couple of workarounds. Essentially missing data is another type of category (separate for each categorical data column). This gives two options:

  • If changing your encoding is an option, you could encode the missing values as the last category for each column. E.g. instead of -1 for every missing value across all categories, for a column i with categories encoded as 0, 1, ..., n_i-1, a missing value would be encoded as n_i (i.e. as an extra category for column i).
  • If changing the encoding for your model is not feasible, you could consider writing a wrapper prediction function similar to this. I.e. the wrapper function would expect the data as alibi expects it (label-encoded - you could use the same trick as above to encode missing data as an extra category), then a transform_input function would transform all those extra categories to -1 before feeding into the model.
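
The two workarounds can be sketched together as follows. This is a hedged illustration, not alibi code: `encode_missing`, `make_wrapped_predictor`, `n_categories` and `toy_model` are hypothetical names, and plain Python lists stand in for the arrays a real predictor would receive.

```python
MISSING = -1  # sentinel used by the underlying model, as described in the thread


def encode_missing(rows, n_categories):
    """Explainer-side encoding: replace -1 in column i with the extra category n_i."""
    return [
        [n_categories[col] if col in n_categories and v == MISSING else v
         for col, v in enumerate(row)]
        for row in rows
    ]


def make_wrapped_predictor(model_predict, n_categories):
    """Wrap model_predict so it accepts data where missing is encoded as n_i."""
    def transform_input(rows):
        # Map the extra category back to the sentinel the model understands.
        return [
            [MISSING if col in n_categories and v == n_categories[col] else v
             for col, v in enumerate(row)]
            for row in rows
        ]

    def predictor(rows):
        return model_predict(transform_input(rows))

    return predictor


# Column 1 has categories 0 and 1, so a missing value becomes category 2.
n_categories = {1: 2}


def toy_model(rows):
    """Hypothetical model: predicts 1 whenever column 1 is missing."""
    return [1 if row[1] == MISSING else 0 for row in rows]


explainer_rows = encode_missing([[39, 0], [50, -1]], n_categories)
predictor = make_wrapped_predictor(toy_model, n_categories)
predictions = predictor(explainer_rows)
```

Under this scheme the category list for each column would presumably also need one extra entry naming the missing category, so that the names still cover every encoded value.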

> I believe minor changes in the explainer would allow it to support both encoded and raw label values.

I'm not sure I follow. By "raw label values", do you mean string labels? It would be good to see what you have in mind.
