Anchor Tabular - KeyError in get_features_index #924

pocman · 2023-05-17T13:01:15Z

I get the following error when calling explain on an AnchorTabular explainer fitted on a mix of numerical and categorical data, using alibi 0.9.2.

Error
Traceback (most recent call last):
  File "test_explainer.py", line 21, in test_explainer
    explanation = explainer.explain(X.to_numpy()[1, :],
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_tabular.py", line 842, in explain
    result: Any = mab.anchor_beam(
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_base.py", line 718, in anchor_beam
    candidate_anchors = self.kllucb(
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_base.py", line 292, in kllucb
    pos, total = self.draw_samples(anchors_to_sample, 1)
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_base.py", line 357, in draw_samples
    samples_iter = [self.sample_fcn((i, tuple(self.state['t_order'][anchor])), num_samples=batch_size)
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_base.py", line 357, in <listcomp>
    samples_iter = [self.sample_fcn((i, tuple(self.state['t_order'][anchor])), num_samples=batch_size)
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_tabular.py", line 211, in __call__
    raw_data, d_raw_data, coverage = self.perturbation(anchor[1], num_samples)
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_tabular.py", line 284, in perturbation
    allowed_bins, allowed_rows, unk_feat_vals = self.get_features_index(anchor)
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_tabular.py", line 487, in get_features_index
    allowed_rows = {f_id: self.val2idx[f_id][f_val] for f_id, f_val in zip(cat_feat_ids, cat_feat_vals)}
  File "venv3.8/lib/python3.8/site-packages/alibi/explainers/anchors/anchor_tabular.py", line 487, in <dictcomp>
    allowed_rows = {f_id: self.val2idx[f_id][f_val] for f_id, f_val in zip(cat_feat_ids, cat_feat_vals)}
KeyError: 'value_A'

self.val2idx is of type Dict[int, DefaultDict[int, Any]] = {} but f_val is a value from self.cat_lookup.
And self.cat_lookup maps categorical variables to their value in X.
cat_lookup is initialized in one of two ways:

self.cat_lookup = dict(zip(self.categorical_features, X))

or

self.cat_lookup[cat_enc_idx] = X[cat_enc_idx]

So, from what I understand, we are using a categorical value (value_A in my error message) stored in f_val to index a dict that is supposed to have int keys.

I'm going to try to understand why anchor_tabular_adult.ipynb is working.


jklaise · 2023-05-18T08:54:41Z

Hey, issues with mixed data usually arise from a mis-specification of the categorical_names parameter. I would double-check this first and make sure that all categorical columns are keyed and that the values of each key cover all categories.

Edit: I see the call to to_numpy(), which suggests that X is a pd.DataFrame. For categorical variables we use the convention that they need to be label-encoded (i.e. values per category should be [0, 1, ..., n-1], where n is the number of categories for that variable). I would also check whether that's the case.
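
As a minimal sketch of that convention (not part of alibi; `label_encode` and the toy data are hypothetical names used only for illustration), each categorical column can be mapped to integer codes 0..n-1 while the category names are recorded per column index, so that every categorical column is keyed and every category is listed:

```python
import numpy as np


def label_encode(X_raw, categorical_columns):
    """Label-encode the given columns; return the encoded array and a name map.

    Hypothetical helper: X_raw is a 2D array-like with raw category values
    (e.g. strings) in `categorical_columns` and numbers elsewhere.
    """
    X_raw = np.asarray(X_raw, dtype=object)
    X_enc = np.empty(X_raw.shape, dtype=float)
    categorical_names = {}
    for col in range(X_raw.shape[1]):
        if col in categorical_columns:
            # np.unique sorts the categories; `codes` are indices into them,
            # so the encoded values are exactly 0, 1, ..., n-1.
            categories, codes = np.unique(
                X_raw[:, col].astype(str), return_inverse=True
            )
            X_enc[:, col] = codes
            categorical_names[col] = categories.tolist()
        else:
            X_enc[:, col] = X_raw[:, col].astype(float)
    return X_enc, categorical_names


# Toy data: column 0 is categorical, column 1 is numerical.
X_raw = [["value_B", 3.5], ["value_A", 1.0], ["value_B", 2.0]]
X_enc, categorical_names = label_encode(X_raw, categorical_columns={0})
```

Here X_enc holds integer codes in column 0, and categorical_names is keyed by column index with the category names in code order, which is what would be passed as the categorical_names parameter.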

pocman · 2023-05-23T08:07:02Z

> For categorical variables we use the convention that they need to be label-encoded (i.e. values per category should be [0, 1, ..., n-1] where n is the number of categories for that variable). Would also check if that's the case.

This is it, that's why my example is failing.
It would be great to add some assert check to enforce that convention in anchor_tabular init.

jklaise · 2023-05-23T08:59:52Z

> > For categorical variables we use the convention that they need to be label-encoded (i.e. values per category should be [0, 1, ..., n-1] where n is the number of categories for that variable). Would also check if that's the case.

> This is it, that's why my example is failing. It would be great to add some assert check to enforce that convention in anchor_tabular init.

There are multiple ways to go about validation, and validating custom user data is usually fairly tricky. I would be keen to hear if you have more specific suggestions, e.g.:

- validate categorical_names - this doesn't give us much as it wouldn't confirm whether the actual data is label-encoded or not
- validate X_train during fit - here we could cross-reference with categorical_names and check that the categorical columns are as expected
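
For the second option, a rough sketch of what such a check might look like (`check_label_encoded` is a hypothetical helper, not an existing alibi function), assuming X_train is a 2D array and categorical_names maps each categorical column index to its list of category names:

```python
import numpy as np


def check_label_encoded(X_train, categorical_names):
    """Raise ValueError if a categorical column is not coded as 0..n-1.

    Hypothetical validation helper; n is taken from the number of names
    listed for that column in `categorical_names`.
    """
    X_train = np.asarray(X_train)
    for col, names in categorical_names.items():
        n = len(names)
        try:
            values = np.unique(X_train[:, col]).astype(float)
        except (TypeError, ValueError):
            # Raw labels such as 'value_A' cannot be read as numeric codes.
            raise ValueError(
                f"column {col}: values are not numeric, "
                f"expected integer codes 0..{n - 1}"
            ) from None
        # Anything outside {0, ..., n-1} (including -1 or NaN) is rejected.
        unexpected = values[~np.isin(values, np.arange(n))]
        if unexpected.size:
            raise ValueError(
                f"column {col}: unexpected codes {unexpected.tolist()}, "
                f"expected 0..{n - 1}"
            )


# Passes silently: column 0 uses codes 0, 1, 2 for three named categories.
check_label_encoded(
    np.array([[0, 1.5], [1, 2.0], [2, 0.1]]),
    {0: ["a", "b", "c"]},
)
```

A check like this only confirms that the codes fall in the expected range; it cannot tell whether the names are listed in the order that matches the user's encoding.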

pocman · 2023-05-23T13:02:58Z

I would suggest doing it in gen_category_map and maybe updating the description of the method.

On another note, AnchorTabular supporting only label-encoded values is a major blocker for my use case, since in my stack I represent missing data as -1. I believe minor changes in the explainer would allow it to support both encoded and raw label values.

jklaise · 2023-05-23T13:40:27Z

I understand that some of our conventions make it more difficult to cater for all use-cases, but this is a trade-off we've had to make, at least for the time being. The alternative here would be having to consider any custom encoding scheme, e.g. even allowing label-encoding with arbitrary user-supplied integers for categories would be infeasible without every user providing even more metadata about their specific encoding.

As for your case, there are a couple of workarounds. Essentially missing data is another type of category (separate for each categorical data column). This gives two options:

- If changing your encoding is an option, you could encode the missing values as the last category for each column. E.g. instead of -1 for every missing value across all categories, for a column i with categories encoded as 0, 1, ..., n_i-1, a missing value would be encoded as n_i (i.e. as an extra category for column i).
- If changing the encoding for your model is not feasible, you could consider writing a wrapper prediction function similar to this. I.e. the wrapper function would expect the data as alibi expects it (label-encoded - you could use the same trick as above to encode missing data as an extra category), then a transform_input function would transform all those extra categories to -1 before feeding into the model.
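
A minimal sketch of both ideas, under stated assumptions: `encode_missing_as_extra_category`, `wrap_predictor`, `n_categories` and `dummy_model` are hypothetical names for illustration, and the model is assumed to take a 2D float array in which -1 marks missing categorical values.

```python
import numpy as np

MISSING = -1  # the model's own marker for missing data, as in this thread


def encode_missing_as_extra_category(X, n_categories):
    """For each categorical column i, replace MISSING with n_i (an extra code)."""
    X = np.array(X, dtype=float)  # np.array copies, the input is left untouched
    for col, n in n_categories.items():
        X[X[:, col] == MISSING, col] = n
    return X


def wrap_predictor(predict_fn, n_categories):
    """Return a predictor that accepts label-encoded data with the extra code."""

    def transform_input(X):
        # Map the extra category n_i back to MISSING before calling the model.
        X = np.atleast_2d(np.array(X, dtype=float))
        for col, n in n_categories.items():
            X[X[:, col] == n, col] = MISSING
        return X

    def wrapped_predict(X):
        return predict_fn(transform_input(X))

    return wrapped_predict


def dummy_model(X):
    # Stand-in for a real model: predicts 1 when column 0 is missing.
    return (X[:, 0] == MISSING).astype(int)


n_categories = {0: 2}  # column 0 has two real categories, coded 0 and 1
X_model = np.array([[0, 2.5], [MISSING, 1.0], [1, 0.3]])
X_alibi = encode_missing_as_extra_category(X_model, n_categories)
predictions = wrap_predictor(dummy_model, n_categories)(X_alibi)
```

With this arrangement the explainer would only ever see codes 0..n_i, so the entry for each such column in categorical_names would need one extra name (for example "missing") at position n_i.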

> I believe minor changes in the explainer would allow it to support both encoded and raw label values.

I'm not sure I follow: by "raw label values", do you mean string labels? It would be good to see what you have in mind.

jklaise added the Type: Question User questions label May 22, 2023
