
permute_cases: Error arguments imply differing number of rows: 30000, 0 #173

Open
agilebean opened this issue Dec 24, 2019 · 0 comments

Dear lime contributors,
thanks for your awesome work on this repository.
Alas, I ran into a reproducible error that took me several days to track down:

explanation.lime <- lime::explain(
  x = local.obs,
  explainer = explainer.lime,
  n_features = 5 
)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 30000, 0

Fortunately, I was able to narrow down both the location in the source code and the conditions that trigger it - but not completely, so I hope you can figure out the last mile.

The condition that triggers it is a column in the cases argument of permute_cases that is an integer with zero variance; in my case it is the column reviews.numHelpful:

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	6 obs. of  13 variables:
 $ reviews.doRecommend: Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2
 $ reviews.numHelpful : int  0 0 0 0 0 0
 $ reviews.rating     : int  4 4 4 5 5 5
 $ anger              : num  0 0 0 0 0 0

This column leads to an empty output within the permute_cases.data.frame function, in the branch of the if/else chain that computes "bin":

} else if (is.numeric(cases[[i]]) && bin_continuous) {
  bin <- sample(seq_along(feature_distribution[[i]]), nrows, TRUE, as.numeric(feature_distribution[[i]]))
  diff(bin_cuts[[i]])[bin] * runif(nrows) + bin_cuts[[i]][bin]
}

The empty second element can be seen in the resulting list:

$ : Factor w/ 2 levels "1","2": 1 2 1 2 2 2 2 1 2 1 ...
$ : int(0)
$ : int [1:30000] 14 5 5 19 31 10 27 7 10 10 ...
$ : num [1:30000] 0.021654 0.081145 0.039533 0.000972 0.029057 ...

I stepped through the conversion to a data frame and found that this line throws the above error:

perms <- as.data.frame(perms, stringsAsFactors = FALSE)
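The failure mode can be reproduced without lime. The following is a minimal sketch in base R; `perms_demo` and its element names are hypothetical stand-ins for the `perms` list, not the real object. It shows that a single zero-length element is enough to make the conversion fail, and that `lengths()` points at the offending element.

```r
# Hypothetical stand-in for `perms`: one element is empty, the others have n values.
n <- 5
perms_demo <- list(
  a = factor(sample(c("1", "2"), n, replace = TRUE)),
  b = integer(0),                        # the empty element, like int(0) above
  c = sample.int(30, n, replace = TRUE)
)

# Element lengths reveal which entry does not match the expected row count.
lens <- lengths(perms_demo)
bad <- names(lens)[lens != n]

# Capture the error message instead of stopping the script.
res <- tryCatch(
  as.data.frame(perms_demo, stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)
print(bad)
print(res)
```

Running the same `lengths()` check on the real `perms` list just before the conversion should identify which feature produced the empty vector.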

The feature_distribution[[2]] gives:

     FALSE       TRUE 
0.04648887 0.95351113 

This is wrong! This result belongs to the only factor, i.e. the first column (reviews.doRecommend), and should therefore be returned by feature_distribution[[1]], not feature_distribution[[2]]!
Consequently, the next line diff(bin_cuts[[2]])[bin] always returns NULL, which leads to an empty return value integer(0).

I could narrow the root cause down to this point, but I do not know what diff(bin_cuts[[2]])[bin] means or how the empty result can be prevented.
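Reading the excerpt above, the expression appears to draw a uniform value inside a randomly chosen bin: `diff(bin_cuts[[i]])` gives the bin widths, `[bin]` selects the width of the bin sampled for each row, and `bin_cuts[[i]][bin]` adds that bin's lower edge. The sketch below replays this with made-up numbers; `cuts` and `probs` are hypothetical, and it assumes that the cut points are NULL for the mismatched index, as the NULL result described above suggests.

```r
set.seed(1)
nrows <- 6
cuts  <- c(0, 1, 5, 10)        # hypothetical bin boundaries, i.e. 3 bins
probs <- c(0.5, 0.3, 0.2)      # hypothetical share of training data per bin

# Pick one bin per row, weighted by the bin frequencies.
bin   <- sample(seq_along(probs), nrows, TRUE, probs)
width <- diff(cuts)[bin]       # width of each chosen bin
lower <- cuts[bin]             # lower edge of each chosen bin
draws <- width * runif(nrows) + lower   # uniform draw inside the chosen bin

# With no cut points (NULL), the same expression collapses to length zero.
empty <- diff(NULL)[bin] * runif(nrows) + NULL[bin]
print(length(draws))
print(length(empty))
```

So the empty vector is not caused by the zero variance itself but by indexing into cut points that do not exist for that position.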

Update

I found a potential reason for this apparent index problem.
The feature distribution includes the target variable .outcome as its first list item, and thus all indices are off by one:

$.outcome
        1         2 
0.3277057 0.6722943 

$reviews.doRecommend
     FALSE       TRUE 
0.04648887 0.95351113
 
$reviews.numHelpful
           1            2            3            4 
0.9981241334 0.0012233912 0.0001631188 0.0004893565 

$anger
          1           2           3           4 
0.911100237 0.065900008 0.013620422 0.009379333

However, including the target variable seems unavoidable because the documentation for ?lime specifies:

x The training data used for training the model that should be explained.

So the training data (including the target), not just the features (excluding the target), must be fed into lime::lime(). Now I wonder:

Is this a problem in lime::lime() or in permute_cases()?

Can you fix this? Tricky...
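Since the excerpt from permute_cases.data.frame indexes cases[[i]], feature_distribution[[i]] and bin_cuts[[i]] by position, one possible workaround is to make sure the data given to lime::lime() has exactly the same columns, in the same order, as the observations given to lime::explain(). This is an untested assumption, not a confirmed fix. The sketch below uses only base R; `train` and `local_obs` are hypothetical stand-ins for the real data.

```r
# Hypothetical training data that still contains the target column .outcome.
train <- data.frame(
  .outcome            = factor(c("1", "2", "2")),
  reviews.doRecommend = factor(c("TRUE", "TRUE", "FALSE")),
  reviews.numHelpful  = c(0L, 2L, 0L)
)

# Hypothetical observations to explain: feature columns only.
local_obs <- train[1:2, c("reviews.doRecommend", "reviews.numHelpful")]

# Columns present in the training data but absent from the observations.
extra <- setdiff(names(train), names(local_obs))

# Restrict the training data to the observation columns, in the same order.
train_features <- train[, names(local_obs), drop = FALSE]
aligned <- identical(names(train_features), names(local_obs))
print(extra)
print(aligned)
```

If `extra` is non-empty for the real data, building the explainer from `train_features` instead of `train` would be the thing to try first.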
