Dependent Variables? #317

Xschwartz16 opened this issue Oct 30, 2022 · 6 comments

@Xschwartz16

Hello! I'm just wondering if you could point me in the right direction for understanding how Shapley values deal with dependent or correlated variables? Trying to understand how they are calculated differently under the hood than if all the features were independent. Thank you!

@martinju (Member) commented Nov 1, 2022

Hi!
I suggest reading our paper: https://www.sciencedirect.com/science/article/pii/S0004370221000539
It should describe the difference quite clearly (it all comes down to whether you estimate $v(S) = E[f(x) \mid x_S]$ properly or not).

@Xschwartz16 (Author)

Thanks so much for the response! Would you have any advice on how to code that in a linear regression context? I'm trying to work through coding simple Shapley examples by hand (i.e. using base R and the tidyverse) and struggling to make sure I am doing it correctly.

@martinju (Member) commented Nov 1, 2022

No problem. Take a look at Appendix B in the aforementioned paper. There we write out a simplified explicit formula for the linear regression case, both when assuming independence and when not. Assuming a simple dependence structure, it should be straightforward to code it up from there. If you assume e.g. Gaussian features, you can always double-check against the shapr package.
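For the two-feature Gaussian case, a minimal sketch of the conditional approach could look like the following (the function name and parameterization are made up for illustration, not taken from shapr):

```r
# Minimal sketch: conditional Shapley values for f(x) = b0 + b1*x1 + b2*x2
# with bivariate Gaussian features. v(S) = E[f(x) | x_S], and the conditional
# mean of the unobserved feature is mu_j + Sigma[j,k]/Sigma[k,k] * (x_k - mu_k).
shapley_linear_gaussian <- function(x, beta, mu, Sigma) {
  b0 <- beta[1]; b <- beta[2:3]
  cond_mean <- function(j, k) mu[j] + Sigma[j, k] / Sigma[k, k] * (x[k] - mu[k])

  v_empty <- b0 + sum(b * mu)                           # S = {}
  v_1     <- b0 + b[1] * x[1] + b[2] * cond_mean(2, 1)  # S = {1}
  v_2     <- b0 + b[1] * cond_mean(1, 2) + b[2] * x[2]  # S = {2}
  v_12    <- b0 + sum(b * x)                            # S = {1, 2}

  # Average marginal contribution over the two feature orderings
  c(phi0 = v_empty,
    phi1 = ((v_1 - v_empty) + (v_12 - v_2)) / 2,
    phi2 = ((v_2 - v_empty) + (v_12 - v_1)) / 2)
}

# Example: correlation 0.7, coefficients assumed known
Sigma <- matrix(c(1, 0.7, 0.7, 1), 2, 2)
shapley_linear_gaussian(x = c(1, 0.5), beta = c(0, 1, 2),
                        mu = c(0, 0), Sigma = Sigma)
```

Note that $\phi_1 + \phi_2 = v(\{1,2\}) - v(\emptyset)$, so the contributions sum to the difference between the full prediction and the mean prediction, as they should.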

@Xschwartz16 (Author)

Hello!

Thank you for all your help. I'm still trying to understand the best way to code Shapley values from a conceptual standpoint.

Say I have $x_1$ and $x_2$ drawn from a multivariate normal distribution with some amount of correlation, and suppose $y$ follows the model

$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i$

From a Shapley standpoint we have

$m_1 = y \sim 1$
$m_2 = y \sim x_1$
$m_3 = y \sim x_2$
$m_4 = y \sim x_1 + x_2$

If we define $f_j$ to be the prediction of $y_i$ generated by model $m_j$, then the Shapley value of $x_1$ is $\frac{(f_2 - f_1)+(f_4-f_3)}{2}$, i.e. the average of the differences $(f_2 - f_1)$ and $(f_4 - f_3)$.
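In code, what I am doing is roughly this (a minimal sketch with made-up data and coefficients):

```r
# Simulate correlated Gaussian features and a linear response
set.seed(1)
n <- 1000
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)           # correlation 0.5
X <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
dat <- data.frame(x1 = X[, 1], x2 = X[, 2])
dat$y <- 1 * dat$x1 + 2 * dat$x2 + rnorm(n)

# One refitted model per feature subset: m_1, ..., m_4
m <- list(lm(y ~ 1, dat), lm(y ~ x1, dat), lm(y ~ x2, dat), lm(y ~ x1 + x2, dat))

# f_1, ..., f_4: predictions for the observation being explained
f <- sapply(m, predict, newdata = dat[1, ])

phi1 <- ((f[2] - f[1]) + (f[4] - f[3])) / 2
phi2 <- ((f[3] - f[1]) + (f[4] - f[2])) / 2
```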

However, the values generated by this method are slightly different from the values generated by the shapr and iml packages when the correlation is not 0 (they match when the correlation is 0). Is there any chance you could point me in the right direction as to where I am going wrong?

Thank you!

@Xschwartz16 (Author)

Just wanted to see if you have had a chance to look at this. Thank you so much!

@martinju (Member)

Hi, sorry, totally forgot about this.

What you are describing above is Shapley regression values, which retrain the model on every subset of the features and predict with those submodels to mimic feature removal. We (and essentially everyone else doing Shapley-value-based prediction explanation) don't retrain models; instead, we use the expected prediction conditional on different subsets of the features.

If you want more information, I suggest looking up "Shapley regression values" (I believe the term is used both for global R-squared decomposition and local prediction explanation), and revisiting our paper in view of that :-)
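If it helps, the conditional approach can be checked numerically with shapr on simulated data like yours; a sketch assuming the pre-1.0 shapr API (check the documentation of the version you have installed):

```r
library(shapr)
library(MASS)

# Simulate correlated Gaussian features and fit the full linear model
set.seed(1)
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
X <- mvrnorm(1000, mu = c(0, 0), Sigma = Sigma)
dat <- data.frame(x1 = X[, 1], x2 = X[, 2])
dat$y <- dat$x1 + 2 * dat$x2 + rnorm(1000)
model <- lm(y ~ x1 + x2, data = dat)

# Gaussian conditional expectations for v(S) = E[f(x) | x_S]
explainer <- shapr(dat[, c("x1", "x2")], model)
explanation <- explain(
  dat[1, c("x1", "x2")],
  explainer = explainer,
  approach = "gaussian",
  prediction_zero = mean(dat$y)
)
print(explanation)
```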
