Dependent Variables? #317

Xschwartz16 opened this issue Oct 30, 2022 · 6 comments

@Xschwartz16

Hello! I'm just wondering if you could point me in the right direction for understanding how Shapley values deal with dependent or correlated variables? Trying to understand how they are calculated differently under the hood than if all the features were independent. Thank you!

@martinju (Member) commented Nov 1, 2022

Hi!
I suggest reading our paper: https://www.sciencedirect.com/science/article/pii/S0004370221000539
It should describe the difference quite clearly (it all comes down to whether you estimate $v(S) = E[f(x) \mid x_S]$ properly or not).

@Xschwartz16 (Author)

Thanks so much for the response! Would you have any advice on how to code that in a linear regression context? I'm trying to work through coding simple Shapley examples by hand (i.e. using base R and the tidyverse) and struggling to make sure I am doing it correctly.

@martinju (Member) commented Nov 1, 2022

No problem. Take a look at Appendix B in the aforementioned paper. There we write out a simplified explicit formula for the linear regression case, both when assuming independence and when not. Assuming a simple dependence structure, it should be straightforward to code it up from there. If you assume e.g. Gaussian features, you can always double-check against the shapr package.
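For the two-feature Gaussian case, a minimal sketch of the conditional approach could look like the following (the function name and parameterization are made up for illustration, not taken from shapr):

```r
# Minimal sketch: conditional Shapley values for f(x) = b0 + b1*x1 + b2*x2
# with bivariate Gaussian features. v(S) = E[f(x) | x_S], and the conditional
# mean of the unobserved feature is mu_j + Sigma[j,k]/Sigma[k,k] * (x_k - mu_k).
shapley_linear_gaussian <- function(x, beta, mu, Sigma) {
  b0 <- beta[1]; b <- beta[2:3]
  cond_mean <- function(j, k) mu[j] + Sigma[j, k] / Sigma[k, k] * (x[k] - mu[k])

  v_empty <- b0 + sum(b * mu)                           # S = {}
  v_1     <- b0 + b[1] * x[1] + b[2] * cond_mean(2, 1)  # S = {1}
  v_2     <- b0 + b[1] * cond_mean(1, 2) + b[2] * x[2]  # S = {2}
  v_12    <- b0 + sum(b * x)                            # S = {1, 2}

  # Average marginal contribution over the two feature orderings
  c(phi0 = v_empty,
    phi1 = ((v_1 - v_empty) + (v_12 - v_2)) / 2,
    phi2 = ((v_2 - v_empty) + (v_12 - v_1)) / 2)
}

# Example: correlation 0.7, coefficients assumed known
Sigma <- matrix(c(1, 0.7, 0.7, 1), 2, 2)
shapley_linear_gaussian(x = c(1, 0.5), beta = c(0, 1, 2),
                        mu = c(0, 0), Sigma = Sigma)
```

Note that $\phi_1 + \phi_2 = v(\{1,2\}) - v(\emptyset)$, so the contributions sum to the difference between the full prediction and the mean prediction, as they should.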

@Xschwartz16 (Author)

Hello!

Thank you for all your help. I'm still trying to understand the best way to code Shapley values from a conceptual standpoint.

Say I have $x_1$ and $x_2$ drawn from a multivariate normal distribution with some amount of correlation, and suppose $y$ follows the model

$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i$

From a Shapley standpoint we have

$m_1 = y \sim 1$
$m_2 = y \sim x_1$
$m_3 = y \sim x_2$
$m_4 = y \sim x_1 + x_2$

If we define $f_j$ to be the prediction of $y_i$ generated by model $m_j$, then the Shapley value of $x_1$ is $\frac{(f_2 - f_1)+(f_4-f_3)}{2}$, i.e. the average of the differences $(f_2 - f_1)$ and $(f_4 - f_3)$.
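In code, what I am doing is roughly this (a minimal sketch with made-up data and coefficients):

```r
# Simulate correlated Gaussian features and a linear response
set.seed(1)
n <- 1000
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)           # correlation 0.5
X <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
dat <- data.frame(x1 = X[, 1], x2 = X[, 2])
dat$y <- 1 * dat$x1 + 2 * dat$x2 + rnorm(n)

# One refitted model per feature subset: m_1, ..., m_4
m <- list(lm(y ~ 1, dat), lm(y ~ x1, dat), lm(y ~ x2, dat), lm(y ~ x1 + x2, dat))

# f_1, ..., f_4: predictions for the observation being explained
f <- sapply(m, predict, newdata = dat[1, ])

phi1 <- ((f[2] - f[1]) + (f[4] - f[3])) / 2
phi2 <- ((f[3] - f[1]) + (f[4] - f[2])) / 2
```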

However, the values generated by this method are slightly different from the values generated by the shapr and iml packages when the correlation is not 0 (they match when the correlation is 0). Is there any chance you could point me in the right direction as to where I am going wrong?

Thank you!

@Xschwartz16 (Author)

Just wanted to see if you have had a chance to look at this. Thank you so much!

@martinju (Member)

Hi, sorry, totally forgot about this.

What you are describing above is Shapley regression values, which retrain the model on every subset of the features and predict with those submodels to mimic feature removal. We (and essentially everyone else doing Shapley-value-based prediction explanation) don't retrain models; instead, we use the expected prediction conditional on different subsets of the features.

If you want more information, I suggest looking up "Shapley regression values" (I believe the term is used both for global R-squared decomposition and local prediction explanation), and revisiting our paper in view of that :-)
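If it helps, the conditional approach can be checked numerically with shapr on simulated data like yours; a sketch assuming the pre-1.0 shapr API (check the documentation of the version you have installed):

```r
library(shapr)
library(MASS)

# Simulate correlated Gaussian features and fit the full linear model
set.seed(1)
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
X <- mvrnorm(1000, mu = c(0, 0), Sigma = Sigma)
dat <- data.frame(x1 = X[, 1], x2 = X[, 2])
dat$y <- dat$x1 + 2 * dat$x2 + rnorm(1000)
model <- lm(y ~ x1 + x2, data = dat)

# Gaussian conditional expectations for v(S) = E[f(x) | x_S]
explainer <- shapr(dat[, c("x1", "x2")], model)
explanation <- explain(
  dat[1, c("x1", "x2")],
  explainer = explainer,
  approach = "gaussian",
  prediction_zero = mean(dat$y)
)
print(explanation)
```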
