
TreeSHAP is not exact #142

Closed

gorkemozkaya opened this issue Oct 25, 2019 · 15 comments

Comments

@gorkemozkaya commented Oct 25, 2019

In several places, the book suggests that the TreeSHAP algorithm provides exact calculation of the Shapley values:

"Lundberg et. al (2018)46 proposed TreeSHAP, a variant of SHAP for tree-based machine learning models such as decision trees, random forests and gradient boosted trees. TreeSHAP is fast, computes exact Shapley values, and correctly estimates the Shapley values when features are dependent. In comparison, KernelSHAP is expensive to compute and only approximates the actual Shapley values."

"Sampling from the marginal distribution means ignoring the dependence structure between present and absent features. KernelSHAP therefore suffers from the same problem as all permutation-based interpretation methods. The estimation puts too much weight on unlikely instances. Results can become unreliable. As we will see later, TreeSHAP for tree-ensembles is not affected by this issue."

I think this is not accurate: TreeSHAP suffers from the same issue when there are statistical dependencies between features. I tried to demonstrate this in the notebook below:

https://colab.research.google.com/drive/14wsr-dGOb_suPsj8mtxOLn5zcsi39LHk

In this notebook, I created a dataset and fitted a single-tree XGBoost model. Both the dataset and the model are completely symmetric with respect to the two features: if you simply swap the feature names, neither the joint distribution nor the model prediction changes. So the Shapley importance values for these two features must be equal. However, TreeSHAP fails to give equal values for the two features. I explain the reason at the end of the notebook.
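For anyone who wants to poke at this without the Colab, here is a rough sketch of the shape of such an experiment. The data-generating process, hyperparameters, and model below are my own assumptions and are not symmetric in the precise way the notebook's construction is, so treat it only as a starting point:

```python
import numpy as np
import xgboost as xgb
import shap

rng = np.random.default_rng(0)
n = 10_000

# Two strongly dependent features built from a shared latent variable z;
# swapping x1 and x2 leaves the joint data distribution unchanged.
z = rng.binomial(1, 0.5, size=n)
x1 = z + rng.normal(0, 0.1, size=n)
x2 = z + rng.normal(0, 0.1, size=n)
X = np.column_stack([x1, x2])
y = x1 + x2  # target symmetric in the two features

# A single shallow tree, as in the notebook.
model = xgb.XGBRegressor(n_estimators=1, max_depth=2).fit(X, y)

# Path-dependent TreeSHAP (the variant discussed in this issue).
explainer = shap.TreeExplainer(model, feature_perturbation="tree_path_dependent")
shap_values = explainer.shap_values(X)

# If the explanation respected the symmetry of the problem, the mean |SHAP|
# of the two features would be (nearly) equal; here they can come out unequal,
# with most of the credit going to whichever feature the root split uses.
print(np.abs(shap_values).mean(axis=0))
```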

@NicolasHug

@gorkemozkaya, I think you are quite correct.

As far as I can tell, the mistake originally comes from the paper Lundberg et al. (2018). Indeed, Algorithm 1 from that paper is exactly the algorithm to compute partial dependence, as originally described in [1].

By definition, the partial dependence of f on X_S is E_{X_C}[f(x_S, X_C)], where X_C is the complement of X_S. It is only equivalent to the conditional expectation E[f(X) | X_S = x_S] if the features are independent.

The TreeSHAP algorithm uses E_{X_C}[f(x_S, X_C)], i.e. the partial dependence, so it suffers from exactly the same kinds of issues that partial dependence interpretation has.
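To make the two quantities explicit (my notation, expanding on the comment above):

```latex
% Partial dependence (marginal / interventional expectation):
f_S(x_S) \;=\; \mathbb{E}_{X_C}\bigl[f(x_S, X_C)\bigr]
        \;=\; \int f(x_S, x_C)\, p(x_C)\, \mathrm{d}x_C

% Conditional expectation, as in the original SHAP formulation:
\mathbb{E}\bigl[f(X) \mid X_S = x_S\bigr]
        \;=\; \int f(x_S, x_C)\, p(x_C \mid x_S)\, \mathrm{d}x_C
```

The two coincide exactly when p(x_C | x_S) = p(x_C), i.e. when X_S and X_C are independent.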

CC @slundberg, if I may.

[1] Jerome H. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine." https://statweb.stanford.edu/~jhf/ftp/trebst.pdf

@slundberg

@NicolasHug @gorkemozkaya great points. There are some subtle issues here that I am working on documenting better. The original SHAP paper proposed pure conditional expectations for measuring the value of a set of input features, and then proposed using the Shapley values to reduce this exponential number of values down to a single number for each feature.

To make things more tractable we can assume feature independence. This is of course never true in practice, and so may seem like a terrible approximation. But it turns out that you can look at this assumption from a very different perspective, where you break feature dependence not because of an independence assumption, but because of arguments based on causal inference. I unfortunately did not include this discussion in the final NIPS paper; a recent (more in-depth) summary is at: https://arxiv.org/abs/1910.13413

SHAP as it stands today uses the interventional (causal) method of feature perturbation by default. I'll try and get the docs for this wrapped up sooner rather than later.

@amueller

Thanks! Just talked to Harsha about this, will definitely look at the paper!

@slundberg

@amueller I also have a longer post on this at shap/shap#882

@christophM
Owner

Very interesting discussions! Thanks a lot to everyone contributing.

The Janzing et al. paper is on my list. I will read it and update the SHAP chapter accordingly.

This issue goes very deep and concerns all permutation-based interpretation methods. When features are dependent, it seems like both options have issues: the interventional approach creates new data points that might lie far outside the data distribution (where the machine learning model might not behave nicely), while the conditional approach entangles the effects of the dependent features.

@slundberg

That's right. I view it as a fundamental trade-off between being "true to the data" and never providing inputs that are off-manifold, and being "true to the model" and never letting credit bleed between correlated features. I am working on docs/discussion about this (perhaps it should be a short arXiv note), but it is impossible to be both true to the model and true to the data when features are arbitrarily correlated and you want to allocate credit among each input feature. To see this, imagine two perfectly correlated input features where the model depends on only one of them. There is no way to know which feature the model uses (when perturbing inputs and observing outputs) without asking what the model would do "off the data manifold".
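A toy version of that thought experiment (the data and model function below are my own illustration, not code from the shap repository):

```python
import numpy as np
import shap

rng = np.random.default_rng(0)
n = 1_000

# Two perfectly correlated features; the model only ever reads the first one.
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1.copy()])

def f(X):
    return X[:, 0]  # the second feature is ignored entirely

# Interventional / marginal perturbation ("true to the model"): replacing
# absent features with background values breaks the correlation, so every
# hybrid point with x1 != x2 lies off the data manifold.
explainer = shap.KernelExplainer(f, shap.sample(X, 100))
phi = explainer.shap_values(X[:5], nsamples=200)

# Credit concentrates entirely on feature 0, because toggling feature 1
# never changes the model output.
print(np.abs(phi).mean(axis=0))
```

A "true to the data" (conditional) approach would instead split the credit between the two features, since on the data manifold they are indistinguishable.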

@amueller

Very interesting indeed. I'm not sure if this is addressed in your reference (skimming, I didn't see it), but one other point we were concerned about was the mismatch between the leaf-based and the brute-force methods. The mismatch is particularly bad when correlated features are present, but my understanding is that even for perfectly independent features there is a discrepancy.

@NicolasHug

I built an example of the discrepancy that @amueller is talking about here (at the end).

While that example deals with the computation of partial dependence functions, I think it is still relevant for SHAP vs TreeSHAP (since the same methods are used).
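For anyone who wants to reproduce that kind of gap directly, a minimal sketch along similar lines using scikit-learn's two partial dependence methods (toy data of my own, not the example from the linked comment):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(0)
n = 2_000

# Correlated features make the leaf-based vs. brute-force gap easier to see.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.3, size=n)
X = np.column_stack([x1, x2])
y = x1 * x2

model = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X, y)

# 'recursion' walks each tree using node weights (Friedman's algorithm);
# 'brute' averages model predictions over the data with feature 0 fixed.
pd_rec = partial_dependence(model, X, [0], method="recursion")
pd_brute = partial_dependence(model, X, [0], method="brute")

rec = pd_rec["average"].ravel()
brute = pd_brute["average"].ravel()

# Center both curves before comparing: the recursion method ignores the
# model's constant init term, which only shifts the curve vertically.
print(np.abs((rec - rec.mean()) - (brute - brute.mean())).max())
```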

@slundberg

@amueller @NicolasHug There is indeed a difference between the output of

`TreeExplainer(model, prior, feature_perturbation="tree_path_dependent").shap_values(X)`

vs.

`KernelExplainer(model.predict, prior).shap_values(X)`

because of reasons discussed in the link @NicolasHug shared.

However there is no disagreement between

`TreeExplainer(model, prior, feature_perturbation="interventional").shap_values(X)`

vs.

`KernelExplainer(model.predict, prior).shap_values(X, nsamples=Inf)`
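A rough way to check that agreement numerically (a sketch on toy data of my own; `nsamples=Inf` is approximated by letting KernelExplainer enumerate all coalitions of a small feature set, so the match is only up to small numerical differences):

```python
import numpy as np
import xgboost as xgb
import shap

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] + 2 * X[:, 1] * X[:, 2]
model = xgb.XGBRegressor(n_estimators=50, max_depth=3).fit(X, y)

background = X[:50]  # plays the role of `prior` above

tree_phi = shap.TreeExplainer(
    model, background, feature_perturbation="interventional"
).shap_values(X[:20])

# With only 3 features, 2000 samples is enough for KernelExplainer to
# enumerate every coalition exactly against the same background.
kernel_phi = shap.KernelExplainer(model.predict, background).shap_values(
    X[:20], nsamples=2000
)

# The two attributions should agree up to small numerical differences.
print(np.abs(tree_phi - kernel_phi).max())
```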

@NicolasHug

Do you confirm that tree_path_dependent will also disagree with `KernelExplainer(model.predict, prior).shap_values(X, nsamples=Inf)`?

Could you please describe what `TreeExplainer(interventional)` does, or provide a link?

@amueller

@NicolasHug that's described in the papers linked above, I think.

@NicolasHug

I don't think so?

arxiv.org/abs/1910.13413 is about saying that we want marginal expectations (i.e. partial dependence) instead of the conditional expectations claimed in the original SHAP papers. It also explains that the conditional expectations were in fact approximated with marginal expectations in the first place.

What I don't understand is how TreeExplainer(interventional) differs from tree_path_dependent, since tree_path_dependent uses partial dependence, hence marginal expectations, hence interventional probabilities.

@slundberg

@NicolasHug Yes, that is also different. TreeExplainer(interventional) is best described by our Nature MI paper, which is in final proof. If you email me, though, I can share an early PDF directly with you.

I would say tree_path_dependent does not really use true partial dependence, but it is the best you can do if you don't have access to a background dataset and only get to use the data recorded in the model via coverage. tree_path_dependent conditions its expectation on the values observed higher in the tree, while a true partial dependence plot has no such conditioning.
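A compact sketch of that distinction for a single tree (the node structure and helper functions below are hypothetical, written for illustration; they are not the shap or XGBoost internals):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

@dataclass
class Node:
    value: Optional[float] = None       # set on leaves
    feature: Optional[int] = None       # set on internal nodes
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    cover: float = 0.0                  # training rows that reached this node

def path_dependent_expectation(node: Node, x: Sequence[float], active: set) -> float:
    """Coverage-based traversal: follow the split when the feature is in the
    coalition, otherwise average the children weighted by their covers.
    Deeper covers only count rows that satisfied the splits above, so the
    result is conditioned on the path rather than being a true marginal."""
    if node.value is not None:
        return node.value
    if node.feature in active:
        child = node.left if x[node.feature] <= node.threshold else node.right
        return path_dependent_expectation(child, x, active)
    w = node.left.cover / node.cover
    return (w * path_dependent_expectation(node.left, x, active)
            + (1.0 - w) * path_dependent_expectation(node.right, x, active))

def interventional_expectation(predict: Callable[[List[float]], float],
                               x: Sequence[float], active: set,
                               background: Sequence[Sequence[float]]) -> float:
    """True marginal / interventional expectation: average the model output
    over background rows with the coalition's features overwritten by x."""
    total = 0.0
    for b in background:
        hybrid = list(b)
        for j in active:
            hybrid[j] = x[j]
        total += predict(hybrid)
    return total / len(background)
```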

@pbarcelo commented Dec 9, 2020

I would like to point out that we have recently provided a polynomial-time algorithm that computes the SHAP scores over decision trees, and over some more general models, here: https://arxiv.org/pdf/2007.14045.pdf.

@guyvdbroeck

People interested in this issue may also want to read https://arxiv.org/abs/2009.08634, which proves that the problem TreeSHAP tries to solve is #P-hard.
