Explain Predictions: Small discrepancies between base value and average probability #32

Closed
ajdapretnar opened this issue Jun 18, 2021 · 5 comments

Comments

@ajdapretnar
Contributor

Assuming I understand the widget and SHAP values correctly, the "Base value" should be the average probability for the given class value. In reality, this is not the case.

Example. I use heart-disease data and Logistic Regression. Then I predict the first instance in the data set. This is the result of Explain Predictions.

Screen Shot 2021-06-18 at 16 53 35

Explain Predictions reports a base value of 0.47. To check this, I use the same model with Predictions, pass the same "Background Data", and use Box Plot to observe the predicted probabilities for the given class value, in this case Logistic Regression (1). This is the result.

Screen Shot 2021-06-18 at 16 53 47

The mean of Logistic Regression predictions in Box Plot is 0.458. Explain Predictions reports base value as 0.474. Why the difference?

Versions:
shap==0.37.0
Shapely==1.7.1

@matejklemen

You are correct that the base value should be the average (or rather, expected) probability. SHAP, however, does not compute this on all of the reference data (because that could be very slow); instead, it computes it on the centroids of the clustered reference data. TL;DR: the base value is an approximation.

In your concrete case, you provide 303 instances as original reference data, which get clustered into 10 clusters (k=10 set in _explain_other_models). The 10 centroids are passed through the model and the expected value is calculated based on these predictions, weighted by the proportion of original reference instances which fall into each cluster.
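The effect can be reproduced without Orange or SHAP. The following is a minimal, self-contained sketch, not the code either library runs: `predict_proba` is a hypothetical stand-in for a fitted logistic regression on a single feature, the 303 background values are synthetic, and `kmeans_1d` is a plain Lloyd's algorithm written only for this illustration. It compares the exact mean prediction over all background instances with the mean over 10 centroids weighted by cluster size.

```python
import math
import random


def predict_proba(x):
    """Hypothetical stand-in for a fitted one-feature logistic regression."""
    return 1.0 / (1.0 + math.exp(-(1.5 * x - 0.5)))


def assign(values, centroids):
    """Group each value with its nearest centroid."""
    clusters = [[] for _ in centroids]
    for v in values:
        nearest = min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
        clusters[nearest].append(v)
    return clusters


def kmeans_1d(values, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm on scalars; returns (centroids, cluster sizes)."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)
    for _ in range(iterations):
        clusters = assign(values, centroids)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
    sizes = [len(c) for c in assign(values, centroids)]
    return centroids, sizes


# Synthetic background data: 303 instances, as in the example above.
rng = random.Random(42)
background = [rng.gauss(0.0, 2.0) for _ in range(303)]

# Exact base value: average prediction over every background instance.
exact_base = sum(predict_proba(x) for x in background) / len(background)

# Approximate base value: predictions on 10 centroids, weighted by the
# proportion of background instances that fall into each cluster.
centroids, sizes = kmeans_1d(background, k=10)
weights = [s / len(background) for s in sizes]
approx_base = sum(w * predict_proba(c) for w, c in zip(weights, centroids))

print(f"exact base value:       {exact_base:.3f}")
print(f"approximate base value: {approx_base:.3f}")
```

Because the model is nonlinear, the prediction at a centroid is generally not the mean prediction of the points in that cluster, so the two numbers tend to be close but not identical.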

@PrimozGodec
Collaborator

As @matejklemen said, it is an approximation that is in all cases very close to the actual base value. One solution would be to calculate our own base value (make predictions on the reference data and average them), but I do not know if that is really necessary.

@ajdapretnar, do you have a case where an exact base value would be necessary? Maybe we can just add a note to the documentation that it is an approximation.

@ajdapretnar
Contributor Author

@PrimozGodec My naive interpretation is the class distribution (e.g. 33% for iris setosa) and the mean (e.g. 22.533 for housing).

But I assume this is not what is meant by "base value".

@Btibert3

I was about to log this and see that there is an active discussion. I replicated the example code from the Python library shap and then exported the data to Orange. There is a subtle difference in the base value reported in Orange, as discussed above, but it is spot on when I use only the Python library.

@PrimozGodec
Collaborator

Closing since it is expected behaviour.
