Explain Predictions: Small discrepancies between base value and average probability #32
Comments
You are correct that the base value should be the average (or rather, the expected) probability. SHAP, however, does not compute this on all of the reference data (because that could be very slow); instead, it computes it on the centroids of the clustered reference data. So, TL;DR: the base value is an approximation. In your concrete case, you provide 303 instances as the original reference data, which get clustered into 10 clusters (k=10 set in …
As @matejklemen said, it is an approximation that is in all cases very close to the actual base value. One solution would be to calculate our own base value (make predictions and then average them), but I do not know whether that is really necessary. @ajdapretnar, do you have a case where an exact base value would be needed? Maybe we can just note in the documentation that it is an approximation.
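To illustrate the two comments above, here is a minimal sketch that compares the exact average prediction with a base value computed only on k-means centroids. It is not Orange's or SHAP's actual code: the data are synthetic, the logistic model's weights are made up, the k-means routine is a plain hand-written one, and weighting each centroid by its cluster size is an assumption about how the centroid summary is used. Only the 303 instances and k=10 come from the discussion.

```python
import math
import random


def predict_proba(x, weights, bias):
    """Probability from a hypothetical logistic model (weights are made up)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )


def kmeans(points, k, iterations=25, seed=0):
    """Plain Lloyd's k-means; returns centroids and the size of each cluster."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    # Final assignment so the sizes match the returned centroids.
    sizes = [0] * k
    for p in points:
        sizes[nearest(p, centroids)] += 1
    return centroids, sizes


# Synthetic stand-in for the background data: 303 instances, 2 features.
rng = random.Random(42)
background = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(303)]
weights, bias = (2.0, -1.0), -0.3  # hypothetical model parameters

# "Our own" base value: predict on every instance, then average.
exact = sum(predict_proba(x, weights, bias) for x in background) / len(background)

# Approximation: predict only on 10 centroids, weighted by cluster size (assumed).
centroids, sizes = kmeans(background, k=10)
approx = sum(
    n * predict_proba(c, weights, bias) for c, n in zip(centroids, sizes)
) / sum(sizes)

print(f"exact mean probability:    {exact:.3f}")
print(f"centroid-based base value: {approx:.3f}")
```

Because the model is nonlinear, the prediction at a cluster's centroid is generally not the average of the predictions inside that cluster, so the two numbers tend to differ slightly. That is the same kind of gap as the one reported in this issue.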
@PrimozGodec My naive interpretation is the class distribution (e.g. 33% for iris setosa) for classification and the mean (e.g. 22.533 for housing) for regression. But I assume this is not what is meant by "base value".
I was about to log this and see that there is already an active discussion. I replicated the example code from the Python library shap and then exported the data to Orange. There is a subtle difference in the base value reported in Orange, as discussed above, but the value is spot on when I use only the Python library.
Closing since it is expected behaviour.
Assuming I understand the widget and SHAP values correctly, the "Base value" should be the average probability for the given class value. In reality, this is not the case.
Example. I use heart-disease data and Logistic Regression. Then I predict the first instance in the data set. This is the result of Explain Predictions.
Base value is supposed to be 0.47. I now check this in the Box Plot. I use the same model with Predictions and pass the same "Background Data". Then I use Box Plot to observe the probabilities for a given class value, in this case, the Logistic Regression (1). This is the result.
The mean of Logistic Regression predictions in Box Plot is 0.458. Explain Predictions reports base value as 0.474. Why the difference?
Versions:
shap==0.37.0
Shapely==1.7.1