Result from symmetric shap does NOT match with the SHAP package #12

Open
gtmdotme opened this issue Apr 8, 2020 · 3 comments
gtmdotme commented Apr 8, 2020

I dumped the adult_dataset that you mention in the ReadMe into a CSV, ran a RandomForestClassifier with almost the same settings, and calculated SHAP values with the SHAP package in Python (written by the author of the SHAP paper). I then compared these results with the symmetric counterpart of your library.

  1. The two sets of global Shapley values do not match, even approximately.
  2. Also, I don't understand why, to calculate the global Shapley value, you take the mean of the per-instance Shapley values, while the SHAP paper suggests taking the mean of their absolute values.

Pseudo Code:

import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('adult_dataset.csv')
# encode categorical variables and get features and labels in X, y
X, y = preprocess(df)

model = RandomForestClassifier(max_depth=6, random_state=0, n_estimators=300)
model.fit(X, y)

shap.initjs()
explainer = shap.TreeExplainer(model, data=X)
# for a classifier, shap_values is a list with one array per class
shap_values = explainer.shap_values(X)

# Global Shapley values: mean of absolute per-instance values for the positive class
gsv = np.mean(np.abs(shap_values[1]), axis=0)

As a side note, TreeExplainer computes exact SHAP values, but your results don't match even those from KernelExplainer.

Thanks in advance.

@gtmdotme gtmdotme changed the title Result from symmetric shap does not match with the SHAP package Result from symmetric shap does NOT match with the SHAP package Apr 8, 2020
nredell commented Apr 8, 2020

I think I know what's going on here. I'll give it a go soon.

gtmdotme commented Apr 8, 2020

Reading the paper on Asymmetric Shapley Values again, I realised that the two papers define global Shapley values differently.

  1. The asymmetric Shapley paper takes the expectation (or, equivalently, the mean) of the individual Shapley values.
  2. The symmetric SHAP paper takes the mean of the absolute Shapley values.

But under both of these definitions, the Shapley values still vary a lot between the two implementations of symmetric SHAP (the one your repo implements and the one from the SHAP package).
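To make the difference between the two aggregation rules concrete, here is a minimal sketch with toy numbers (not taken from either package) applied to the same matrix of per-instance Shapley values:

```python
import numpy as np

# Toy matrix of per-instance Shapley values: rows = instances, cols = features.
# Illustrative numbers only -- not produced by either library.
shap_matrix = np.array([
    [ 0.3, -0.2],
    [-0.3,  0.4],
])

# Rule 1 (asymmetric Shapley paper): mean of the raw values.
# Positive and negative contributions can cancel each other out.
global_mean = shap_matrix.mean(axis=0)              # -> [0.0, 0.1]

# Rule 2 (SHAP package convention): mean of the absolute values.
# Measures the magnitude of influence regardless of sign.
global_mean_abs = np.abs(shap_matrix).mean(axis=0)  # -> [0.3, 0.3]
```

Note that the first feature looks completely unimportant under rule 1 but tied for most important under rule 2, so the two definitions can rank features very differently.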

nredell commented Apr 8, 2020

Alright. Disclaimer: I've been putting my open source dev time into other projects lately. This package is still experimental, but I plan on revisiting it in a dedicated way in a couple of weeks.

The first check I did is below. If you run the code from this vignette and then the code below, you'll get this plot. I've connected the same explained instances with a black line to get a better sense of the separation. The agreement is fairly strong. I'm trusting that the good folks at catboost have a solid implementation of TreeSHAP. The comparison is in log-odds space, however... (continued below the image).

# Keep only numeric features so their values can be plotted on a continuous axis.
data_plot <- data_all[!data_all$feature_name %in% names(cat_features), ]

# Stack the shapFlex and catboost Shapley values into long format for plotting.
data_plot <- tidyr::pivot_longer(data_plot, cols = c("shap_effect", "shap_effect_catboost"),
                                 names_to = "algorithm", values_to = "shap_effect")

data_plot$feature_value <- as.numeric(as.character(data_plot$feature_value))

# One point per (instance, algorithm); a black line connects the two algorithms'
# Shapley values for the same explained instance.
p <- ggplot(data_plot, aes(feature_value, shap_effect, color = algorithm, group = index))
p <- p + geom_point(alpha = .25)
p <- p + geom_line(color = "black")
p <- p + facet_wrap(~ feature_name, scales = "free")
p <- p + theme_bw() + xlab("Feature values") + ylab("Shapley values") +
  theme(axis.title = element_text(face = "bold"), legend.position = "bottom") + labs(color = NULL)
p

[Plot: algorithm_comparison. shapFlex vs. catboost Shapley values by feature value, faceted by feature.]

shap is an awesome and much more fully featured package (I'm going to be more focused on causality in shapFlex when I get back to it). It does several things behind the scenes; notably, there is a special constraint algorithm that converts the log-odds space to the probability space while keeping the additivity property of Shapley values, so the feature-level Shapley values still sum to the model's output. I don't have this correction anywhere, which is more of a problem for classification than for regression. This is likely the main difference.

A less important difference is that Shapley value calculations are model dependent... the Random Forest implementations here differ a fair bit in the details. Still, I would expect the Shapley values to be highly correlated. I'll produce the same plot as above, but in probability space, to see where things may be off.
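To illustrate why the log-odds vs. probability distinction matters, here is a minimal sketch (hypothetical numbers; this is not shap's actual constraint algorithm) showing that additivity, which holds exactly in log-odds space, is not preserved by naively pushing each piece through the sigmoid:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical values in log-odds space, where additivity holds by construction:
# prediction = base value + sum of per-feature Shapley values.
base_log_odds = -1.0
shap_log_odds = np.array([0.5, 1.0, -0.3])
prediction_log_odds = base_log_odds + shap_log_odds.sum()

# The true predicted probability comes from transforming the full sum once.
prob_from_sum = sigmoid(prediction_log_odds)

# Naively transforming each feature's marginal effect separately and re-adding
# them does NOT recover the same probability -- the sigmoid is nonlinear.
naive_prob = sigmoid(base_log_odds) + np.sum(
    sigmoid(base_log_odds + shap_log_odds) - sigmoid(base_log_odds)
)

print(prob_from_sum, naive_prob)  # the two values disagree
```

This is why a dedicated rescaling step is needed to report additive Shapley values in probability space; without one, classification outputs from the two libraries can differ even when the underlying log-odds attributions agree.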
