Question. How good is my surrogate model? #502

Open

SamiurRahman1 opened this issue Feb 10, 2022 · 8 comments

@SamiurRahman1

Hi, I have seen that there is a function to calculate the R² score of the surrogate model. I was wondering, are there any other simple metrics implemented to measure how good the surrogate model is?

Thanks

@imatiach-msft
Collaborator

imatiach-msft commented Feb 10, 2022

Hi @SamiurRahman1, the score can be computed via get_surrogate_model_replication_measure, which was just made public as part of resolving this issue:
#452
and PR:
#495
We currently don't have other metrics, but I think it may be possible to add more. Note that this is just a measure of how well the surrogate model fits the teacher model; it doesn't tell you how accurate the explanations themselves are - and in this case they are just approximations. Can you talk a bit more about your use case? If you require the model to be interpretable, and the explanations can't be approximations, then you may want to consider using a glassbox model.
You may also want to consider using permutation feature importance instead, which permutes columns one at a time on a trained model (there is another variant that retrains the model, which is not implemented in this repository) and assigns each feature's importance based on how much a chosen metric changes for the permuted column. Note that this method can assign misleading importances if there are highly correlated features. It is also slower than the mimic explainer and isn't really feasible if you have high-dimensional data, including sparse data. Hope that info helps.
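A minimal sketch of the permutation feature importance technique described above, using scikit-learn's permutation_importance as one concrete implementation; the dataset, model, and metric here are placeholders and not part of the original discussion:

    # Permutation feature importance on an already-trained model.
    # Columns are permuted one at a time on held-out data; a feature's importance
    # is how much the chosen metric (accuracy here) degrades when it is permuted.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    result = permutation_importance(model, X_test, y_test,
                                    scoring="accuracy", n_repeats=10, random_state=0)
    ranked = sorted(zip(X.columns, result.importances_mean, result.importances_std),
                    key=lambda t: t[1], reverse=True)
    for name, mean, std in ranked[:5]:
        print(f"{name}: {mean:.4f} +/- {std:.4f}")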

@imatiach-msft
Collaborator

An amazing free book on interpretability has a great chapter on global surrogate models:
https://christophm.github.io/interpretable-ml-book/global.html
I think the sections on advantages and disadvantages summarize this method very well.
Note it doesn't mention any other metrics besides R-squared, but I think we could add a lot of other metrics.

@imatiach-msft
Collaborator

imatiach-msft commented Feb 10, 2022

Note that we currently use the accuracy metric for classification and R² for regression:

    def get_surrogate_model_replication_measure(self, training_data):
        """Return the metric which tells how well the surrogate model replicates the teacher model.
        For classification scenarios, this function will return accuracy. For regression scenarios,
        this function will return r2_score.
        :param training_data: The data for getting the replication metric.
        :type training_data: numpy.ndarray or pandas.DataFrame or scipy.sparse.csr_matrix
        :return: Metric that tells how well the surrogate model replicates the behavior of teacher model.
        :rtype: float
        """

but I think a lot of other metrics could be added. I think it might even be interesting to run the surrogate model through error analysis where the "true" labels are actually the predicted labels from the teacher model to see where the surrogate model is making errors. You can find the ErrorAnalysisDashboard here:
https://github.com/microsoft/responsible-ai-toolbox
with a lot of notebook examples here:
https://github.com/microsoft/responsible-ai-toolbox/tree/main/notebooks/individual-dashboards/erroranalysis-dashboard
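A rough usage sketch of the replication measure discussed in this comment, assuming the interpret-community MimicExplainer API; the exact import paths and arguments may differ between versions, and the dataset and teacher model below are placeholders:

    # Fit a surrogate via MimicExplainer, then ask how well it replicates the
    # teacher model. Import paths and constructor arguments are assumptions
    # and may vary by interpret-community version.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from interpret_community.mimic.mimic_explainer import MimicExplainer
    from interpret_community.mimic.models import LGBMExplainableModel

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    teacher = GradientBoostingClassifier(random_state=0).fit(X, y)

    explainer = MimicExplainer(teacher, X, LGBMExplainableModel)

    # Accuracy for classification, R^2 for regression, per the docstring above.
    score = explainer.get_surrogate_model_replication_measure(X)
    print(f"surrogate replication measure: {score:.3f}")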

@SamiurRahman1
Author

Thanks for your explanations. I might have formulated my question wrongly. Yes, I would like to understand or measure how well my surrogate model fits or represents my teacher model. I have read several research papers about different metrics like stability, robustness, and efficiency, but I consider those more advanced metrics. Hence I was looking for other lightweight metrics like R².

I have read the book that you mentioned and found it very informative and useful.
I have also used the currently available get_surrogate_model_replication_measure function. Thanks for suggesting the ErrorAnalysisDashboard; I will look into it.

My use case: I am trying to experiment with whether the global interpretation differs when we use interpreters that depend on local interpreters (we get results by aggregating them) versus interpreters that don't depend on local interpreters (permutation feature importance). And if we get different lists of important features from the two scenarios, I would like to use different metrics to measure which surrogate model fits the teacher model better.
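One lightweight way to quantify the comparison described in this use case is to check how strongly the two global importance rankings agree. The helper below is a hypothetical illustration; the function name, the toy importance values, and the top-k overlap measure are not from the original thread:

    # Hypothetical helper: compare two global importance vectors, e.g. one from
    # aggregated local explanations and one from permutation feature importance.
    import numpy as np
    from scipy.stats import spearmanr

    def compare_rankings(imp_a, imp_b, feature_names, k=3):
        rho, _ = spearmanr(imp_a, imp_b)                      # rank correlation
        top_a = {feature_names[i] for i in np.argsort(imp_a)[::-1][:k]}
        top_b = {feature_names[i] for i in np.argsort(imp_b)[::-1][:k]}
        overlap = len(top_a & top_b) / k                      # shared top-k fraction
        return rho, overlap

    # Toy example with made-up importance values.
    names = ["f0", "f1", "f2", "f3", "f4", "f5"]
    agg_local = np.array([0.30, 0.25, 0.15, 0.10, 0.12, 0.08])  # e.g. mean |SHAP|
    perm_imp = np.array([0.28, 0.10, 0.20, 0.12, 0.18, 0.05])   # e.g. accuracy drop
    rho, overlap = compare_rankings(agg_local, perm_imp, names, k=3)
    print(f"spearman rho = {rho:.2f}, top-3 overlap = {overlap:.2f}")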

@imatiach-msft
Collaborator

"i have read several research papers about different metrics like stability, robustness and efficiency"
interesting, can you point to the papers specifically, maybe some of these could be implemented in this repository? We could create issues to mark these as methods that should be implemented.

"i am trying to experiment whether the global interpretation differs when we use interpreters which are dependent on local interpreters and when we use interpreters which don't depend on local interpreter"
That sounds like really interesting research! I'm very curious to hear what you find.

@SamiurRahman1
Author

SamiurRahman1 commented Feb 10, 2022

Here are a few example papers that talk about different evaluation methods for interpreters. The one I am most interested in is number 2.

  1. https://arxiv.org/abs/1906.02108
  2. https://arxiv.org/abs/2008.05895
  3. https://arxiv.org/abs/1910.02065

@imatiach-msft
Collaborator

imatiach-msft commented Feb 10, 2022

I have a hard time believing the second paper's result that LIME is better than SHAP - perhaps on that dataset, but for LIME you need to set the kernel width parameter, which is very tricky to figure out. If you get it wrong, you can get very bad results. SHAP doesn't have that problem. Also, all of those datasets are too similar; it sounds like none of them have high-dimensional or sparse features. Their results would be much more interesting if they evaluated on a wider range of datasets that vary a lot more.

@SamiurRahman1
Author

I agree with your perspective. :) Also, these papers are not from very good journals, but my main focus was the metrics. I am not worried about their results, rather about the metrics they propose to evaluate different interpreters. :)
