FilomKhash / Tree-based-paper Public

Codes for the paper On marginal feature attributions of tree-based models

BSD-3-Clause license

Explainer
MIC_based_grouping
Retrieve_splits
Synthetic_model_owen
Synthetic_model_shapley
TreeSHAP_sanity_check
models_metrics
EnsembleParser.py
LICENSE
README.md
requirements.txt

Tree-based-paper

This repository contains the code for the third arXiv version of the paper

On marginal feature attributions of tree-based models (https://arxiv.org/abs/2302.08434)

This repository contains the following folders:

TreeSHAP_sanity_check: This folder confirms the computations carried out in Example 3.1 of the paper for path-dependent TreeSHAP.
Synthetic_model_shapley and Synthetic_model_owen: These folders contain the run times recorded in our experiments with synthetic data in Section 4.1, along with Python scripts for recreating the figures in that section.
models_metrics: In Section 4.2, we experiment with four public datasets. For each dataset, we train a triple of models consisting of a LightGBM, a CatBoost and an XGBoost model. The model files are provided in this folder along with the Jupyter notebook r2_score.ipynb, which reproduces the metrics reported in Table 5 of the paper.
Retrieve_splits: The notebook Retrieve_splits.ipynb takes a saved LightGBM, CatBoost or XGBoost model, decomposes it into its constituent trees, and creates a dictionary for each decision tree containing information such as the distinct features appearing in the tree, the tree's depth, and the regions cut out by the tree. This procedure is carried out for the ensembles trained for our experiments in Section 4.2, and the results appear in Table 6. The same process can be repeated for any trained LightGBM, CatBoost or XGBoost model by importing the script EnsembleParser.py.
Explainer and MIC_based_grouping: In Section 4.3, a proprietary implementation of Algorithm 3.12 is used to explain the four CatBoost models previously trained on public datasets. The resulting look-up tables of marginal Shapley values are stored in the Explainer folder. As a sanity check, the notebook explanations.ipynb verifies that the outputs of the algorithm satisfy the efficiency property of Shapley values. In addition, look-up tables of marginal Owen values for these models were generated with proprietary code based on Theorem F.1. They are available in the same folder, and explanations.ipynb checks the efficiency axiom for them as well. The partitions of the features of the public datasets used for computing Owen values are available in the folder MIC_based_grouping. These partitions are obtained from a hierarchical clustering procedure outlined in the notebook grouping.ipynb.
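
The efficiency check mentioned for explanations.ipynb can be illustrated on a toy example. The sketch below is not the proprietary implementation of Algorithm 3.12 and does not read the look-up tables in the Explainer folder. It computes marginal Shapley values by brute force for a small piecewise-constant function and confirms that they sum to f(x) minus the average of f over a background sample. The names marginal_shapley, toy_tree, x and background are hypothetical and chosen only for this illustration.

```python
from itertools import combinations
from math import factorial

import numpy as np


def marginal_shapley(f, x, background):
    """Brute-force marginal Shapley values of f at the point x.

    The game is v(S) = mean over background rows z of f(x_S, z_{-S}),
    i.e. features in S are fixed to x and the rest come from the background.
    The cost is exponential in the number of features, so this is only
    meant for tiny examples.
    """
    n = len(x)

    def value(subset):
        mixed = background.copy()
        cols = np.array(subset, dtype=int)
        mixed[:, cols] = x[cols]  # overwrite the features in S with x
        return f(mixed).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


def toy_tree(X):
    """A hand-written depth-2 decision tree acting on the rows of X (assumption)."""
    return np.where(
        X[:, 0] <= 0.5,
        np.where(X[:, 1] <= 0.0, 1.0, 3.0),
        np.where(X[:, 2] <= 1.0, -2.0, 4.0),
    )


rng = np.random.default_rng(0)
background = rng.normal(size=(200, 3))  # synthetic background sample
x = np.array([0.2, 0.7, -1.0])          # point to explain

phi = marginal_shapley(toy_tree, x, background)

# Efficiency: attributions sum to f(x) minus the mean background prediction.
gap = toy_tree(x[None, :])[0] - toy_tree(background).mean()
print(np.isclose(phi.sum(), gap))
```

The same comparison, the sum of a row of attributions against the prediction minus the baseline, is what an efficiency check on a precomputed look-up table amounts to.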

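
The kind of per-tree dictionary described for Retrieve_splits can be sketched on a toy tree. The nested-dict format and the function summarize_tree below are assumptions made for illustration only. They do not reflect the interface of EnsembleParser.py or the model formats of LightGBM, CatBoost or XGBoost.

```python
def summarize_tree(node):
    """Summarize a tree given as nested dicts (hypothetical format).

    Internal nodes have the keys "feature", "threshold", "left", "right";
    leaves have the single key "value".
    """
    if "value" in node:  # leaf node
        return {"features": set(), "depth": 0, "n_leaves": 1}
    left = summarize_tree(node["left"])
    right = summarize_tree(node["right"])
    return {
        # distinct features the tree splits on
        "features": {node["feature"]} | left["features"] | right["features"],
        # number of splits on the longest root-to-leaf path
        "depth": 1 + max(left["depth"], right["depth"]),
        # each leaf corresponds to one region cut out by the tree
        "n_leaves": left["n_leaves"] + right["n_leaves"],
    }


toy = {
    "feature": 0,
    "threshold": 0.5,
    "left": {
        "feature": 1,
        "threshold": 0.0,
        "left": {"value": 1.0},
        "right": {"value": 3.0},
    },
    "right": {"value": -2.0},
}

summary = summarize_tree(toy)
print(summary)
```

Applying such a function to every tree of an ensemble gives one dictionary per tree, which is the shape of output the notebook description refers to.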