Add interpretability example notebooks #21

jshinm · 2022-05-26T07:10:39Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Add 3 interpretability example notebooks

Iris notebook
Simulation notebook
MNIST notebook

Any other comments?

adam2392

LGTM once the following changes are made:

0. For the simulation notebook: I would remove the Gaussian circles and focus only on sparse parity, since that shows the largest difference. Also remove `max_features=3*n_features`.
1. Move the relevant OF content from notebook/iris_benchmark_OF_vs_RF.ipynb into examples/tree/plot_iris_dtc.py.
2. For the simulation notebook: add a description of the sparse parity problem according to the reference I linked. Here is a paraphrased summary of what we want to say:

Ref for sparse parity: https://epubs.siam.org/doi/epdf/10.1137/1.9781611974973.56

Sparse parity is a variation of the noisy parity problem, which itself is a multivariate generalization of the noisy XOR problem. This is a binary classification task in high dimensions. 

<describe sparse parity as done in the paper, in layman's terms>

<describe the intuition for why OF would be better than RF>
e.g. OF should be more robust to high-dimensional noise. Moreover, because OF can sample more candidate splits than RF (i.e. `max_features` can be greater than `n_features`), we expect performance to improve when we are willing to spend more computation on sampling splits.

...

3. For MNIST notebook: only show max_features = sqrt and n_features.
Add a section describing the dataset very briefly and then linking to https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html for reference.
Add a section similar to sparse parity talking about the differences between OF and RF. Add multi-class ROC curve.
Add similar reports shown in the existing digits example.
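
To make the sparse parity description in item 2 concrete, here is a minimal sketch of how such data could be generated. It assumes the common formulation in which features are drawn uniformly from [-1, 1] and the label is the parity of the number of positive values among the first few informative features. The function name `make_sparse_parity` and the defaults (20 features, 3 informative) are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np


def make_sparse_parity(n_samples, n_features=20, n_informative=3, seed=0):
    """Generate a sparse parity dataset (hypothetical helper, for illustration).

    Only the first `n_informative` columns determine the label; the remaining
    columns are pure noise.
    """
    rng = np.random.default_rng(seed)
    # Every feature is drawn independently and uniformly from [-1, 1).
    X = rng.uniform(-1.0, 1.0, size=(n_samples, n_features))
    # Label = parity (0 or 1) of the count of positive informative features.
    y = (X[:, :n_informative] > 0).sum(axis=1) % 2
    return X, y


X, y = make_sparse_parity(1000)
print(X.shape, y.mean())
```

No single feature is predictive on its own here, which is why axis-aligned splits tend to struggle on this task.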

Ideally we can try to have this done by Friday so we can show these to sklearn devs at OH on Monday. If you can't have this done by then (I know you have a lot of stuff going on!), please let me know and I can help out so we can have things ready by Monday.

jshinm · 2022-06-14T05:10:05Z

6/13/2022

TODOS:

add abs value plots in addition to delta plots
grid search parameters
- test following parameters
  - n_estimators: range(100, 1000, 100)
  - max_depth: [None, 5, 10, 15, 20]
  - max_features: ['sqrt', 'log2', 1x mtry, 2x mtry]
add robustness test over confusion matrix
performance metric (memory [depth, number of leaves] vs accuracy; protocol-5)
don't show 2x mtry for RF
use stripplot in addition to the box plot (at alpha=0.3-0.4)
- fix double legend (currently legend disabled)
plot delta and abs plots separately
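
The grid search item above could be sketched roughly as follows for the RF side, using scikit-learn's `GridSearchCV`. This is a sketch under assumptions: `None` (all features) stands in for "1x mtry", on the assumption that mtry here means `n_features`; "2x mtry" is left out because it does not apply to RF; and the OF estimator is omitted because it lives in the PR branch. The helper name `make_param_grid` is hypothetical, and the grid that actually runs is reduced so the example finishes in seconds.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def make_param_grid(full=False):
    """Return an RF parameter grid (hypothetical helper, for illustration)."""
    if full:
        # Grid taken from the TODO list; slow to run end to end.
        return {
            "n_estimators": list(range(100, 1000, 100)),
            "max_depth": [None, 5, 10, 15, 20],
            "max_features": ["sqrt", "log2", None],
        }
    # Reduced grid so the sketch runs quickly.
    return {
        "n_estimators": [10, 20],
        "max_depth": [None, 5],
        "max_features": ["sqrt", "log2", None],
    }


X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=make_param_grid(full=False),
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Passing `full=True` switches to the grid from the TODO list at the cost of a much longer run.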

Additional refs from sklearn dev team

adam2392 · 2022-06-29T13:41:59Z

For documentation that will get merged into the PR branch:

https://github.com/scikit-learn/scikit-learn/blob/main/doc/modules/tree.rst: we should modify this to add a section on "Oblique Trees" that summarizes how they differ from regular decision trees, gives high-level intuition on when they would or would not perform better, and notes the trade-offs between fit/score time, classifier size, and performance.
Under examples/ensemble/, we should add a file plot_oblique_axis_aligned_forests.py that compares oblique vs. random forests on a real dataset and perhaps a short version of the sparse-parity simulation. Ideally the entire example runs in under 30 seconds, including RF and OF training. We can subsample the dataset if needed.

For the real datasets, we can use cnae-9, phishing-websites, and wdbc from OpenML, which seemed to show differing performance for OF and RF.

Ideally we can have some intuition on why RF vs OF is better in one of these...
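
As a rough starting point for the trade-off comparison described above (fit/score time and classifier size vs. performance), here is a minimal sketch. It uses scikit-learn's bundled `load_breast_cancer`, which is the Wisconsin Diagnostic Breast Cancer data, as an offline stand-in for fetching wdbc from OpenML. Only RF is included; the oblique forest would be added to the `estimators` dict from the PR branch. The `benchmark` helper and the use of pickle protocol 5 to measure size are assumptions for illustration.

```python
import pickle
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def benchmark(estimator, X_train, X_test, y_train, y_test):
    """Fit one estimator and report accuracy, timings, and pickled size."""
    start = time.perf_counter()
    estimator.fit(X_train, y_train)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    accuracy = estimator.score(X_test, y_test)
    score_time = time.perf_counter() - start

    # Size of the fitted model when serialized, in megabytes.
    size_mb = len(pickle.dumps(estimator, protocol=5)) / 1e6
    return {
        "accuracy": accuracy,
        "fit_time_s": fit_time,
        "score_time_s": score_time,
        "size_mb": size_mb,
    }


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Add the oblique forest estimator here once it is importable.
estimators = {"RF": RandomForestClassifier(n_estimators=100, random_state=0)}
results = {
    name: benchmark(est, X_train, X_test, y_train, y_test)
    for name, est in estimators.items()
}
for name, row in results.items():
    print(name, row)
```

A single train/test split keeps the run short; repeated splits would give more stable timings.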

jshinm added 2 commits May 26, 2022 03:03

grid search parameters

d7b607b

upload iris notebook

3d599b0

jshinm requested a review from adam2392 May 26, 2022 07:10

jshinm self-assigned this May 26, 2022

jshinm added 6 commits May 26, 2022 07:18

add 5000 sample

496969c

add delta plot

6e6f787

add mnist notebook

b3d3aa6

fix delta calculation

c5af203

preserve comprehensive run

6e1187c

optimize runtime

8d45399

adam2392 requested changes Jun 8, 2022

jshinm added 8 commits June 10, 2022 13:38

fix bug and reparameterize

6006426

add roc_auc and confusion matrix

d45d2a8

add 3d visualization

6360f3d

add narratives and descriptions

d5b19f2

add description

64169d8

remove long notebook

ef5d9bc

added oblique trees

faa0c53

remove ovr wrapper

7a23b9b

jshinm added 10 commits June 15, 2022 14:59

add grid search results

83cda7a

remove over feature selection filter for RF

7030191

add plotly io for plot rendering

01826d6

add robustness test

0a016e3

add description and rerun notebook

b18d356

new parameter search

c56f351

change plot style

22f643d

optimize robustness test with new parameters

01136c5

change plot style

dccfdd8

run appendix block

14b172d

jshinm added 5 commits June 29, 2022 00:38

Merge branch 'neurodata:obliquepr' into obliquepr

d3087a5

added score vs performance metrics

5a15c53

uploading pickled dataframe

1b3c991

added refitting function and plot on score vs performance metrics

30b38d3

added simulation run dataframe

e2691a9

jshinm added 6 commits June 29, 2022 14:42

Added binning figure and plotly figures

25a1e98

Added binning figure and changed unit of the size to MB

5e8f1a6

Add sparse parity example under ensemble section

b73915e

Add cc18 example under ensemble section

67e04ef

Use tuned parameters and improve reproducibility

da00a8b

Use selected datasets from cc18 suite and pre-tuned parameters

e30af84
