"Coleman" approach to feature_importances_ #229

Open
adam2392 opened this issue Feb 22, 2024 · 6 comments

Comments

@adam2392 (Collaborator)

The basic idea to get a feature_importances distribution map from the Coleman approach is:

  1. train one forest on X
  2. train another forest on permuted X
  3. compute the per-tree feature importances array, shaped (n_trees, n_features_in), for both forests
  4. apply the "Coleman" idea and resample trees from both forests M times

This gives a null distribution of feature importances that you can compare against the feature_importances_ array of the first forest from step 1.

Code that builds a Coleman forest approach for doing multivariate hypothesis testing:

def build_coleman_forest(
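
A minimal sketch of steps 1–4 (my own illustration, assuming scikit-learn-style forests; the function names and resampling details here are assumptions, not the actual build_coleman_forest signature in sktree):

```python
# Illustrative sketch of steps 1-4, assuming scikit-learn-style forests.
# This is NOT the sktree build_coleman_forest implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def per_tree_importances(forest):
    # One row per tree: impurity-based importances, shape (n_trees, n_features_in)
    return np.vstack([tree.feature_importances_ for tree in forest.estimators_])

def coleman_importance_null(X, y, n_estimators=100, M=1000, seed=0):
    rng = np.random.default_rng(seed)

    # 1. train one forest on X
    obs = RandomForestClassifier(n_estimators=n_estimators, random_state=seed).fit(X, y)
    # 2. train another forest on X with permuted y (breaks the X-y relationship)
    perm = RandomForestClassifier(n_estimators=n_estimators, random_state=seed).fit(
        X, rng.permutation(y)
    )

    # 3. per-tree importances from both forests, pooled together
    pooled = np.vstack([per_tree_importances(obs), per_tree_importances(perm)])

    # 4. "Coleman"-style resampling: M times, draw n_estimators trees at random from
    #    the pooled set and average their importance maps -> null importance maps
    null_maps = np.array([
        pooled[rng.choice(len(pooled), size=n_estimators, replace=False)].mean(axis=0)
        for _ in range(M)
    ])
    # Compare obs.feature_importances_ against null_maps, per feature
    return obs.feature_importances_, null_maps
```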

cc: @jovo @jdey4

@jovo (Member) commented Feb 22, 2024

for step 4, i think we just compute the distribution of feature importance under the null, and then, we can compute a p-value for the importance of each feature under the alternative, right?
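
For concreteness, a per-feature p-value from such a null distribution could look like this (a sketch assuming we already have M null importance maps as above; the helper name is hypothetical, not from the sktree codebase):

```python
import numpy as np

# observed: feature_importances_ of the non-permuted forest, shape (n_features,)
# null_maps: M resampled importance maps under the null, shape (M, n_features)
def per_feature_pvalues(observed, null_maps):
    # Right-tailed permutation p-value per feature, with the usual +1 correction
    M = null_maps.shape[0]
    return (1 + (null_maps >= observed).sum(axis=0)) / (M + 1)
```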

@adam2392 (Collaborator, Author)

Oh I guess the permuted forest technically gives that(?), but I was assuming you wanted like M forests each with a slightly different feature_importances map constructed from a different collection of trees?

@jovo (Member) commented Feb 22, 2024

oh, I thought just 1 null forest. We compute feature_importance for all the features.

We need M forests for the p-value computation for two-sample testing, but I don't think we need more than 1 forest for this?

@jdey4 (Member) commented Feb 29, 2024

@jovo from the steps above as mentioned by @adam2392, I thought we wanted the distribution of the feature_importance score. But if I understood correctly today, you want ranks, right? That is, I get the rank from the permuted forest and the rank from the non-permuted forest, and count the number of times each feature ranks higher in the non-permuted forest than in the permuted one? Should I repeat the process several times, or do you want to subsample after training a random forest with a huge number of trees? I repeated the experiment for several reps because the feature dimension is 1.5 million and there is higher variance in a forest with 100 trees.

@jovo (Member) commented Mar 4, 2024

@jdey4 write pseudocode so we are super clear, then i can quibble anything i don't like.

@jdey4 (Member) commented Mar 10, 2024

Steps:

  1. Consider $n$ iid samples $D_n = (x_i, y_i)_{i=1}^n$ and the permuted-label data $\tilde{D}_n$.
  2. Train 2 random forests ($B$ trees each) on $D_n$ and $\tilde{D}_n$: $RF$ and $RF_0$, respectively.
  3. Consider a specific feature $F_j$. Calculate its rank from $RF$: $r$.
  4. Calculate the rank of $F_j$ from $RF_0$: $r_0$.
  5. Calculate $r_0 - r$.
  6. Now randomly sample $B$ trees from $\{RF, RF_0\}$: $RF^*$, and denote the remaining trees as $RF^*_0$.
  7. Calculate $r^*_0 - r^*$ from $RF^*$ and $RF^*_0$, as in steps 3–5.
  8. Repeat steps 6 and 7 $N_0$ times.
  9. Calculate $p = \frac{1}{N_0+1}\left[1 + \sum I\left((r_0 - r) \leq (r^*_0 - r^*)\right)\right]$.
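
A rough Python sketch of these steps (illustrative only; feature_rank and the array handling are my assumptions, not sktree's API):

```python
# Sketch of the rank-based permutation test above (names are illustrative).
import numpy as np

def feature_rank(importance_maps, j):
    # Rank of feature j by mean importance across trees (1 = most important)
    mean_imp = importance_maps.mean(axis=0)
    return 1 + np.argsort(np.argsort(-mean_imp))[j]

def rank_pvalue(rf_importances, rf0_importances, j, N0=1000, seed=0):
    # rf_importances, rf0_importances: per-tree importance arrays, each (B, n_features)
    rng = np.random.default_rng(seed)
    B = rf_importances.shape[0]

    # Steps 3-5: observed rank difference r_0 - r
    observed = feature_rank(rf0_importances, j) - feature_rank(rf_importances, j)

    # Steps 6-8: pool the 2B trees, resample B of them as RF*, the rest as RF*_0
    pooled = np.vstack([rf_importances, rf0_importances])
    null = np.empty(N0)
    for i in range(N0):
        idx = rng.permutation(2 * B)
        null[i] = feature_rank(pooled[idx[B:]], j) - feature_rank(pooled[idx[:B]], j)

    # Step 9: permutation p-value with the +1 correction
    return (1 + np.sum(observed <= null)) / (N0 + 1)
```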

Thoughts @jovo @adam2392? Should I make a PR somewhere in sktree or first make sure it works in the CMI repo?
