
Quadratic simulation loses power supposedly when using sklearn-fork DecisionTree vs sklearn DecisionTree #171

Open
adam2392 opened this issue Nov 14, 2023 · 1 comment


@adam2392 (Collaborator)

First reported by @sampan501

We then added the following lines of code, which we can probably extend further to the "power" setting, where we actually estimate the power and check whether the performance still differs:

```python
# Imports needed by the snippets below (module paths may differ
# across scikit-tree versions).
import numpy as np
from numpy.testing import assert_allclose
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier as skDecisionTreeClassifier

from sktree import HonestForestClassifier
from sktree.datasets import make_quadratic_classification
from sktree.tree import DecisionTreeClassifier


def test_honest_forest_with_sklearn_trees():
    """Test against regression in power-curves discussed in:
    https://github.com/neurodata/scikit-tree/pull/157."""
    # generate the high-dimensional quadratic data
    X, y = make_quadratic_classification(1024, 4096, noise=True, seed=0)
    y = y.squeeze()
    print(X.shape, y.shape)
    print(np.sum(y) / len(y))

    clf = HonestForestClassifier(
        n_estimators=10, tree_estimator=skDecisionTreeClassifier(), random_state=0
    )
    honestsk_scores = cross_val_score(clf, X, y, cv=5)
    print(honestsk_scores)

    clf = HonestForestClassifier(
        n_estimators=10, tree_estimator=DecisionTreeClassifier(), random_state=0
    )
    honest_scores = cross_val_score(clf, X, y, cv=5)
    print(honest_scores)

    # XXX: surprisingly, when we use the default, which uses the fork DecisionTree,
    # we get different results
    # clf = HonestForestClassifier(n_estimators=10, random_state=0)
    # honest_scores = cross_val_score(clf, X, y, cv=5)
    # print(honest_scores)

    print(honestsk_scores, honest_scores)
    print(np.mean(honestsk_scores), np.mean(honest_scores))
    assert_allclose(np.mean(honestsk_scores), np.mean(honest_scores))


def test_honest_forest_with_sklearn_trees_with_auc():
    """Test against regression in power-curves discussed in:
    https://github.com/neurodata/scikit-tree/pull/157.

    This unit-test tests the equivalent of the AUC using sklearn's DTC
    vs our forked version of sklearn's DTC as the base tree.
    """
    skForest = HonestForestClassifier(
        n_estimators=10, tree_estimator=skDecisionTreeClassifier(), random_state=0
    )
    Forest = HonestForestClassifier(
        n_estimators=10, tree_estimator=DecisionTreeClassifier(), random_state=0
    )

    max_fpr = 0.1
    scores = []
    sk_scores = []
    for idx in range(10):
        X, y = make_quadratic_classification(1024, 4096, noise=True, seed=idx)
        y = y.squeeze()

        skForest.fit(X, y)
        Forest.fit(X, y)

        # compute the partial AUC (up to max_fpr) for each forest
        y_pred_proba = skForest.predict_proba(X)[:, 1].reshape(-1, 1)
        sk_mi = roc_auc_score(y, y_pred_proba, max_fpr=max_fpr)

        y_pred_proba = Forest.predict_proba(X)[:, 1].reshape(-1, 1)
        mi = roc_auc_score(y, y_pred_proba, max_fpr=max_fpr)

        scores.append(mi)
        sk_scores.append(sk_mi)

    print(scores, sk_scores)
    print(np.mean(scores), np.mean(sk_scores))
    print(np.std(scores), np.std(sk_scores))
    assert_allclose(np.mean(sk_scores), np.mean(scores), atol=0.005)


def test_honest_forest_with_sklearn_trees_with_mi():
```
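The "power" extension mentioned above could be sketched roughly as follows: repeat the simulation, test whether cross-validated accuracy beats chance in each repetition, and report the rejection rate. This is only an illustrative sketch, not scikit-tree code; `simulate_quadratic` and `estimate_power` are hypothetical names, and a plain `RandomForestClassifier` stands in for `HonestForestClassifier`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def simulate_quadratic(n, p, seed):
    """Toy stand-in for make_quadratic_classification."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    # label depends quadratically on the first feature, plus noise
    logits = X[:, 0] ** 2 - 1 + rng.normal(scale=0.5, size=n)
    return X, (logits > 0).astype(int)


def estimate_power(n_reps=5, chance=0.5):
    """Power = fraction of repetitions where CV accuracy clearly beats chance."""
    rejections = 0
    for seed in range(n_reps):
        X, y = simulate_quadratic(200, 5, seed)
        clf = RandomForestClassifier(n_estimators=10, random_state=seed)
        scores = cross_val_score(clf, X, y, cv=3)
        # crude one-sided z-test of mean CV accuracy against chance
        sem = scores.std(ddof=1) / np.sqrt(len(scores)) + 1e-12
        if (scores.mean() - chance) / sem > 1.645:  # ~alpha = 0.05, one-sided
            rejections += 1
    return rejections / n_reps


print(estimate_power())
```

Running this once per base tree (fork vs. sklearn) would turn the mean-accuracy comparison above into the power comparison the issue title is about.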

Note: we have adapted the code a bit, as well as how the random_state is set in HonestTreeClassifier.
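For context on the random_state note, here is a minimal sketch of how a meta-estimator typically derives per-tree seeds from a single `random_state` (illustrative only; `ToyForest` is a hypothetical name, not scikit-tree's actual implementation):

```python
import numpy as np
from sklearn.utils import check_random_state


class ToyForest:
    """Illustrative meta-estimator: derives one seed per tree from random_state."""

    def __init__(self, n_estimators=3, random_state=None):
        self.n_estimators = n_estimators
        self.random_state = random_state

    def tree_seeds(self):
        rng = check_random_state(self.random_state)
        # each tree gets an independent, reproducible seed
        return rng.randint(np.iinfo(np.int32).max, size=self.n_estimators)


print(ToyForest(random_state=0).tree_seeds())
print(ToyForest(random_state=0).tree_seeds())  # identical: same random_state
```

If the fork and sklearn trees consume these seeds differently, two otherwise identical runs can diverge, which is one possible source of the discrepancy.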

@adam2392 (Collaborator, Author)

Cross-ref: #164
