
Quadratic simulation loses power supposedly when using sklearn-fork DecisionTree vs sklearn DecisionTree #171

Open
adam2392 opened this issue Nov 14, 2023 · 1 comment


@adam2392 (Collaborator)

First reported by @sampan501

We then added the following lines of code, which we can probably extend further to the "power" setting, where we actually estimate the power and check whether the performance still differs:

```python
# Imports needed by the snippets below (module paths may differ
# across scikit-tree versions).
import numpy as np
from numpy.testing import assert_allclose
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier as skDecisionTreeClassifier

from sktree import HonestForestClassifier
from sktree.datasets import make_quadratic_classification
from sktree.tree import DecisionTreeClassifier


def test_honest_forest_with_sklearn_trees():
    """Test against regression in power-curves discussed in:
    https://github.com/neurodata/scikit-tree/pull/157."""
    # generate the high-dimensional quadratic data
    X, y = make_quadratic_classification(1024, 4096, noise=True, seed=0)
    y = y.squeeze()
    print(X.shape, y.shape)
    print(np.sum(y) / len(y))

    clf = HonestForestClassifier(
        n_estimators=10, tree_estimator=skDecisionTreeClassifier(), random_state=0
    )
    honestsk_scores = cross_val_score(clf, X, y, cv=5)
    print(honestsk_scores)

    clf = HonestForestClassifier(
        n_estimators=10, tree_estimator=DecisionTreeClassifier(), random_state=0
    )
    honest_scores = cross_val_score(clf, X, y, cv=5)
    print(honest_scores)

    # XXX: surprisingly, when we use the default, which uses the fork DecisionTree,
    # we get different results
    # clf = HonestForestClassifier(n_estimators=10, random_state=0)
    # honest_scores = cross_val_score(clf, X, y, cv=5)
    # print(honest_scores)

    print(honestsk_scores, honest_scores)
    print(np.mean(honestsk_scores), np.mean(honest_scores))
    assert_allclose(np.mean(honestsk_scores), np.mean(honest_scores))


def test_honest_forest_with_sklearn_trees_with_auc():
    """Test against regression in power-curves discussed in:
    https://github.com/neurodata/scikit-tree/pull/157.

    This unit-test tests the equivalent of the AUC using sklearn's DTC
    vs our forked version of sklearn's DTC as the base tree.
    """
    skForest = HonestForestClassifier(
        n_estimators=10, tree_estimator=skDecisionTreeClassifier(), random_state=0
    )
    Forest = HonestForestClassifier(
        n_estimators=10, tree_estimator=DecisionTreeClassifier(), random_state=0
    )

    max_fpr = 0.1
    scores = []
    sk_scores = []
    for idx in range(10):
        X, y = make_quadratic_classification(1024, 4096, noise=True, seed=idx)
        y = y.squeeze()

        skForest.fit(X, y)
        Forest.fit(X, y)

        # compute the partial AUC (up to max_fpr) for each forest
        y_pred_proba = skForest.predict_proba(X)[:, 1].reshape(-1, 1)
        sk_mi = roc_auc_score(y, y_pred_proba, max_fpr=max_fpr)

        y_pred_proba = Forest.predict_proba(X)[:, 1].reshape(-1, 1)
        mi = roc_auc_score(y, y_pred_proba, max_fpr=max_fpr)

        scores.append(mi)
        sk_scores.append(sk_mi)

    print(scores, sk_scores)
    print(np.mean(scores), np.mean(sk_scores))
    print(np.std(scores), np.std(sk_scores))
    assert_allclose(np.mean(sk_scores), np.mean(scores), atol=0.005)


def test_honest_forest_with_sklearn_trees_with_mi():
```
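The "power" extension mentioned above could be sketched roughly as follows: repeat the simulation, test whether cross-validated accuracy beats chance in each repetition, and report the rejection rate. This is only an illustrative sketch, not scikit-tree code; `simulate_quadratic` and `estimate_power` are hypothetical names, and a plain `RandomForestClassifier` stands in for `HonestForestClassifier`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def simulate_quadratic(n, p, seed):
    """Toy stand-in for make_quadratic_classification."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    # label depends quadratically on the first feature, plus noise
    logits = X[:, 0] ** 2 - 1 + rng.normal(scale=0.5, size=n)
    return X, (logits > 0).astype(int)


def estimate_power(n_reps=5, chance=0.5):
    """Power = fraction of repetitions where CV accuracy clearly beats chance."""
    rejections = 0
    for seed in range(n_reps):
        X, y = simulate_quadratic(200, 5, seed)
        clf = RandomForestClassifier(n_estimators=10, random_state=seed)
        scores = cross_val_score(clf, X, y, cv=3)
        # crude one-sided z-test of mean CV accuracy against chance
        sem = scores.std(ddof=1) / np.sqrt(len(scores)) + 1e-12
        if (scores.mean() - chance) / sem > 1.645:  # ~alpha = 0.05, one-sided
            rejections += 1
    return rejections / n_reps


print(estimate_power())
```

Running this once per base tree (fork vs. sklearn) would turn the mean-accuracy comparison above into the power comparison the issue title is about.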

Note: we have adapted the code a bit, as well as how the random_state is set in HonestTreeClassifier.
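For context on the random_state note, here is a minimal sketch of how a meta-estimator typically derives per-tree seeds from a single `random_state` (illustrative only; `ToyForest` is a hypothetical name, not scikit-tree's actual implementation):

```python
import numpy as np
from sklearn.utils import check_random_state


class ToyForest:
    """Illustrative meta-estimator: derives one seed per tree from random_state."""

    def __init__(self, n_estimators=3, random_state=None):
        self.n_estimators = n_estimators
        self.random_state = random_state

    def tree_seeds(self):
        rng = check_random_state(self.random_state)
        # each tree gets an independent, reproducible seed
        return rng.randint(np.iinfo(np.int32).max, size=self.n_estimators)


print(ToyForest(random_state=0).tree_seeds())
print(ToyForest(random_state=0).tree_seeds())  # identical: same random_state
```

If the fork and sklearn trees consume these seeds differently, two otherwise identical runs can diverge, which is one possible source of the discrepancy.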

@adam2392 (Collaborator, Author)

Cross-ref: #164
