
Unable to replicate results on diabetes data from paper #129

Open · abhishek-ghose opened this issue Aug 20, 2022 · 6 comments

abhishek-ghose commented Aug 20, 2022

Hi,

I was trying to replicate some of the Random Forest results from the paper, specifically Figure 3(D) for the diabetes dataset, but I am unable to see the gap in AUC presented in the paper. It's probably me doing something silly :) - I'd appreciate some help!

To simplify identifying a good max_depth for a Random Forest, I'm using the class below; this allows me to use scikit-learn's GridSearchCV:

import utils as data_utils
import numpy as np, pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.base import BaseEstimator
from imodels import HSTreeClassifierCV
from matplotlib import pyplot as plt
import seaborn as sns; sns.set()

class HSRF(BaseEstimator):
    """Wraps HSTreeClassifierCV so that GridSearchCV can tune max_depth."""

    def __init__(self, reg_param_space, num_trees, max_depth):
        self.reg_param_space = reg_param_space
        self.num_trees = num_trees
        self.max_depth = max_depth
        self.HSTCV = None
        self.classes_ = None

    def fit(self, X, y):
        base_clf = RandomForestClassifier(n_estimators=self.num_trees, max_depth=self.max_depth)
        clf = HSTreeClassifierCV(base_clf, reg_param_list=self.reg_param_space)
        clf.fit(X, y)
        self.HSTCV = clf
        # this is needed for scikit's scorer to work
        self.classes_ = clf.estimator_.classes_
        return self  # scikit-learn convention: fit returns self

    def predict(self, X):
        return self.HSTCV.predict(X)

    def predict_proba(self, X):
        return self.HSTCV.predict_proba(X)

And here's my code - the X and y values passed in are from the diabetes dataset:

def run_mwe(X, y):
    reg_param_space = [0.1, 1.0, 10.0, 25.0, 50.0, 100.0]  # these are from the paper, Section 4.2
    num_trees_space = np.arange(2, 21, 2)
    max_depth_space = np.arange(1, 12, 2)

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.7)
    num_folds = 3
    rows = []  # collected into a DataFrame at the end (DataFrame.append is deprecated)

    # RF experiments first
    for nt in num_trees_space:
        base_clf = RandomForestClassifier(class_weight='balanced', n_estimators=nt)
        clf = GridSearchCV(base_clf, cv=num_folds, param_grid={'max_depth': max_depth_space}, refit=True,
                           scoring='roc_auc')
        clf.fit(X_train, y_train)
        pr = clf.best_estimator_.predict_proba(X_test)[:, 1]
        score = roc_auc_score(y_test, pr)
        rows.append({'method': 'RF', 'score': score, 'num_trees': nt,
                     'cv_max_depth': clf.best_params_['max_depth']})

    # HS Tree experiments next
    for nt in num_trees_space:
        clf = GridSearchCV(HSRF(reg_param_space=reg_param_space, num_trees=nt, max_depth=1),
                            param_grid={'max_depth': max_depth_space}, cv=num_folds, scoring='roc_auc', verbose=4,
                            refit=True)
        clf.fit(X_train, y_train)
        pr = clf.best_estimator_.predict_proba(X_test)[:, 1]
        score = roc_auc_score(y_test, pr)
        rows.append({'method': 'HSRF', 'score': score, 'num_trees': nt,
                     'cv_max_depth': clf.best_params_['max_depth']})

    return pd.DataFrame(rows)
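For completeness, a driver for run_mwe looks like this (a sketch; I'm assuming imodels' get_clean_dataset helper can fetch the PMLB diabetes dataset, but any loader that returns a binary-labeled X and y works):

from imodels.util.data_util import get_clean_dataset

# assumption: get_clean_dataset returns (X, y, feature_names) and accepts a PMLB source
X, y, feature_names = get_clean_dataset('diabetes', data_source='pmlb')
df = run_mwe(X, y)
print(df)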

When I plot the score column against num_trees in df, I see something like this:

(image: test AUC vs. num_trees for RF and HSRF; no visible gap between the two)

csinva (Owner) commented Aug 21, 2022

Thanks for your interest @abhishek-ghose! Sorry you've had a hard time replicating the result... I'm not sure I see any blatant error in your code.

@aagarwal1996 - could you point Abhishek here to the exact code we used to run this experiment in the paper?

abhishek-ghose (Author) commented Aug 21, 2022

Thank you for your response @csinva, and thank you for reviewing the snippet!
Since my initial post, I have investigated a bit more and made the following changes:

  1. The RF-only experiments use class_weight='balanced' in their RandomForestClassifier objects, but I forgot to set this parameter in the HSRF class. I'm not sure how much impact this has, but to match configurations, the HSRF class now also sets it.
  2. The HSTreeClassifierCV class uses a default of 3 folds; in my later experiments I set this to 5, both for more rigor and to have more training data available per fold (see the sketch after the next paragraph).
  3. Finally, I re-ran the experiments 10 times over a bunch of datasets: diabetes, covtype.binary, ionosphere, adult, phishing, and haberman.

The train:test splits were 70%:30% (stratified by label), and I capped the total dataset size at ~3000 points where more data was available, e.g., for covtype.binary.
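For concreteness, the fit method in the HSRF wrapper now looks roughly like this (a sketch of changes 1 and 2; cv is the HSTreeClassifierCV argument controlling the number of folds used to pick reg_param):

# inside the HSRF class
def fit(self, X, y):
    # change 1: match the RF-only runs by balancing class weights
    base_clf = RandomForestClassifier(n_estimators=self.num_trees,
                                      max_depth=self.max_depth,
                                      class_weight='balanced')
    # change 2: 5 folds instead of the default 3 for selecting reg_param
    clf = HSTreeClassifierCV(base_clf, reg_param_list=self.reg_param_space, cv=5)
    clf.fit(X, y)
    self.HSTCV = clf
    self.classes_ = clf.estimator_.classes_
    return self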

This is what the plots look like now. Note the titles: they show the dataset name, the number of trials, the number of folds used to select the best max_depth for a given number of trees, and the number of folds for picking the regularization parameter, referred to as reg_param:

(image: per-dataset plots of mean test AUC vs. num_trees for RF and HSRF across 10 trials)

To quantify how consistently HS is better, I report the p-value of a paired Wilcoxon signed-rank test (lower is better, i.e., stronger evidence that HS leads to improvements), where for each dataset the pairing is based on the number of trees. I also report the size of the gap by calculating the RMSE between the RF and HSRF AUC values (higher is better when the Wilcoxon p is small, i.e., HSRF leads to large improvements). For both, the mean AUC score across trials is used. Here are those numbers:

dataset         p          RMSE
diabetes        0.193359   0.0225293
covtype.binary  0.193359   0.0252946
ionosphere      0.695312   0.0392162
adult           0.0371094  0.00912343
phishing        0.921875   0.00572506
haberman        0.0195312  0.10074
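For reference, the two statistics for each dataset are computed like this (a minimal sketch; compare_auc is just an illustrative helper, and its inputs are the mean-AUC arrays aligned by number of trees):

import numpy as np
from scipy.stats import wilcoxon

def compare_auc(rf_auc, hsrf_auc):
    # mean AUC across trials, one entry per num_trees value,
    # so the Wilcoxon pairing is over the number of trees
    rf_auc, hsrf_auc = np.asarray(rf_auc), np.asarray(hsrf_auc)
    p = wilcoxon(rf_auc, hsrf_auc).pvalue              # paired signed-rank test, two-sided
    rmse = np.sqrt(np.mean((rf_auc - hsrf_auc) ** 2))  # size of the AUC gap
    return p, rmse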

abhishek-ghose (Author) commented Aug 23, 2022

I have shared some code here (works for the diabetes dataset as-is): https://colab.research.google.com/drive/1UbR77KcWCvp5QyIokVbW3LvsnTz4MBHI?usp=sharing

abhishek-ghose (Author) commented
Hi @csinva @aagarwal1996 - any pointers you can provide? Thanks!

aagarwal1996 (Collaborator) commented
Hi @abhishek-ghose, thanks for checking in again. I'll provide some code that can reproduce our experiments shortly. One quick thing I noticed: you seem to be tuning over max_depth for HS as well, whereas we used the default parameters for RF and only tuned over the regularization parameter.
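In code, that would look roughly like this (a sketch reusing the names from your snippet; note there is no GridSearchCV over max_depth, only HSTreeClassifierCV's internal search over reg_param):

# RF with default parameters; HS only tunes the regularization strength internally
base_clf = RandomForestClassifier(n_estimators=nt)  # default max_depth=None
clf = HSTreeClassifierCV(base_clf, reg_param_list=reg_param_space)
clf.fit(X_train, y_train)
score = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])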

abhishek-ghose (Author) commented Aug 27, 2022

Thanks @aagarwal1996 - I tried some experiments without setting the max_depth parameter for RFs, and I now see a gap (though this leads to some questions/observations, mentioned towards the end of this comment). I performed the following experiments:

  1. I repeated the experiments where I perform a CV over max_depth for both RF and HSRF. As before, results from 10 trials are presented, and the number of folds for finding max_depth and the regularization parameter are both set to 5.
  2. Two new sets of experiments where I use scikit-learn's default max_depth for RF, which is None. I call these RFn and HSRFn (note the suffix n, for max_depth=None). Per the scikit-learn docs, this setting means: "If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples."

This is what those results look like.

(image: per-dataset plots of mean test AUC vs. num_trees for RF, HSRF, RFn, and HSRFn)

For each dataset, I also calculated the p-value of a paired two-sided Wilcoxon signed-rank test. The pairing is over the number of trees in the RF(n)/HSRF(n) models, and the average of the trial scores is used. I performed the tests separately for RF vs. HSRF and RFn vs. HSRFn. Lower values are better, i.e., they indicate a significant difference between the AUC scores produced by the two techniques. Here they are:

dataset         method1  method2  p
diabetes        RF       HSRF     0.322266
diabetes        RFn      HSRFn    0.00195312
covtype.binary  RF       HSRF     0.0371094
covtype.binary  RFn      HSRFn    0.00195312
ionosphere      RF       HSRF     0.275391
ionosphere      RFn      HSRFn    0.275391
adult           RF       HSRF     0.00976562
adult           RFn      HSRFn    0.00195312
phishing        RF       HSRF     0.193359
phishing        RFn      HSRFn    0.00195312
haberman        RF       HSRF     0.375
haberman        RFn      HSRFn    0.00195312

Now we do see a gap, between RFn and HSRFn. But if I learn max_depth as well (RF vs. HSRF), the gap diminishes significantly, e.g., on diabetes. Given that max_depth=None grows the individual trees as deep as possible, potentially overfitting them, would it be correct to infer that hierarchical shrinkage is an alternative way to prevent overfitting, and that other ways, like controlling max_depth, are competitive?
