
Unable to replicate results on diabetes data from paper #129

Open · abhishek-ghose opened this issue Aug 20, 2022 · 6 comments

abhishek-ghose commented Aug 20, 2022

Hi,

I was trying to replicate some of the Random Forest results from the paper, specifically Figure 3(D) for the diabetes dataset, but I am unable to see the gap in AUC presented in the paper. It's probably me doing something silly :) - I'd appreciate some help!

To simplify identifying a good max_depth for a Random Forest, I'm using the class below; this allows me to use scikit-learn's GridSearchCV:

import utils as data_utils
import numpy as np, pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.base import BaseEstimator
from imodels import HSTreeClassifierCV
from matplotlib import pyplot as plt
import seaborn as sns; sns.set()

class HSRF(BaseEstimator):
    """Wraps HSTreeClassifierCV so that GridSearchCV can tune max_depth."""

    def __init__(self, reg_param_space, num_trees, max_depth):
        self.reg_param_space = reg_param_space
        self.num_trees = num_trees
        self.max_depth = max_depth
        self.HSTCV = None
        self.classes_ = None

    def fit(self, X, y):
        base_clf = RandomForestClassifier(n_estimators=self.num_trees, max_depth=self.max_depth)
        clf = HSTreeClassifierCV(base_clf, reg_param_list=self.reg_param_space)
        clf.fit(X, y)
        self.HSTCV = clf
        # this is needed for scikit's scorer to work
        self.classes_ = clf.estimator_.classes_
        return self  # scikit-learn convention: fit returns self

    def predict(self, X):
        return self.HSTCV.predict(X)

    def predict_proba(self, X):
        return self.HSTCV.predict_proba(X)

And here's my code - the X and y values passed in are from the diabetes dataset:

def run_mwe(X, y):
    reg_param_space = [0.1, 1.0, 10.0, 25.0, 50.0, 100.0]  # these are from the paper, Section 4.2
    num_trees_space = np.arange(2, 21, 2)
    max_depth_space = np.arange(1, 12, 2)

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.7)
    num_folds = 3
    rows = []  # collected into a DataFrame at the end (DataFrame.append is deprecated)

    # RF experiments first
    for nt in num_trees_space:
        base_clf = RandomForestClassifier(class_weight='balanced', n_estimators=nt)
        clf = GridSearchCV(base_clf, cv=num_folds, param_grid={'max_depth': max_depth_space}, refit=True,
                           scoring='roc_auc')
        clf.fit(X_train, y_train)
        pr = clf.best_estimator_.predict_proba(X_test)[:, 1]
        score = roc_auc_score(y_test, pr)
        rows.append({'method': 'RF', 'score': score, 'num_trees': nt,
                     'cv_max_depth': clf.best_params_['max_depth']})

    # HS Tree experiments next
    for nt in num_trees_space:
        clf = GridSearchCV(HSRF(reg_param_space=reg_param_space, num_trees=nt, max_depth=1),
                            param_grid={'max_depth': max_depth_space}, cv=num_folds, scoring='roc_auc', verbose=4,
                            refit=True)
        clf.fit(X_train, y_train)
        pr = clf.best_estimator_.predict_proba(X_test)[:, 1]
        score = roc_auc_score(y_test, pr)
        rows.append({'method': 'HSRF', 'score': score, 'num_trees': nt,
                     'cv_max_depth': clf.best_params_['max_depth']})

    return pd.DataFrame(rows)
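For completeness, a driver for run_mwe looks like this (a sketch; I'm assuming imodels' get_clean_dataset helper can fetch the PMLB diabetes dataset, but any loader that returns a binary-labeled X and y works):

from imodels.util.data_util import get_clean_dataset

# assumption: get_clean_dataset returns (X, y, feature_names) and accepts a PMLB source
X, y, feature_names = get_clean_dataset('diabetes', data_source='pmlb')
df = run_mwe(X, y)
print(df)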

When I plot the score column against num_trees in df, I see something like this:

(image: test AUC vs. num_trees for RF and HSRF; no visible gap between the two)

csinva (Owner) commented Aug 21, 2022

Thanks for your interest @abhishek-ghose! Sorry you've had a hard time replicating the result... I'm not sure I see any blatant error in your code.

@aagarwal1996 - could you point Abhishek here to the exact code we used to run this experiment in the paper?

abhishek-ghose (Author) commented Aug 21, 2022

Thank you for your response @csinva, and thank you for reviewing the snippet!
Since my initial post, I have investigated a bit more and made the following changes:

  1. The RF-only experiments use class_weight='balanced' in their RandomForestClassifier objects, but I forgot to set this parameter in the HSRF class. I'm not sure how much impact this has, but to match configurations, the HSRF class now also sets it.
  2. The HSTreeClassifierCV class uses a default of 3 folds; in my later experiments I set this to 5, both for more rigor and to have more training data available per fold (see the sketch after the next paragraph).
  3. Finally, I re-ran the experiments 10 times over a bunch of datasets: diabetes, covtype.binary, ionosphere, adult, phishing, and haberman.

The train:test splits were 70%:30% (stratified by label), and I capped the total dataset size at ~3000 points where more data was available, e.g., for covtype.binary.
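For concreteness, the fit method in the HSRF wrapper now looks roughly like this (a sketch of changes 1 and 2; cv is the HSTreeClassifierCV argument controlling the number of folds used to pick reg_param):

# inside the HSRF class
def fit(self, X, y):
    # change 1: match the RF-only runs by balancing class weights
    base_clf = RandomForestClassifier(n_estimators=self.num_trees,
                                      max_depth=self.max_depth,
                                      class_weight='balanced')
    # change 2: 5 folds instead of the default 3 for selecting reg_param
    clf = HSTreeClassifierCV(base_clf, reg_param_list=self.reg_param_space, cv=5)
    clf.fit(X, y)
    self.HSTCV = clf
    self.classes_ = clf.estimator_.classes_
    return self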

This is what the plots look like now. Note the titles: they show the dataset name, the number of trials, the number of folds used to select the best max_depth for a given number of trees, and the number of folds for picking the regularization parameter, referred to as reg_param:

(image: per-dataset plots of mean test AUC vs. num_trees for RF and HSRF across 10 trials)

To quantify how consistently HS is better, I report the p-value of a paired Wilcoxon signed-rank test (lower is better, i.e., stronger evidence that HS leads to improvements), where for each dataset the pairing is based on the number of trees. I also report the size of the gap by calculating the RMSE between the RF and HSRF AUC values (higher is better when the Wilcoxon p is small, i.e., HSRF leads to large improvements). For both, the mean AUC score across trials is used. Here are those numbers:

dataset         p          RMSE
diabetes        0.193359   0.0225293
covtype.binary  0.193359   0.0252946
ionosphere      0.695312   0.0392162
adult           0.0371094  0.00912343
phishing        0.921875   0.00572506
haberman        0.0195312  0.10074
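For reference, the two statistics for each dataset are computed like this (a minimal sketch; compare_auc is just an illustrative helper, and its inputs are the mean-AUC arrays aligned by number of trees):

import numpy as np
from scipy.stats import wilcoxon

def compare_auc(rf_auc, hsrf_auc):
    # mean AUC across trials, one entry per num_trees value,
    # so the Wilcoxon pairing is over the number of trees
    rf_auc, hsrf_auc = np.asarray(rf_auc), np.asarray(hsrf_auc)
    p = wilcoxon(rf_auc, hsrf_auc).pvalue              # paired signed-rank test, two-sided
    rmse = np.sqrt(np.mean((rf_auc - hsrf_auc) ** 2))  # size of the AUC gap
    return p, rmse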

abhishek-ghose (Author) commented Aug 23, 2022

I have shared some code here (works for the diabetes dataset as-is): https://colab.research.google.com/drive/1UbR77KcWCvp5QyIokVbW3LvsnTz4MBHI?usp=sharing

abhishek-ghose (Author) commented
Hi @csinva @aagarwal1996 - any pointers you can provide? Thanks!

aagarwal1996 (Collaborator) commented
Hi @abhishek-ghose, thanks for checking in again. I'll provide some code that can reproduce our experiments shortly. One quick thing I noticed: you seem to be tuning over max_depth for HS as well, whereas we used the default parameters for RF and only tuned over the regularization parameter.
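In code, that would look roughly like this (a sketch reusing the names from your snippet; note there is no GridSearchCV over max_depth, only HSTreeClassifierCV's internal search over reg_param):

# RF with default parameters; HS only tunes the regularization strength internally
base_clf = RandomForestClassifier(n_estimators=nt)  # default max_depth=None
clf = HSTreeClassifierCV(base_clf, reg_param_list=reg_param_space)
clf.fit(X_train, y_train)
score = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])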

abhishek-ghose (Author) commented Aug 27, 2022

Thanks @aagarwal1996 - I tried some experiments without setting the max_depth parameter for RFs, and I now see a gap (though this leads to some questions/observations, mentioned towards the end of this comment). I performed the following experiments:

  1. I repeated the experiments where I perform a CV over max_depth for both RF and HSRF. As before, results from 10 trials are presented, and the number of folds for finding max_depth and the regularization parameter are both set to 5.
  2. Two new sets of experiments where I use scikit-learn's default max_depth for RF, which is None. I call these RFn and HSRFn (note the suffix n, for max_depth=None). Per the scikit-learn docs, this setting means: "If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples."

This is what those results look like.

(image: per-dataset plots of mean test AUC vs. num_trees for RF, HSRF, RFn, and HSRFn)

For each dataset, I also calculated the p-value of a paired two-sided Wilcoxon signed-rank test. The pairing is over the number of trees in the RF(n)/HSRF(n) models, and the average of the trial scores is used. I performed the tests separately for RF vs. HSRF and RFn vs. HSRFn. Lower values are better, i.e., they indicate a significant difference between the AUC scores produced by the two techniques. Here they are:

dataset         method1  method2  p
diabetes        RF       HSRF     0.322266
diabetes        RFn      HSRFn    0.00195312
covtype.binary  RF       HSRF     0.0371094
covtype.binary  RFn      HSRFn    0.00195312
ionosphere      RF       HSRF     0.275391
ionosphere      RFn      HSRFn    0.275391
adult           RF       HSRF     0.00976562
adult           RFn      HSRFn    0.00195312
phishing        RF       HSRF     0.193359
phishing        RFn      HSRFn    0.00195312
haberman        RF       HSRF     0.375
haberman        RFn      HSRFn    0.00195312

Now we do see a gap, between RFn and HSRFn. But if I learn max_depth as well (RF vs. HSRF), the gap diminishes significantly, e.g., on diabetes. Given that max_depth=None grows the individual trees as deep as possible, potentially overfitting them, would it be correct to infer that hierarchical shrinkage is an alternative way to prevent overfitting, and that other ways, like controlling max_depth, are competitive?
