DRtester does not work for binary treatment AND binary outcome #875

yanisvdc · 2024-04-09T23:06:28Z

Hi, here is the code to reproduce the error:

import numpy as np
import pandas as pd
import scipy.stats as st

from econml.metalearners import TLearner
from econml.dml import DML

from econml.validate.drtester import DRtester

from sklearn.ensemble import (RandomForestClassifier, RandomForestRegressor,
                              GradientBoostingClassifier, GradientBoostingRegressor, 
                              AdaBoostClassifier, AdaBoostRegressor)

from sklearn.svm import SVC, SVR
from sklearn.neural_network import MLPClassifier, MLPRegressor
from lightgbm import LGBMClassifier, LGBMRegressor
                                
    # Classifiers
from sklearn.dummy import DummyClassifier

np.random.seed(123)

N = 20000  # number of units
K = 5  # number of covariates
num_treatments = 2 # number of treatments (excluding control)

# Generate random Xs
X_mu = np.zeros(5)  # Means of Xs
# Random covariance matrix of Xs
X_sig = np.diag(np.random.rand(5))
X = st.multivariate_normal(X_mu, X_sig).rvs(N)

# Effect of Xs on outcome
X_beta = np.random.uniform(0, 5, K)
# Effect of treatment on outcomes
D_beta = np.array([0, 1, 2])
# Effect of treatment on outcome conditional on X1
DX1_beta = np.array([0, 0, 3])

# Generate treatments based on X and random noise
beta_treat = np.random.uniform(-1, 1, (num_treatments + 1, K))
D1 = np.zeros((N, num_treatments + 1))
for k in range(num_treatments + 1):
    D1[:, k] = X @ beta_treat[k, :] + np.random.gumbel(0, 1, N)
D = np.array([np.where(D1[i, :] == np.max(D1[i, :]))[0][0] for i in range(N)])
D_dum = pd.get_dummies(D)

# Generate Y (based on X, D, and random noise)
Y_sig = 1  # Variance of random outcome noise
Y = X @ X_beta + (D_dum @ D_beta) + X[:, 1] * (D_dum @ DX1_beta) + np.random.normal(0, Y_sig, N)
Y = Y.to_numpy()

# Split into training/validation samples
train_prop = .5
train_N = np.ceil(train_prop * N)
ind = np.array(range(N))
train_ind = np.random.choice(N, int(train_N), replace=False)
val_ind = ind[~np.isin(ind, train_ind)]

Xtrain, Dtrain, Ytrain = X[train_ind], D[train_ind], Y[train_ind]
Xval, Dval, Yval = X[val_ind], D[val_ind], Y[val_ind]



model_regression = GradientBoostingRegressor(random_state=0)
model_propensity = RandomForestClassifier(random_state=0)

nval = len(Dval)
ntrain = len(Dtrain)
Dval = np.random.choice(2, nval)
Dtrain = np.random.choice(2, ntrain)

nval = len(Yval)
ntrain = len(Ytrain)
Yval = np.random.choice(2, nval)
Ytrain = np.random.choice(2, ntrain)

est_dm = DML(model_y = model_propensity, 
             model_t= model_propensity,
             model_final = model_regression,
             discrete_outcome=True,
             discrete_treatment=True,
             cv=5)

est_dm.fit(Ytrain, Dtrain, X=Xtrain)

# Initialize DRTester and fit/predict nuisance models
dml_tester = DRtester(
    model_regression=model_regression, 
    model_propensity=model_propensity,
    cate=est_dm
).fit_nuisance(Xval, Dval, Yval, Xtrain, Dtrain, Ytrain)

res_dml = dml_tester.evaluate_all(Xval, Xtrain)
res_dml.summary()

Error message:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_28104/1480639724.py in <module>
     92 ).fit_nuisance(Xval, Dval, Yval, Xtrain, Dtrain, Ytrain)
     93 
---> 94 res_dml = dml_tester.evaluate_all(Xval, Xtrain)
     95 res_dml.summary()

[~/.local/lib/python3.10/site-packages/econml/validate/drtester.py](http://localhost:8888/home/yvdc/.local/lib/python3.10/site-packages/econml/validate/drtester.py) in evaluate_all(self, Xval, Xtrain, n_groups)
    594             self.get_cate_preds(Xval, Xtrain)
    595 
--> 596         blp_res = self.evaluate_blp()
    597         cal_res = self.evaluate_cal(n_groups=n_groups)
    598         qini_res = self.evaluate_uplift(metric='qini')

[~/.local/lib/python3.10/site-packages/econml/validate/drtester.py](http://localhost:8888/home/yvdc/.local/lib/python3.10/site-packages/econml/validate/drtester.py) in evaluate_blp(self, Xval, Xtrain)
    462 
    463         if self.n_treat == 1:  # binary treatment
--> 464             reg = OLS(self.dr_val_, add_constant(self.cate_preds_val_)).fit()
    465             params = [reg.params[1]]
    466             errs = [reg.bse[1]]

[~/.local/lib/python3.10/site-packages/statsmodels/tools/tools.py](http://localhost:8888/home/yvdc/.local/lib/python3.10/site-packages/statsmodels/tools/tools.py) in add_constant(data, prepend, has_constant)
    191         x = x[:, None]
    192     elif x.ndim > 2:
--> 193         raise ValueError('Only implemented for 2-dimensional arrays')
    194 
    195     is_nonzero_const = np.ptp(x, axis=0) == 0

ValueError: Only implemented for 2-dimensional arrays

It does work when the outcome y is continuous.

The text was updated successfully, but these errors were encountered:

yanisvdc · 2024-04-09T23:46:36Z

Interestingly, if I remove discrete_treatment = True, and if I put a regressor for model_y (even though y is binary in my case), the code will run; but not sure if the result will be valid, it could be since model_y still estimates the probability y=1, which is the supposed behavior when discrete_treatment = True and with a classifier as model_y. Please let me know if the result would be valid in that case or if you see how to modify the code to make it work with discrete_treatment = True and with a classifier as model_y.

kbattocchi · 2024-04-11T17:13:47Z

Interestingly, if I remove discrete_treatment = True, and if I put a regressor for model_y (even though y is binary in my case), the code will run; but not sure if the result will be valid, it could be since model_y still estimates the probability y=1, which is the supposed behavior when discrete_treatment = True and with a classifier as model_y. Please let me know if the result would be valid in that case or if you see how to modify the code to make it work with discrete_treatment = True and with a classifier as model_y.

Did you mean remove discrete_outcome=True? DRTester is only designed for discrete treatments, so you should certainly not change the discrete_treatment argument. It does look like a bug that you can't use discrete_outcome=True, but I think switching to a regressor should be fine in most cases.

On an unrelated note, I would not use the DML class directly - if you want a non-parametric final model you should either use NonParamDML if you want to use an arbitrary final model of your choosing (but this only supports a single treatment and outcome), or use CausalForestDML if you have an arbitrary number of treatments and outcomes and want confidence intervals (but the final model is limited to being a CausalForest).

yanisvdc · 2024-04-11T17:17:51Z

Yes I meant removing discrete_outcome = True
Do you have the same bug as me when trying to run the code, do we agree that it should work and that something unexpected happen within the library that is not under my control, or should I try to dig more? Thanks!

kbattocchi · 2024-04-11T17:44:15Z

Yes, this is a bug on our end, but using a continuous outcome (even if it's really discrete) should be fine as a workaround.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DRtester does not work for binary treatment AND binary outcome #875

DRtester does not work for binary treatment AND binary outcome #875

yanisvdc commented Apr 9, 2024 •

edited

yanisvdc commented Apr 9, 2024

kbattocchi commented Apr 11, 2024

yanisvdc commented Apr 11, 2024

kbattocchi commented Apr 11, 2024

DRtester does not work for binary treatment AND binary outcome #875

DRtester does not work for binary treatment AND binary outcome #875

Comments

yanisvdc commented Apr 9, 2024 • edited

yanisvdc commented Apr 9, 2024

kbattocchi commented Apr 11, 2024

yanisvdc commented Apr 11, 2024

kbattocchi commented Apr 11, 2024

yanisvdc commented Apr 9, 2024 •

edited