ATE over subsets with low and high estimated CATEs - nonsensical results #1287

Open
RamirezAmayaS opened this issue Apr 9, 2023 · 4 comments

RamirezAmayaS commented Apr 9, 2023

Description of the bug
I am revisiting the analysis of Athey and Wager (2019). I am interested in running a falsification analysis in which the causal forests are trained not on the student and school covariates but on randomly generated vectors. My prior is that the heterogeneity tests should fail to reject the null of no heterogeneity. However, when I compare subsets with high and low estimated CATEs, the estimated average treatment effect on the high subset is close to zero while the estimated average treatment effect on the low subset is orders of magnitude larger. I can't find an explanation for this behavior. Is it a bug?

The other tests seem fine. The global ATE is close enough to the original results. The calibration test fails to reject the null of no heterogeneity.

Steps to reproduce

library(grf)

df = read.csv("experiments/acic18/synthetic_data.csv")

# Replace the real covariates with pure noise
X = matrix(runif(n = nrow(df) * 10), nrow = nrow(df))
colnames(X) = c("RF1","RF2","RF3","RF4","RF5","RF6","RF7","RF8","RF9","RF0")

Z = df$Z
Y = df$Y

Y.forest = regression_forest(X, Y)
Y.hat = predict(Y.forest)$predictions

Z.forest = regression_forest(X, Z)
Z.hat = predict(Z.forest)$predictions

cf.raw = causal_forest(X, Y, Z, Y.hat = Y.hat, W.hat = Z.hat)

varimp = variable_importance(cf.raw)
selected.idx = which(varimp > mean(varimp))

cf = causal_forest(
    X[, selected.idx],
    Y,
    Z,
    Y.hat = Y.hat,
    W.hat = Z.hat,
    tune.parameters = "all"
)

tau.df = predict(cf, estimate.variance = TRUE)[, c(1, 2)]
tau.hat = tau.df$predictions

# Distribution of predicted effects
hist(tau.hat)

# Average treatment effect
ATE = average_treatment_effect(cf)
paste(
    "95% CI for the ATE:", 
    round(ATE[1],3), 
    "+/-", 
    round(qnorm(0.975)*ATE[2],3)
)

Outputs: '95% CI for the ATE: 0.303 +/- 0.026'

# Compare regions with high and low estimated CATE
high_effect = tau.hat > median(tau.hat)
ate.high = average_treatment_effect(cf, subset=high_effect)
ate.low = average_treatment_effect(cf, subset=!high_effect)
paste(
    "95% CI for the difference in ATE:",
    round(ate.high[1] - ate.low[1],3),
    "+/-",
    round(qnorm(0.975)*sqrt(ate.high[2]^2 + ate.low[2]^2),3)
)

Outputs: '95% CI for the difference in ATE: -0.56 +/- 0.051'

average_treatment_effect(cf, subset=high_effect)

Outputs: estimate: -0.00124768810374905 std.err: 0.0182608951524164

average_treatment_effect(cf, subset=!high_effect)

Outputs: estimate: 0.608046759001875 std.err: 0.0182508648601049

# Test calibration
test_calibration(cf)

Outputs:

Best linear fit using forest predictions (on held-out data)
as well as the mean forest prediction as regressors, along
with one-sided heteroskedasticity-robust (HC3) SEs:

                                  Estimate  Std. Error t value Pr(>t)    
mean.forest.prediction            1.001729    0.041462  24.160 <2e-16 ***
differential.forest.prediction -682.911383   24.255158 -28.155      1    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

GRF version
grf_2.2.1

erikcs commented Apr 11, 2023

Hi @RamirezAmayaS, what you are observing is unfortunately a known artifact of doing these kinds of evaluations using Out-of-Bag (OOB) estimates. The suggested modern approach is to use the RATE with a training and evaluation sample. If you repeat your example from above, then you should see a flat TOC curve / zero RATE (when using a train/test split).
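Something along these lines (a sketch reusing the names from your script; the split details are illustrative):

# Illustrative train/evaluation split for the RATE check
n = nrow(X)
train = sample(n, floor(n / 2))

# A forest fit on the training half provides the prioritization scores
cf.train = causal_forest(X[train, ], Y[train], Z[train])
priority.cate = predict(cf.train, X[-train, ])$predictions

# A separate forest fit on the held-out half is used for evaluation
cf.eval = causal_forest(X[-train, ], Y[-train], Z[-train])

# With pure-noise covariates the TOC should be flat and the RATE near zero
rate = rank_average_treatment_effect(cf.eval, priority.cate)
plot(rate)
rate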

RamirezAmayaS commented Apr 11, 2023

Hi @erikcs, thanks for the suggestion. I'll try the RATE approach. Do you happen to know of a reference explaining why the OOB evaluation fails?

erikcs commented Apr 14, 2023

I'm not sure about a reference, but here is a simple example illustrating the issue with an OOB mean:

Let $Y_i \sim \text{Bernoulli}(\mu)$, $i = 1, \ldots, n$, with mean $\mu = 0.5$, and let $\bar Y$ denote the sample mean.

Then the OOB (leave-one-out) estimate of the mean for observation $i$ is $\hat\mu^{(-i)} = \bar Y - (Y_i - \bar Y) / (n - 1)$, which lies above $\bar Y$ exactly when $Y_i = 0$. Consequently

$E[Y_i \mid \hat\mu^{(-i)} > \bar Y] = 0$,

$E[Y_i \mid \hat\mu^{(-i)} < \bar Y] = 1$:

splitting the sample on its own OOB estimates splits on the noise in $Y_i$ itself.
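A quick simulation of this point (for illustration):

# Simulate the leave-one-out artifact for a Bernoulli mean
set.seed(42)
n = 10000
Y = rbinom(n, size = 1, prob = 0.5)

# Leave-one-out ("OOB") mean for each observation
mu.loo = (sum(Y) - Y) / (n - 1)

# Conditioning on the OOB estimate conditions on Y itself
mean(Y[mu.loo > mean(Y)])  # exactly 0
mean(Y[mu.loo < mean(Y)])  # exactly 1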

RamirezAmayaS commented

Thanks for your reply.

I don't think I'm following. Shouldn't the OOB mean be $\hat\mu^{(-j)} = \frac{1}{n-1} \sum_{i \neq j} Y_i$?
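(Writing it out, I see the two expressions coincide: rearranging the leave-one-out mean gives

$\frac{1}{n-1} \sum_{i \neq j} Y_i = \frac{n \bar Y - Y_j}{n - 1} = \bar Y - \frac{Y_j - \bar Y}{n - 1} = \hat\mu^{(-j)}$,

which is the form in your comment, so they are the same estimator.)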
