"Unbalanced Panel" when groups are different sizes #71

Open
alecmcclean opened this issue Apr 22, 2020 · 13 comments
@alecmcclean

Hi Guys,

First - thanks a bunch for translating this package to R, I really appreciate it. I just wanted to flag a small issue I've found when using the bacon() function.

It seems that bacon() does not currently allow groups to be different sizes. I've appended code to generate a minimal example. In the dataset I create, there are 3 groups (group_id == 1, 2, 3): groups 1 and 3 contain one individual each, and group 2 contains two individuals (ind_id is the individual ID).

If I run bacon(..., id_var = "group_id", ...), the function throws an "Unbalanced Panel" error, because group 2 has twice as many observations as group 1 (there are two individuals in group 2).

But I don't think that should be treated as an error; otherwise, you cannot demonstrate 2x2 weighting heterogeneity arising from the size of the groups. And, from what I understand, this is one of the key takeaways of the Bacon decomposition: larger groups receive higher weights in the 2x2 comparisons.

Alternatively, if you do want to call that an unbalanced panel, I don't think you need the code calculating "n_k, n_u, n_ku", because n_k = n_u by definition and n_ku = 0.5.

Thanks again,
Alec

library(dplyr)

df <- 
  expand.grid(
    group_id = c(1, 2, 3), # Group ID (treatment level ID)
    t  = c(0, 1, 2)  # Time
  ) %>%
  mutate(
    # Treatment status
    a = case_when(
      group_id == 2 & t > 0 ~ 1, # 1 time period untreated 2 periods treated
      group_id == 3 & t > 1 ~ 1, # 2 untreated 1 treated
      T ~ 0 # id == 1 never treated
    )
  )

# Expand dataset with "individual" level observations 
df <- df %>% left_join(
  expand.grid(
    group_id = c(1, 2, 3), 
    ind_id = seq(1, 2)
    ) %>%
    filter(group_id == 2 | ind_id < 2) ## Leave only group id == 2 with two individuals
  ) %>%
  select(group_id, ind_id, everything()) %>%
  arrange(group_id, ind_id, t)
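
For reference (this check is not in the original post), counting rows per group in the simulated data confirms the imbalance described above: group 2 has twice as many group-by-time observations as groups 1 and 3.

# Not part of the original post: count rows per group.
# Group 2 should have 6 rows (two individuals x three periods);
# groups 1 and 3 should have 3 rows each.
df %>% count(group_id)
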
@EdJeeOnGitHub
Collaborator

Hi Alec,

Sorry for the delay in replying.

We'll get to the bottom of this - it looks like we went a bit over the top sanitising user inputs.

@EdJeeOnGitHub
Collaborator

EdJeeOnGitHub commented May 1, 2020

This should have been fixed in the latest PR #72 @evanjflack

library(dplyr)
library(bacondecomp) # for bacon(); needed to run the example below
set.seed(938)

df <- 
  expand.grid(
    group_id = c(1, 2, 3), # Group ID (treatment level ID)
    t  = c(0, 1, 2)  # Time
  ) %>%
  mutate(
    # Treatment status
    a = case_when(
      group_id == 2 & t > 0 ~ 1, # 1 time period untreated 2 periods treated
      group_id == 3 & t > 1 ~ 1, # 2 untreated 1 treated
      T ~ 0 # id == 1 never treated
    )
  )

# Expand dataset with "individual" level observations 
df <- df %>% left_join(
  expand.grid(
    group_id = c(1, 2, 3), 
    ind_id = seq(1, 2)
  ) %>%
    filter(group_id == 2 | ind_id < 2) ## Leave only group id == 2 with two individuals
) %>%
  select(group_id, ind_id, everything()) %>%
  arrange(group_id, ind_id, t) %>% 
  mutate(y = rnorm(nrow(.)))



bacon_res <- df %>% 
  bacon(formula = y ~ a,
        id_var = "group_id",
        time_var = "t")


bacon_res

with results:


                      type weight  avg_est
1 Earlier vs Later Treated    0.2  0.30438
2 Later vs Earlier Treated    0.2  0.12943
3     Treated vs Untreated    0.6 -0.44938

  treated untreated   estimate weight                     type
2       1     99999 -0.8366176    0.4     Treated vs Untreated
3       2     99999  0.3250949    0.2     Treated vs Untreated
6       2         1  0.1294280    0.2 Later vs Earlier Treated
8       1         2  0.3043799    0.2 Earlier vs Later Treated
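
As a rough sanity check (not part of the original comment), the weighted sum of these 2x2 estimates can be compared against the coefficient on a from a two-way fixed effects regression, which the Goodman-Bacon decomposition is built to reproduce in the balanced case. Whether the identity holds exactly here, with duplicate group-time rows, is what this issue is about, so treat this as a diagnostic rather than a guarantee.

# A hedged check, not from the original thread: compare the TWFE coefficient
# on `a` (group and time fixed effects) with the weighted sum of the 2x2
# estimates returned by bacon().
twfe <- lm(y ~ a + factor(group_id) + factor(t), data = df)
coef(twfe)["a"]
sum(bacon_res$estimate * bacon_res$weight)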

@alecmcclean
Author

Great, thank you!

@hyeunjung

Thank you for this package! I tested the example code above, but my code doesn't run. I made sure that I have the most up-to-date version of the bacondecomp package, but I still get an unbalanced panel error. Could you please check whether the fix for the unbalanced panel is reflected in the updated version of the bacondecomp package in R?

Thank you so much for your help!

@EdJeeOnGitHub
Collaborator

Hi,

Did you use the latest version from GitHub or CRAN?

I believe this is fixed on GitHub but looking back at the logs I'm not sure if @evanjflack pushed the patch to CRAN.

If it's broken on GitHub too I'll have another look.

Thanks,
Ed

@EdJeeOnGitHub reopened this Nov 9, 2020
@hyeunjung

hyeunjung commented Nov 10, 2020 via email

@PromiseKamanga

Following this thread, I got the impression that the "unbalanced panel" error had already been fixed. However, I downloaded the package from GitHub today and I still get the same error when I try to use it. The data I am using involves bilateral trade values for multiple countries, so I have duplicate country-year combinations because I observe a country's trade with all of its partners in a given year. Could that explain the error? Do you have a suggestion on how I should proceed?

@kylebutts
Collaborator

Hi @PromiseKamanga, could you open a new issue and post the code you're trying to run that fails? I'll be happy to help.

@ridwandse

Hi @EdJeeOnGitHub, can you generate the same simulated dataset in Stata and post the code here, or share the data generated in R? I just want to see whether Stata's ddtiming gives me the same diff-in-diff estimate, with the same DD comparisons and weights. Just curious to learn.
Thanks

@EdJeeOnGitHub
Collaborator

Hi @ridwandse,

The code here will provide the exact same dataset since the seed has been set.

Something like write.csv(df, "my-df.csv") will save the data frame as a CSV file for loading into Stata.
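
A minimal version of that export, assuming the df from the snippets above (row.names = FALSE just keeps an extra index column out of the CSV when it is read into Stata):

# Assumes df from the simulation above; dropping row names keeps Stata's
# import from picking up an extra index column.
write.csv(df, "my-df.csv", row.names = FALSE)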

@ridwandse

Thanks, @EdJeeOnGitHub, will follow up on the same.
Actually, I have unbalanced data, and Stata's bacondecomp Y D, ddetail does not work with unbalanced data; it requires the data to be strongly balanced. However, another way of obtaining the Bacon decomposition is to use ddtiming, i.e. ddtiming Y D, i(id) t(year), which works in the unbalanced case. I am not sure whether to proceed with bacondecomp on a balanced panel or ddtiming on the unbalanced data. If you have any leads on that, please guide me.
Thanks

@kylebutts
Collaborator

@ridwandse I think this is incorrect. Just because something "runs" and spits out numbers does not mean it "works": the weights it reports are not correct. The Bacon decomposition holds only in the strongly balanced case (it's an algebraic relationship between the TWFE OLS coefficient and a bunch of different averages).

In the unbalanced case, you can calculate the weights by hand (it's basically a bunch of n's), which is what ddtiming does. The weights do not mean anything, though.
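
For readers who want to follow the "bunch of n's" remark, here is a rough sketch (not the package's internal code, and assuming df and dplyr from the snippets above) of the ingredients the balanced-panel weights are built from: the group shares n_k and treatment shares that Alec and @ridwandse refer to, with n_ku = n_k / (n_k + n_u) for a treated group k compared against an untreated group u.

# A sketch, not bacondecomp's internals: the building blocks of the
# balanced-panel 2x2 weights are group shares (n_k) and the share of
# periods each group spends treated (D_bar).
shares <- df %>%
  group_by(group_id) %>%
  summarise(n_k   = n() / nrow(df),  # group's share of all observations
            D_bar = mean(a))         # share of the group's rows treated
shares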

@ridwandse

Thank you @kylebutts, this was very useful. Yes, you are right. I have also calculated all the DD comparison weights by hand, as a combination of group sizes and treatment indicator (D) averages over i and t, and I get the same results as ddtiming.
Thanks
