Name	Name	Last commit message	Last commit date
parent directory ..
readme_files/figure-markdown_github	readme_files/figure-markdown_github
readme.Rmd	readme.Rmd
readme.md	readme.md

Reproducible results and figures

Structure of the data
Reproducing results from the paper
Session information

This document presents the code needed to reproduce the empirical analysis, figures, and tables presented in the paper "Blood is thicker than bloodshed: A genealogical approach to reconstruct populations after armed conflicts". using the data collected in Rio Negro. The package EGM is needed to reproduce the results of this study. It can be downloaded from Github using the devtools package in R:

  library(devtools)
  install_github("alburezg/EGM", dep = FALSE)

After installation, the package can be loaded, together with the other packages needed for the analysis (see version numbers and other technical details at the end of this document):

library(EGM)

library(igraph)
library(ggplot2)
library(dplyr)
library(purrr)
library(stringr)
library(lubridate)
library(plyr)
library(knitr)

Structure of the data

The microdata data from Rio Negro cannot be shared in full given privacy concerns. The raw data includes sensitive information such as date of birth, date of death, and kinship ties, which could be used to uniquely identify members of the population. Nevertheless, this section provides a brief overview of the seven datasets referenced in the code below. Note that the datasets included in the EGM package are for illustrative purposes and only include the columns referenced in the analysis.

The first dataset, paradata, records information data about the EGM interviews themselves, such as household id, unique id of the respondent, number of relatives reported, etc.

paradata %>% head %>% kable

h_id	respondent1	no_u_members
1	12	47
2	2	32
3	1	81
4	12	121
5	13	14
6	5	43

The ind.q dataset is a joint collection of all individuals reported in the EGM interviews, without removing duplicates. Having the original data with duplicates is useful for studying genealogical saturation and multiple reporting.

ind.q %>% head %>% kable

h_id	interview_date	ind_id	full_id	id81	id08	idall	DoB	origin_comunidad	alive	father	mother	spouse1	spouse2	spouse3	spouse4	spouse5	spouse6	spouse7	spouse8
1	26/01/2016	1	1_1	555	575	1	28/02/1954	rio negro	1	8	9	2	NA	NA	NA	NA	NA	NA	NA
1	26/01/2016	2	1_2	554	NA	2	10/03/1939	NA	0	NA	NA	1	NA	NA	NA	NA	NA	NA	NA
1	26/01/2016	3	1_3	699	570	3	07/10/1978	rio negro	1	2	1	5	7	NA	NA	NA	NA	NA	NA
1	26/01/2016	4	1_4	NA	572	4	09/09/1996	pacux	1	7	3	6	NA	NA	NA	NA	NA	NA	NA
1	26/01/2016	5	1_5	NA	571	5	16/06/1975	canchun	1	NA	NA	3	NA	NA	NA	NA	NA	NA	NA
1	26/01/2016	6	1_6	NA	1241	6	23/01/1993	pacux	1	NA	NA	4	NA	NA	NA	NA	NA	NA	NA

The final dataset records all members of the community, reconstructed using the EGM method. It is the consolidated and deduplicated version of the ind.q dataset.

final %>% head %>% kable

ind_id	id_cuest	DoD	DoD_with_alive	DoB	alive	died_conflict	sex	sex2	origin_literal	origin_departamento	origin_local	parents	father	mother	spouses	children	cousins	grandparents	grandchildren
1	1_1;3_24;33_30	NA	2016	1954	1	FALSE	2	Female	1	baja verapaz	TRUE	8;9	8	9	2	3;120;121	NA	NA	4;60;122;3507;123
2	1_2;3_25;33_36	1984	1984	1939	0	FALSE	1	Male	coban	alta verapaz	FALSE		NA	NA	1	3;120;121	NA	NA	4;60;122;3507;123
3	1_3;3_26;33_37;34_49;25_3	NA	2016	1976	1	FALSE	2	Female	1	baja verapaz	TRUE	2;1	2	1	5;7	4;60;122;123	24;3;50;25;51;3159;75;2199;468;79;85;2200;80;86;81;87;82;76;88;83;89;84;90	8;9	47
4	1_4;33_41;34_55;3_28	NA	2016	1996	1	FALSE	2	Female	2	baja verapaz	TRUE	7;3	7	3	6	47	4;3507;1538;1539;1540;1541	484;1542;2;1	NA
5	1_5;3_27;33_44;34_58	NA	2016	1974	1	FALSE	1	Male	canchun	baja verapaz	FALSE		NA	NA	3	60;122;123	NA	NA	NA
6	1_6;3_29;33_42;34_56;44_31	NA	2016	1993	1	FALSE	1	Male	2	baja verapaz	TRUE	371;178	371	178	4	47	2720;2712;2721;6;2714;419;420;2713;421;2715;422;2716;423;2717;2718;2719	368;361;172;97	NA

yearly_nets records the number of relatives alive over time for each member of the population (whose id is given in the 'ego' column). The data is a transformation of the final dataset.

yearly_nets %>% tail %>% kable

	ego	year	relatives_alive_ext	relatives_alive_nuc
70507	3540	2016	10	2
70508	3542	2016	15	4
70509	3543	2016	15	4
70510	3545	2016	10	3
70511	3565	2016	13	3
70512	3566	2016	13	3

The analysis below also uses data from secondary sources. This was used to check the quality of the demographic information collected using the EGM. The individuals81 dataset records the 1978 census conducted by INDE power company in Rio Negro before the killings. (It is referred to as '81' because the census report was published in 1981.)

individuals81 %>% head %>% kable

ind_id	sex	age81
6	1	39
7	2	37
9	2	16
10	1	7
11	2	4
12	1	1

The individuals08 dataset records the 2008 local census conducted by COCAHICH in Rio Negro and Pacux.

individuals08 %>% head %>% kable

ind_id	sex	age08
1	1	42.23014
2	2	45.42740
3	1	24.58630
4	1	22.25753
5	2	20.37260
6	2	17.72877

The fafg_linked dataset is a transformed version of the list of Rio Negro massacre victims made available by the Guatemalan Forensic Anthropology Foundation (FAFG) (link). The data was linked with the EGM-generated records using record linkage methods (the linkage code is not included in the script).

fafg_linked %>% head %>% kable

id	idall	gender	id_cuest
1	296	female	105_97;12_69;59_46;64_5
2	6	female	1_6;3_29;33_42;34_56;44_31
3	1069	female	106_55;120_58;131_13
4	15	female	1_15
5	58	female	11_87;110_20;3_19
6	63	female	3_38;33_52

Reproducing results from the paper

In what follows, I show how these datasets were analysed to produce the results reported in the paper. Note that the code will not work using only the sample datasets included in the EGM package. Resarchers interested in replicating any particular section of the analysis are encouraged to contact the author.

Figure 2. Genealogical saturation estimated by matching EGM-generated records with list of victims and with census records

This graph shows the share of records in independent population registers that were recorded in the EGM-generated data. In order to obtain this, it was first necessary to recode the EGM-generated data chronologically by the date in which the interview was conducted:

egm_chrono <- ind.q %>% 
  select(int_id = h_id, interview_date, full_id, id81, id08, idall) %>% 
  dplyr::arrange(as.Date(ind.q$interview_date, "%d/%m/%Y"), as.numeric(ind.q$h_id))

int_id_old <- unique(egm_chrono$int_id) 

egm_chrono$int_id <- as.numeric(plyr::mapvalues(egm_chrono$int_id, from = int_id_old, to = 1:length(int_id_old)))

int_id_new <- unique(egm_chrono$int_id)

Afterwards, the EGM data was matched against the two local censuses available fro Rio Negro:

# 2.1. Match with 1978 census

col <- paste0("id",81)
df <- individuals81

# get all ids

ids <- df$ind_id
len <- length(ids) # original number of records

# save all links in all questionnaires in a list

l <- split(egm_chrono[,col],egm_chrono$int_id)

l <- lapply(l,function(x) x[!is.na(x)])

# save in var
assign(col,sapply(l,completion))

# 2.2. Match with 2008 census

col <- paste0("id","08")
df <- individuals08

# get all ids

ids <- df$ind_id
len <- length(ids) # original number of records

# save all links in all questionnaires in a list

l <- split(egm_chrono[,col],egm_chrono$int_id)

l <- lapply(l,function(x) x[!is.na(x)])

assign(col,sapply(l,completion))

# Number of records in each interview

l <- split(egm_chrono[,col],egm_chrono$int_id)
len <- sapply(l,length) # output as vector

# rescale to 0-0.5
len <- (len/max(len))/2

# 2.3. Match with "final" df

# length of ind.all
len <- nrow(final) 

col <- "idall"

l <- split(egm_chrono[,col],egm_chrono$int_id)

l <- lapply(l,function(x) x[!is.na(x)])

idall <- unlist(l)

cuts <- unlist(lapply(l,length))

quest <- rep(unique(egm_chrono$int_id), cuts)

df <- data.frame(quest,idall)

matched <- rep(NA, len)

for(n in seq_along(df$idall)) {
  line <- match(df$idall[n], final$ind_id)
  if (is.na(matched[line]))
    matched[line] <- df$quest[n]
}

new_records <- as.numeric(as.matrix(table(matched)))

not <- int_id_new[!( int_id_new %in% unique(matched) )]
pos <- match(not,int_id_new)

new_records <- append(new_records, 0, pos[1])
new_records <- append(new_records, 0, pos[2])

idall_or <- cumsum(new_records)

idall <- idall_or/len

After this, the EGM-records were matched against another independent source of demographic data: lists of victims compiled by the FAFG:

# 3. Match with FAFG Data

# Assign links to ind.q

egm_chrono$id_fafg <- NA

for(n in 1:nrow(fafg_linked)) {
  id_cuest <- unlist( strsplit(fafg_linked$id_cuest[n], ";") )
  egm_chrono$id_fafg[match(id_cuest, egm_chrono$full_id)] <- n
}

col <- paste0("id_fafg")
df <- fafg_linked

# get all ids

ids <- df$id
len <- length(ids) # original number of records

# save all links in all questionnaires in a list

l <- split(egm_chrono[,col],egm_chrono$int_id)

l <- lapply(l,function(x) x[!is.na(x)])

# save in var
assign(col,sapply(l,completion))

Finally, the plot could be generated using ggplot2:

# 4. Plot per cent only lines in white bg

df3 <- data.frame(
  id=rep(0:100,3), 
  rate = c(c(0,id81)*100,c(0,id08)*100, c(0,id_fafg)*100), 
  Origin = c(rep("1978 census", 101), 
             rep("2008 census", 101),
             rep("Victims's list", 101)
  )
)

# for adding points to plot
df3$point <- NA
df3$point[seq(5,303,10)] <- df3$rate[seq(5,303,10)]

# Define colour scheme for plot
suf_col <- c("#990099", "#009900", "#B01824")

df3 %>%
    ggplot(aes(x=id,y=rate, group = Origin
               , shape = Origin
               , colour = Origin
    )
    ) +
    geom_point(aes(y = point), size = 3) +
    geom_line(size = 1) +
    scale_x_continuous("Time (EGM interviews ordered by collection date)",
                       breaks = seq(0, 100, by = 25), limits = c(0,100), expand=c(0.0,0)) +
    scale_y_continuous("Cumulative percent of matched individuals",
                       breaks = seq(25, 100, by = 25), limits = c(0,100), expand=c(0.0,0)
                       , sec.axis = sec_axis(
                         ~./100*3566
                         , name = "Unique individuals in genealogical dataset"
                         , breaks = c(891, 1783, 2674, 3566)
                         , labels = function(br) paste0(round(abs(br)/1000, 1), "k")
                       )
    ) +
    scale_shape_discrete("Matched with",
                         labels = c("1978 census", "2008 census", "FAFG victims") ) +
    scale_color_manual("Matched with", values = suf_col,
                       labels = c("1978 census", "2008 census", "FAFG victims") ) +
    theme_bw() +
    theme(
      legend.position = "bottom"
      , axis.title.y.right = element_text(margin = margin(t = 0, r = 00, b = 0, l = 20))
      , plot.margin = unit(c(1,0,0,0), "cm")
    )

Figure 3. Age-sex structure of the Rio Negro population - Comparing genealogical data with census records

This figure compared the distribution of the Rio Negro population using EGM and census data from 1981 and 2008.

# 1. Define parameters

y_max <- 70
by <- 10
cuts2 <- c(seq(0,y_max,by))
ag_names <- c( paste0(0, "-", by-1), 
               paste( seq(by,y_max - by, by), seq(by + by - 1,y_max, by), sep =  "-"), 
               paste0(y_max, "+"))

# 2. Get data from EGM-derived dataset for both years

# 2.1. 1982: EGM-derived data

rn81 <- census(final, 1981) %>%  dplyr::filter(origin_local == T)

tot <- nrow(rn81)

dist1 <- 
  rn81 %>%
  dplyr::mutate(
    ag = cut(age_at_census, c(cuts2, Inf), include.lowest= T, right = F, labels = ag_names)
  ) %>%
  dplyr::group_by(sex2, ag) %>%
  dplyr::summarise(
    pop = n(),
    n = pop / tot
  )

# 2.2. 2008: EGM-derived data

rn08 <- census(final, 2008) %>%  dplyr::filter(origin_local == T)

tot <- nrow(rn08)

dist2 <- 
  rn08 %>%
  dplyr::mutate(
    ag = cut(age_at_census, c(cuts2, Inf), include.lowest= T, right = F, labels = ag_names)
  ) %>%
  dplyr::group_by(sex2, ag) %>%
  dplyr::summarise(
    pop = n(),
    n = pop / tot
  )

# 3. Get data from local censuses to compare against EGM data

# 3.1. 1981 Rio Negro census

ind81 <- individuals81 %>% dplyr::filter(!is.na(age81) & !is.na(sex) & sex != 3)

tot <- length(ind81$age81)

dist3 <- 
  ind81  %>% 
  dplyr::mutate(
    sex = plyr::mapvalues(
      sex, from = c(2,1), to = c("Women", "Men")
    ),
    sex = factor(sex, levels = c("Women", "Men")),
    ag = cut(age81, c(cuts2, Inf), include.lowest= T, right = F, labels = ag_names)
  ) %>%
  dplyr::group_by(sex, ag) %>%
  dplyr::summarise(
    pop = n(),
    n = pop / tot
  )

# 3.2. 2008 Rio Negro census

ind08 <- individuals08 %>% dplyr::filter(!is.na(age08) & !is.na(sex) & sex != 3)

tot <- length(ind08$age08)

dist4 <- 
  ind08  %>% 
  dplyr::mutate(
    sex = plyr::mapvalues(
      sex, from = c(2,1), to = c("Women", "Men")
    ),
    sex = factor(sex, levels = c("Women", "Men")),
    ag = cut(age08, c(cuts2, Inf), include.lowest= T, right = F, labels = ag_names)
  ) %>%
  dplyr::group_by(sex, ag) %>%
  dplyr::summarise(
    pop = n(),
    n = pop / tot
  )

All the datasets were then consolidated in a long format:

cuts_y <- seq(-0.20,0.20,0.10)

rn81_df <- dist1 %>%
  dplyr::rename(Sex = sex2) %>%
  ungroup() %>% 
  dplyr::mutate(
    Sex = plyr::mapvalues(
      Sex, from = c("Female", "Male"), to = c("Women", "Men")),
    Sex = factor(Sex, levels = c("Women", "Men"))
  ) %>% 
  dplyr::mutate(
    source = "EGM-reconstructed data",
    year = "1978"
  )%>%
  dplyr::mutate(
    pop = ifelse(Sex=="Women", pop*(-1), pop*1),
    n = ifelse(Sex=="Women", n*(-1), n*1),
    share = paste0(abs(round(n*100,1)), "%")
  )

rn08_df <- dist2 %>%
  dplyr::rename(Sex = sex2) %>%
  ungroup() %>% 
  dplyr::mutate(
    Sex = plyr::mapvalues(
      Sex, from = c("Female", "Male"), to = c("Women", "Men")),
    Sex = factor(Sex, levels = c("Women", "Men"))
  ) %>% 
  dplyr::mutate(source = "EGM-reconstructed data",
                year = "2008"
  )%>%
  dplyr::mutate(
    pop = ifelse(Sex=="Women", pop*(-1), pop*1),
    n = ifelse(Sex=="Women", n*(-1), n*1),
    share = paste0(abs(round(n*100,1)), "%")
  ) %>% 
  na.omit

inde81_df  <- dist3 %>%
  dplyr::rename(Sex = sex) %>%
  dplyr::mutate(source = "Local census data",
                year = "1978"
  )%>%
  dplyr::mutate(
    pop = ifelse(Sex=="Women", pop*(-1), pop*1),
    n = ifelse(Sex=="Women", n*(-1), n*1),
    share = paste0(abs(round(n*100,1)), "%")
  ) 

inde08_df  <- dist4 %>%
  dplyr::rename(Sex = sex) %>%
  dplyr::mutate(source = "Local census data",
                year = "2008"
  )%>%
  dplyr::mutate(
    pop = ifelse(Sex=="Women", pop*(-1), pop*1),
    n = ifelse(Sex=="Women", n*(-1), n*1),
    share = paste0(abs(round(n*100,1)), "%")
  ) 

both <-  
  bind_rows(
    rn81_df
    ,rn08_df
    ,inde81_df
    ,inde08_df
  )

The text also makes reference to dependency ratios, which were computed as follows:

# Get df of all age distributions
all_df <- lapply(mget(ls(pattern = "^dist[1-4]")), function(df){ 
  df %>% 
    ungroup() %>% 
    na.omit %>% 
    dplyr::rename(sex = dplyr::starts_with("sex")) %>% 
    select(-n) %>% 
    dplyr::mutate(sex = as.character(sex), ag = as.character(ag)) %>% 
    # Remove all sex
    group_by(ag) %>% 
  dplyr::summarise_if(is.numeric, list(~sum)) %>% 
    dplyr::mutate(share = pop/sum(pop))
}) %>% 
  reduce(left_join, by = c("ag"))

new_cols <- c( "ag", 
   paste(
     c("pop","share"),
     c(rep("rn81",2),rep("rn08",2), rep("inde81",2), rep("inde08", 2)),
     sep = "_"
   )
  
)

colnames(all_df) <- new_cols

dep <- c(1,7,8)
indep <- c(2,3,4,5,6)

dep_df <- all_df %>% 
  select(dplyr::starts_with("pop_"))

ratios <- colSums( dep_df[dep,] ) / colSums( dep_df[indep,] )
ratios <- round(ratios,2)

At this point, it was possible to compute the Whipple index for the four populations of interest, which was later included in the graph:

ages <- list(
  rn81=rn81 %>% pull(age_at_census)
  , rn08=rn08 %>% pull(age_at_census)
  , ind81=ind81 %>% pull(age81)
  , ind08=ind08 %>% pull(age08)
)

w_index <- lapply(ages, function(x) {
  w <- whipple(round(x), printme = F)
  }) %>% 
  unlist %>% 
  as.numeric

w_index <-  ifelse(w_index <105, "<105", 
                   ifelse(w_index >125, ">125", round(w_index, 0)))

# Create labels df

w_df <- data.frame(
  x = rep(levels(both$ag)[7], 4)
  , y = - 150
  , year = rep(unique(both$year, 2)) 
  , source = sort(rep(unique(both$source), 2))
  , label = paste("Whipple:", unlist(w_index, use.names = T), " \n", "Dep. ratio:", ratios)
  , Sex = c("Women")
  )

Finally, this was all put together to generate Figure 3:

both %>%
    ggplot(aes(x=ag, y = pop, label = share)) +
    geom_col(aes(fill=Sex)) +
    geom_label(data = w_df, aes(x = x, y = y, label = label)) +
    geom_text(aes(label = share),
              position = position_nudge(y = ifelse(both$Sex == "Women", -30, 30)),
              size = 4
    ) +
    scale_fill_manual("",labels = c("Women", "Men"), values = suf_col) +
    coord_flip() +
    scale_y_continuous(
      "", breaks = scales::pretty_breaks(n = 6),
      labels = function(br) ifelse(abs(br)>=1000,
                                   paste0(abs(br)/1000, "k"),
                                   abs(br))
    ) +
    xlab("") +
    facet_grid(year~source) +
    # theme_dag_big(size = 20) +
      theme_dag() +
    theme(
      legend.position = 'bottom'
      ,axis.title.x=element_blank()
      )

The text also references other ways in which the EGM-generated data was compared to the local census data as a way of checking for data quality. These included comparing the population distributions for all sexes combined. This was done for the INDE census of 1981:

chisq.test(x = all_df$pop_rn81, p = all_df$share_inde81)

## 
##  Chi-squared test for given probabilities
## 
## data:  all_df$pop_rn81
## X-squared = 9.0435, df = 7, p-value = 0.2495

and for the COCAHICH census of 2008:

chisq.test(x = all_df$pop_rn08, p = all_df$share_inde08)

## 
##  Chi-squared test for given probabilities
## 
## data:  all_df$pop_rn08
## X-squared = 10.989, df = 7, p-value = 0.1391

Figure 4 Age-sex distribution of massacre-derived mortality in Rio Negro reconstructed from EGM data

This figure shows the distribution of excess mortality in the village after the 1982 massacres, bot in absolute and relative terms. We initially define the parameters for the analysis

start <- 0
y_max <- 50

by <- 10
sep <- "-"

cuts2 <- c(seq(start,y_max,by))

ag_names <- c( paste0(start, sep, by-1), 
               paste( seq(by,y_max - by, by), seq(by + by - 1,y_max, by), sep =  sep), 
               paste0(y_max, "+"))

Later on, we estimate the denominator: total population alive in 1981, before the killings, by age and sex:

all <- 
  final %>%
  census(1981) %>%
  dplyr::mutate(
    age_conflict = ifelse(DoB <= 1982, 1982-DoB, NA),
    ag_conflict = cut(age_conflict, c(cuts2, Inf), right = F, labels = ag_names)
  ) %>% 
  dplyr::count(sex2, ag_conflict) %>% 
  na.omit

The following code estimates the number killed in the 1982 massacres and what share of the population this represented:

died <-
  final %>%
  census(1981) %>%
  dplyr::filter(died_conflict) %>% 
  dplyr::mutate(
    age_conflict = ifelse(DoB <= 1982, 1982-DoB, NA),
    ag_conflict = cut(age_conflict, c(cuts2, Inf), right = F, labels = ag_names)
  ) %>% 
  dplyr::count(sex2, ag_conflict) %>% 
  na.omit %>%
  dplyr::mutate(
    share = n/all$n * 100
  ) %>%
  data.frame

died_df <- 
  died %>% 
  dplyr::rename(sex = sex2) %>% 
  reshape2::melt(id = c("sex", "ag_conflict")) %>% 
  dplyr::mutate(
    variable = mapvalues(variable, 
                         from = c("n", "share"), 
                         to = c("All direct conflict deaths (count)", 
                                "Deaths as share of total population (%)")
    )
  ) %>% 
  dplyr::mutate(
    sex = plyr::mapvalues(
      sex, from = c("Female", "Male"), to = c("Women", "Men")),
    sex = factor(sex, levels = c("Women", "Men"))
  )

Finally, we plot the figure:

died_df %>% 
  ggplot(aes(x = ag_conflict, y = value, group = sex,
             colour = sex, shape = sex)) +
  geom_line() +
  geom_point(size = 4) +
  labs(x="",y="") +
  scale_shape_discrete("Age at conflict") +
  scale_colour_manual("Age at conflict",values = suf_col) +
  facet_grid(.~variable) +
  theme_dag() +
  theme(legend.position = "bottom")

Figure 5. Size of egocentric kinship network in Rio Negro over time by sex and age group (1980-2015)

This figure represents the average number of relatives alive by birth cohort for inhabitants of Rio Negro on a yearly basis. The analysis is relatively straightforward as the yearly number of relatives was estimated beforehand from the final dataset and saved to the yearly_nets data frame, which is used in the code below. The code to create the latter is not included in this vignette but is available upon request.

# Define parameters for analysis

min_year <- 1980

custom_interval <- c(0,15,45)
levels_ordered <- c("0 to 14", "15 to 44", "45+")

nets <-
  merge(
    yearly_nets,
    dplyr::select(final, ego = ind_id, DoB, sex = sex2),
    by = "ego",
    all.x = T
  ) %>%
  dplyr::filter(!is.na(sex)) %>%
  dplyr::filter(year >= min_year) %>%
  dplyr::mutate(
    age_now = year - DoB,
    ag_now = cut(age_now, c(custom_interval, Inf), right = F, labels = levels_ordered),
    ext2 = relatives_alive_ext- relatives_alive_nuc
  ) %>%
  dplyr::select(year, ag_now, sex,nuc = relatives_alive_nuc, ext = ext2) %>%
  group_by(year, ag_now, sex) %>%
  dplyr::summarise(
    nuc = mean(nuc),
    ext = mean(ext)
  ) %>%
  data.frame

Finally, we transform the data to a long format in order to plot it using ggplot2:

fam_size_df <- 
    nets %>%
    dplyr::mutate(
      sex = plyr::mapvalues(
        sex, from = c("Female", "Male"), to = c("Women", "Men")),
      sex = factor(sex, levels = c("Women", "Men"))
    ) %>%
    reshape2::melt(.,id = c("year", "ag_now", "sex")) 
  
fam_size_df    %>%
    ggplot(aes(x = year, y = value, fill = variable)) +
    geom_col(position=position_stack(reverse = T)) +
    scale_fill_manual("", labels = c("Nuclear kin", "Extended kin"), values = suf_col) +
    geom_vline(xintercept = 1982, linetype = "dashed") +
    scale_x_continuous("", breaks = seq(1980, 2010, 10), 
                       labels = c(1980, 90, 2000, 10)) +
    scale_y_continuous("Relatives alive (mean)") +
    facet_grid(sex~ag_now) +
    # theme_dag_big(size = 26) +
      theme_dag() +
    theme(legend.position = "bottom")

Table 1. Data quality and completeness of EGM-generated data: An evaluation of four potential sources of bias

Table 1 shows different measures of data quality for the EGM data. The columns of the table refer to:

The degree of age heaping;
the share of individuals with no reported parents;
the share of individuals for whom the date of birth is unknown

The table is constructed in several steps. The first chunk looks at social distance from the respondent:

rm(list=ls(pattern = "^res[0-9]"))

min_birth_year <- 1960

# 7.1. Social distance 

# 7.1.1. Get social distance from ego to respondent

# Interviews as networks

# split ind.q depending on the household of the interview
inds <- split(ind.q,ind.q$h_id)

# create nodes df

nodes <- lapply(inds, function (x) {
  data.frame(id=unique(x$ind_id), stringsAsFactors = F)
})

# create links  df 

# Transform relational fields to network format
# WEIGHTS: if a parent-son relationship, weight is 1
#          if affinal relationship, weight = 0.5

# get 100 differents dfs with network links
links <- lapply(inds,kin_as_net)

# convert to list of network object 

# d describes the edges of the network. Its first two columns are the IDs of the source and the target 
# node for each edge. The following columns are edge attributes (weight, type, label, or anything else).
# vertices starts with a column of node IDs. Any following columns are interpreted as node attributes.

net <- lapply(1:100, function(x) {
  l <- links[[x]]
  n <- nodes[[x]]
  igraph::graph_from_data_frame(d = l, vertices = n, directed = T)
})

# 7.1.2. Establish kinship between respondent and individuals

ind.q2 <- 
  merge(
    ind.q %>%
      dplyr::mutate(
        id = as.numeric(str_extract(ind_id, "[0-9]{1,3}$") )
      )
    ,
    paradata %>% 
      select(h_id, respondent = respondent1),
    by = "h_id",
    all.x = T
  ) %>% 
  dplyr::mutate(
    respondent = ifelse(is.na(respondent), 1, respondent)
  )

ind.q_l <- split(ind.q2, ind.q2$h_id)

distance_to_respondent <- 
  lapply(1:100, function(n) {
    gr <- net[[n]]
    df <- ind.q_l[[n]]
    
    respondent <- df$respondent[1]
    egos <- df$id
    
    suppressWarnings(
      paths <- shortest_paths(gr, 
                              from = respondent, 
                              to = V(gr)
                              , mode = "all"
      )
    )
    
    sapply(paths$vpath, length)
    
  })

ind.q_temp <- 
  ind.q %>%
  dplyr::mutate(
    distance_to_respondent =  unlist(distance_to_respondent),
    distance_ntile = ntile(distance_to_respondent, 4)
    , DoB_y = year(dmy(DoB))
  ) %>% 
  dplyr::filter(DoB_y >= min_birth_year)

# 1. a heaping ####

heaping <- 
  ind.q_temp %>% 
  dplyr::filter(!is.na(DoB)) %>% 
  dplyr::mutate(
    DoB_y = year(dmy(DoB)),
    age_fieldwork = 2016-DoB_y
  ) %>%
  group_by(distance_ntile) %>%
  dplyr::summarise(
    w_reproductive = 
      round(whipple(age_fieldwork,min_age = 18, max_age = 47, printme = F), 1),
    w_all = 
      round(whipple(age_fieldwork,min_age = 23, max_age = 62, printme = F), 1)
  ) %>%
  data.frame

# 1. b-c missing ####

missing <-
  merge(
    ind.q_temp %>% select(ind_id, idall, dplyr::starts_with("distance")),
    final %>% 
      dplyr::mutate(
        mother_NA = ifelse(is.na(mother), T, F),
        DoB_NA = ifelse(is.na(DoB), T, F),
        DoD_NA = ifelse(is.na(DoD_with_alive), T, F)
      ) %>% 
      select(ind_id, dplyr::ends_with("_NA")),
    by.x = "idall",
    by.y = "ind_id",
    all.x = T
  ) %>% 
  group_by(distance_ntile) %>%
  dplyr::summarise(
    mother_NA = round(sum(mother_NA, na.rm = T)/n()*100, 1),
    DoB_NA = round(sum(DoB_NA, na.rm = T)/n()*100, 1),
    DoD_NA = round(sum(DoD_NA, na.rm = T)/n()*100, 1)
  ) %>%
  data.frame

# 1. results ####

res1 <- 
  merge(
    heaping,
    missing 
  ) %>% 
  dplyr::filter(distance_ntile %in% c(1,4))

The second chunk focuses on comparing outcomes for individuals who were alive or death at the time of the data collection:

# 7.2. Alive vs. death ####

ind.q_temp <- 
  ind.q_temp %>%
  dplyr::mutate(
    alive = ifelse(alive==1, T, F)
  )

# 2. a heaping ####

heaping <- 
  ind.q_temp %>% 
  dplyr::filter(!is.na(DoB) & !is.na(alive)) %>% 
  dplyr::mutate(
    DoB_y = year(dmy(DoB)),
    age_fieldwork = 2016-DoB_y
  ) %>%
  group_by(alive) %>%
  dplyr::summarise(
    w_reproductive = 
      round(whipple(age_fieldwork,min_age = 18, max_age = 47, printme = F), 1),
    w_all = 
      round(whipple(age_fieldwork,min_age = 23, max_age = 62, printme = F), 1)
  ) %>%
  data.frame

# 2. b-c missing ####

missing <-
  merge(
    ind.q_temp %>% 
      dplyr::filter(!is.na(alive)) %>% 
      select(ind_id, idall, alive),
    final %>% 
      dplyr::mutate(
        mother_NA = ifelse(is.na(mother), T, F),
        DoB_NA = ifelse(is.na(DoB), T, F),
        DoD_NA = ifelse(is.na(DoD_with_alive), T, F)
      ) %>% 
      select(ind_id, dplyr::ends_with("_NA")),
    by.x = "idall",
    by.y = "ind_id",
    all.x = T
  ) %>% 
  group_by(alive) %>%
  dplyr::summarise(
    mother_NA = round(sum(mother_NA, na.rm = T)/n()*100, 1),
    DoB_NA = round(sum(DoB_NA, na.rm = T)/n()*100, 1),
    DoD_NA = NA
  ) %>%
  data.frame

# 2. results ####

res2 <- 
  merge(
    heaping,
    missing 
  ) %>% 
  arrange(desc(alive))

The third chunk looks at locals and migrants separately:

# 7.3. Local vs. migrant ####

ind.q_temp$origin_local <- 
  ifelse(ind.q_temp$origin_comunidad == "pacux" | ind.q_temp$origin_comunidad == "rio negro", T, F)

# 3. a heaping ####

heaping <- 
  ind.q_temp %>% 
  dplyr::filter(!is.na(DoB) & !is.na(origin_local)) %>% 
  dplyr::mutate(
    DoB_y = year(dmy(DoB)),
    age_fieldwork = 2016-DoB_y
  ) %>%
  # group_by(distance_to_respondent) %>% 
  group_by(origin_local) %>%
  dplyr::summarise(
    w_reproductive = 
      round(whipple(age_fieldwork,min_age = 18, max_age = 47, printme = F), 1),
    w_all = 
      round(whipple(age_fieldwork,min_age = 23, max_age = 62, printme = F), 1)
  ) %>%
  data.frame

# 3. b-c missing ####

missing <-
  merge(
    ind.q_temp %>% 
      dplyr::filter(!is.na(origin_local)) %>% 
      select(ind_id, idall, origin_local),
    final %>% 
      dplyr::mutate(
        mother_NA = ifelse(is.na(mother), T, F),
        DoB_NA = ifelse(is.na(DoB), T, F),
        DoD_NA = ifelse(is.na(DoD_with_alive), T, F)
      ) %>% 
      select(ind_id, dplyr::ends_with("_NA")),
    by.x = "idall",
    by.y = "ind_id",
    all.x = T
  ) %>% 
  group_by(origin_local) %>%
  dplyr::summarise(
    mother_NA = round(sum(mother_NA, na.rm = T)/n()*100, 1),
    DoB_NA = round(sum(DoB_NA, na.rm = T)/n()*100, 1),
    DoD_NA = round(sum(DoD_NA, na.rm = T)/n()*100, 1)
  ) %>%
  data.frame

# 3. results ####

res3 <- 
  merge(
    heaping,
    missing 
  ) %>%
  arrange(desc(origin_local))

The fourth chunk focuses on the role of the interview length, measured as the number of records produced in each interview:

# 7.4. By interview length ####

int_paradata <- 
  paradata %>% 
  select(h_id, n = no_u_members) %>% 
  dplyr::filter(h_id %in% unique(ind.q_temp$h_id))

int_paradata$n[is.na(int_paradata$n)] <- 
  ind.q_temp %>% 
  dplyr::count(h_id) %>% 
  dplyr::filter(h_id %in% int_paradata$h_id[is.na(int_paradata$n)]) %>%
  pull(n)

int_paradata <- 
  int_paradata %>%
  dplyr::mutate(
    ntile = ntile(n, 4)
  )

# 4.(a) heaping ####

heaping <- 
  ind.q_temp %>% 
  dplyr::filter(!is.na(DoB)) %>% 
  dplyr::mutate(
    DoB_y = year(dmy(DoB)),
    age_fieldwork = 2016-DoB_y
  ) %>%
  group_by(h_id) %>% 
  dplyr::summarise(
    w_reproductive = 
      round(whipple(age_fieldwork,min_age = 18, max_age = 47, printme = F), 1),
    w_all = 
      round(whipple(age_fieldwork,min_age = 23, max_age = 62, printme = F), 1)
  ) %>%
  data.frame

# 4.b-c missing ####
# For this, you need the idall ids

missing <-
  merge(
    ind.q_temp %>% select(ind_id, idall, h_id),
    final %>% 
      dplyr::mutate(
        mother_NA = ifelse(is.na(mother), T, F),
        DoB_NA = ifelse(is.na(DoB), T, F),
        DoD_NA = ifelse(is.na(DoD_with_alive), T, F)
      ) %>% 
      select(ind_id, dplyr::ends_with("_NA")),
    by.x = "idall",
    by.y = "ind_id",
    all.x = T
  ) %>% 
  group_by(h_id) %>%
  dplyr::summarise(
    mother_NA = round(sum(mother_NA, na.rm = T)/n()*100, 1),
    DoB_NA = round(sum(DoB_NA, na.rm = T)/n()*100, 1),
    DoD_NA = round(sum(DoD_NA, na.rm = T)/n()*100, 1)
  ) %>%
  data.frame

# 4. results ####

res4 <- 
  merge(
    merge(
      int_paradata,
      heaping
    ), 
    missing
  ) %>%
  # dplyr::filter(ntile %in% c(1,4)) %>%
  group_by(ntile) %>% 
  dplyr::summarise(
    w_all = mean(w_all),
    w_reproductive = mean(w_reproductive),
    mother_NA = mean(mother_NA),
    DoB_NA = mean(DoB_NA),
    DoD_NA = mean(DoD_NA)
  ) %>% 
  dplyr::rename(length_ntile = ntile) %>%
  dplyr::filter(length_ntile %in% c(1,4))

Putting all these results together, it is possible to generate the main components of Table 1, which were then assembled manually into a tabular format.

dfs <- lapply(mget(ls(pattern = "^res[0-9]")), function(df) {
  df %>% 
    mutate_if(is.numeric, list(~round(., digits = 1)))
})
  
print(dfs)

## $res1
##   distance_ntile w_reproductive w_all mother_NA DoB_NA DoD_NA
## 1              1           87.3  93.6       8.3    2.7    2.5
## 2              4          107.8  96.4      21.6    3.6    1.7
## 
## $res2
##   alive w_reproductive w_all mother_NA DoB_NA DoD_NA
## 1  TRUE          100.2  97.8      15.1    3.8     NA
## 2 FALSE          120.2 114.1      14.1    4.7     NA
## 
## $res3
##   origin_local w_reproductive w_all mother_NA DoB_NA DoD_NA
## 1         TRUE          105.7 100.7       9.1    2.6    1.7
## 2        FALSE           99.5 100.0      26.5    5.8    1.8
## 
## $res4
## # A tibble: 2 x 6
##   length_ntile w_all w_reproductive mother_NA DoB_NA DoD_NA
##          <dbl> <dbl>          <dbl>     <dbl>  <dbl>  <dbl>
## 1            1  104.           114.      16.4    4.7    2.6
## 2            4  102.           104.      13.4    2.9    1.2

Figure 1. Hypothetical genealogy demonstrating the logic of the EGM chain-referral sampling

Finally, although not a result of the study per se, the genealogical diagram presented as Figure 1 in the paper can be reproduced using the kinship2 package in R:

library(kinship2)

ped <- data.frame(
  ped = 1
  , id = 1:11
  , dadid = c(3, NA, NA, 5, NA, NA, 5, 10, 7, NA, NA)
  , momid = c(4, NA, NA, 6, NA, NA, 6, 11, 8, NA, NA)
  , sex = c(1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2)
  , affected = c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0)
  , famid = 1 
)

relation <- matrix(c(1,2,4, 1), nrow = 1)

p <- pedigree(
  id = ped$id, 
  dadid = ped$dadid, 
  momid = ped$momid,
  sex = ped$sex, 
  affected = ped$affected,
  famid = ped$ped,
  relation = relation
  ) 

plot.pedigree(p['1']
              , id = LETTERS[1:11]
              , symbolsize = 1
              , mar = c(0,0,0,0)
              , cex = 1
              , pconnect = T
)

Session information

Report by Diego Alburez-Gutierrez - alburezgutierrez[at]demogr.mpg.de; www.alburez.me

## [1] "Report created: 2019-03-14 17:42:29"

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kinship2_1.6.4         quadprog_1.5-5         Matrix_1.2-14         
##  [4] EGM.confidential_0.1.0 knitr_1.21             plyr_1.8.4            
##  [7] lubridate_1.7.4        stringr_1.3.1          purrr_0.3.0           
## [10] dplyr_0.8.0.1          ggplot2_3.1.0          igraph_1.2.2          
## [13] EGM_0.1.0             
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0       pillar_1.3.1     compiler_3.5.1   highr_0.7       
##  [5] tools_3.5.1      digest_0.6.18    lattice_0.20-35  evaluate_0.13   
##  [9] tibble_2.0.1     gtable_0.2.0     pkgconfig_2.0.2  rlang_0.3.1     
## [13] cli_1.0.1        yaml_2.2.0       xfun_0.4         withr_2.1.2     
## [17] grid_3.5.1       tidyselect_0.2.5 glue_1.3.0       R6_2.4.0        
## [21] fansi_0.4.0      rmarkdown_1.11   reshape2_1.4.3   magrittr_1.5    
## [25] scales_1.0.0     htmltools_0.3.6  assertthat_0.2.0 colorspace_1.4-0
## [29] labeling_0.3     utf8_1.1.4       stringi_1.2.4    lazyeval_0.2.1  
## [33] munsell_0.5.0    crayon_1.3.4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1_reproducible_results

1_reproducible_results

readme_files/figure-markdown_github

readme_files/figure-markdown_github

readme.Rmd

readme.Rmd

readme.md

readme.md

readme.md

Reproducible results and figures

Structure of the data

Reproducing results from the paper

Figure 2. Genealogical saturation estimated by matching EGM-generated records with list of victims and with census records

Figure 3. Age-sex structure of the Rio Negro population - Comparing genealogical data with census records

Figure 4 Age-sex distribution of massacre-derived mortality in Rio Negro reconstructed from EGM data

Figure 5. Size of egocentric kinship network in Rio Negro over time by sex and age group (1980-2015)

Table 1. Data quality and completeness of EGM-generated data: An evaluation of four potential sources of bias

Figure 1. Hypothetical genealogy demonstrating the logic of the EGM chain-referral sampling

Session information

Files

1_reproducible_results

Directory actions

More options

Directory actions

More options

Latest commit

History

1_reproducible_results

Folders and files

parent directory

readme_files/figure-markdown_github

readme_files/figure-markdown_github

readme.Rmd

readme.Rmd

readme.md

readme.md

readme.md

Reproducible results and figures

Structure of the data

Reproducing results from the paper

Figure 2. Genealogical saturation estimated by matching EGM-generated records with list of victims and with census records

Figure 3. Age-sex structure of the Rio Negro population - Comparing genealogical data with census records

Figure 4 Age-sex distribution of massacre-derived mortality in Rio Negro reconstructed from EGM data

Figure 5. Size of egocentric kinship network in Rio Negro over time by sex and age group (1980-2015)

Table 1. Data quality and completeness of EGM-generated data: An evaluation of four potential sources of bias

Figure 1. Hypothetical genealogy demonstrating the logic of the EGM chain-referral sampling

Session information