1_reproducible_results/readme.Rmd

---
title: "Reproducible results and figures"
author: ""
output:
  github_document:
    toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warnings = FALSE, message = FALSE)
```

This document presents the code needed to reproduce the empirical analysis, figures, and tables presented in the paper "Blood is thicker than bloodshed: A genealogical approach to reconstruct populations after armed conflicts". using the data collected in Rio Negro. The package `EGM` is needed to reproduce the results of this study. It can be downloaded from Github using the `devtools` package in R:

```{r, eval = FALSE}
  library(devtools)
  install_github("alburezg/EGM", dep = FALSE)
```

After installation, the package can be loaded, together with the other packages needed for the analysis (see version numbers and other technical details at the end of this document): 

```{r, message=F, warning=F}
library(EGM)

library(igraph)
library(ggplot2)
library(dplyr)
library(purrr)
library(stringr)
library(lubridate)
library(plyr)
library(knitr)
```

```{r, eval=T, include=F, warning=F, message=F}
# This provate package contains the actual data used to reproduce the figures. 
# This package is used to produce the results in this report, but is not publically available given privacy concerns
library(EGM.confidential)
```

## Structure of the data

The microdata data from Rio Negro cannot be shared in full given privacy concerns. The raw data includes sensitive information such as date of birth, date of death, and kinship ties, which could be used to uniquely identify members of the population. Nevertheless, this section provides a brief overview of the seven datasets referenced in the code below. Note that the datasets included in the EGM package are for illustrative purposes and only include the columns referenced in the analysis.

The first dataset, `paradata`, records information data about the EGM interviews themselves, such as household id, unique id of the respondent, number of relatives reported, etc. 

```{r}
paradata %>% head %>% kable
```

The `ind.q` dataset is a joint collection of all individuals reported in the EGM interviews, without removing duplicates. Having the original data with duplicates is useful for studying genealogical saturation and multiple reporting.

```{r}
ind.q %>% head %>% kable
```

The `final` dataset records all members of the community, reconstructed using the EGM method. It is the consolidated and deduplicated version of the `ind.q` dataset.

```{r}
final %>% head %>% kable
```

`yearly_nets` records the number of relatives alive over time for each member of the population (whose id is given in the 'ego' column). The data is a transformation of the `final` dataset.

```{r}
yearly_nets %>% tail %>% kable
```

The analysis below also uses data from secondary sources. This was used to check the quality of the demographic information collected using the EGM. The `individuals81` dataset records the 1978 census conducted by INDE power company in Rio Negro before the killings. (It is referred to as '81' because the census report was published in 1981.)

```{r}
individuals81 %>% head %>% kable
```

The `individuals08` dataset records the 2008 local census conducted by COCAHICH in Rio Negro and Pacux.

```{r}
individuals08 %>% head %>% kable
```

The `fafg_linked` dataset is a transformed version of the list of Rio Negro massacre victims made available by the Guatemalan Forensic Anthropology Foundation (FAFG) ([link](http://fafg.org/bd/l_victima2.php?boton=1&fuente=2&limit_i_victima=0&dep_id_muni%5B%5D=26&id_muni%5B%5D=282&lugar_lp=&no_pager=1)). The data was linked with the EGM-generated records using record linkage methods (the linkage code is not included in the script). 

```{r}
fafg_linked %>% head %>% kable
```

# Reproducing results from the paper

In what follows, I show how these datasets were analysed to produce the results reported in the paper. Note that the code will not work using only the sample datasets included in the EGM package. Resarchers interested in replicating any particular section of the analysis are encouraged to contact the author. 

## Figure 2. Genealogical saturation estimated by matching EGM-generated records with list of victims and with census records

This graph shows the share of records in independent population registers that were recorded in the EGM-generated data. In order to obtain this, it was first necessary to recode the EGM-generated data chronologically by the date in which the interview was conducted:

```{r}

egm_chrono <- ind.q %>% 
  select(int_id = h_id, interview_date, full_id, id81, id08, idall) %>% 
  dplyr::arrange(as.Date(ind.q$interview_date, "%d/%m/%Y"), as.numeric(ind.q$h_id))

int_id_old <- unique(egm_chrono$int_id) 

egm_chrono$int_id <- as.numeric(plyr::mapvalues(egm_chrono$int_id, from = int_id_old, to = 1:length(int_id_old)))

int_id_new <- unique(egm_chrono$int_id) 
```

Afterwards, the EGM data was matched against the two local censuses available fro Rio Negro:

```{r}

# 2.1. Match with 1978 census

col <- paste0("id",81)
df <- individuals81

# get all ids

ids <- df$ind_id
len <- length(ids) # original number of records

# save all links in all questionnaires in a list

l <- split(egm_chrono[,col],egm_chrono$int_id)

l <- lapply(l,function(x) x[!is.na(x)])

# save in var
assign(col,sapply(l,completion))

# 2.2. Match with 2008 census

col <- paste0("id","08")
df <- individuals08

# get all ids

ids <- df$ind_id
len <- length(ids) # original number of records

# save all links in all questionnaires in a list

l <- split(egm_chrono[,col],egm_chrono$int_id)

l <- lapply(l,function(x) x[!is.na(x)])

assign(col,sapply(l,completion))

# Number of records in each interview

l <- split(egm_chrono[,col],egm_chrono$int_id)
len <- sapply(l,length) # output as vector

# rescale to 0-0.5
len <- (len/max(len))/2

# 2.3. Match with "final" df

# length of ind.all
len <- nrow(final) 

col <- "idall"

l <- split(egm_chrono[,col],egm_chrono$int_id)

l <- lapply(l,function(x) x[!is.na(x)])

idall <- unlist(l)

cuts <- unlist(lapply(l,length))

quest <- rep(unique(egm_chrono$int_id), cuts)

df <- data.frame(quest,idall)

matched <- rep(NA, len)

for(n in seq_along(df$idall)) {
  line <- match(df$idall[n], final$ind_id)
  if (is.na(matched[line]))
    matched[line] <- df$quest[n]
}

new_records <- as.numeric(as.matrix(table(matched)))

not <- int_id_new[!( int_id_new %in% unique(matched) )]
pos <- match(not,int_id_new)

new_records <- append(new_records, 0, pos[1])
new_records <- append(new_records, 0, pos[2])

idall_or <- cumsum(new_records)

idall <- idall_or/len 
```

After this, the EGM-records were matched against another independent source of demographic data: lists of victims compiled by the FAFG:

```{r}
# 3. Match with FAFG Data

# Assign links to ind.q

egm_chrono$id_fafg <- NA

for(n in 1:nrow(fafg_linked)) {
  id_cuest <- unlist( strsplit(fafg_linked$id_cuest[n], ";") )
  egm_chrono$id_fafg[match(id_cuest, egm_chrono$full_id)] <- n
}

col <- paste0("id_fafg")
df <- fafg_linked

# get all ids

ids <- df$id
len <- length(ids) # original number of records

# save all links in all questionnaires in a list

l <- split(egm_chrono[,col],egm_chrono$int_id)

l <- lapply(l,function(x) x[!is.na(x)])

# save in var
assign(col,sapply(l,completion))
```

Finally, the plot could be generated using `ggplot2`:

```{r, fig.height = 4, fig.width = 6, fig.align = "center", warning=F}

# 4. Plot per cent only lines in white bg

df3 <- data.frame(
  id=rep(0:100,3), 
  rate = c(c(0,id81)*100,c(0,id08)*100, c(0,id_fafg)*100), 
  Origin = c(rep("1978 census", 101), 
             rep("2008 census", 101),
             rep("Victims's list", 101)
  )
)

# for adding points to plot
df3$point <- NA
df3$point[seq(5,303,10)] <- df3$rate[seq(5,303,10)]

# Define colour scheme for plot
suf_col <- c("#990099", "#009900", "#B01824")

df3 %>%
    ggplot(aes(x=id,y=rate, group = Origin
               , shape = Origin
               , colour = Origin
    )
    ) +
    geom_point(aes(y = point), size = 3) +
    geom_line(size = 1) +
    scale_x_continuous("Time (EGM interviews ordered by collection date)",
                       breaks = seq(0, 100, by = 25), limits = c(0,100), expand=c(0.0,0)) +
    scale_y_continuous("Cumulative percent of matched individuals",
                       breaks = seq(25, 100, by = 25), limits = c(0,100), expand=c(0.0,0)
                       , sec.axis = sec_axis(
                         ~./100*3566
                         , name = "Unique individuals in genealogical dataset"
                         , breaks = c(891, 1783, 2674, 3566)
                         , labels = function(br) paste0(round(abs(br)/1000, 1), "k")
                       )
    ) +
    scale_shape_discrete("Matched with",
                         labels = c("1978 census", "2008 census", "FAFG victims") ) +
    scale_color_manual("Matched with", values = suf_col,
                       labels = c("1978 census", "2008 census", "FAFG victims") ) +
    theme_bw() +
    theme(
      legend.position = "bottom"
      , axis.title.y.right = element_text(margin = margin(t = 0, r = 00, b = 0, l = 20))
      , plot.margin = unit(c(1,0,0,0), "cm")
    )

```

## Figure 3. Age-sex structure of the Rio Negro population - Comparing genealogical data with census records

This figure compared the distribution of the Rio Negro population using EGM and census data from 1981 and 2008.

```{r}
# 1. Define parameters

y_max <- 70
by <- 10
cuts2 <- c(seq(0,y_max,by))
ag_names <- c( paste0(0, "-", by-1), 
               paste( seq(by,y_max - by, by), seq(by + by - 1,y_max, by), sep =  "-"), 
               paste0(y_max, "+"))

# 2. Get data from EGM-derived dataset for both years

# 2.1. 1982: EGM-derived data

rn81 <- census(final, 1981) %>%  dplyr::filter(origin_local == T)

tot <- nrow(rn81)

dist1 <- 
  rn81 %>%
  dplyr::mutate(
    ag = cut(age_at_census, c(cuts2, Inf), include.lowest= T, right = F, labels = ag_names)
  ) %>%
  dplyr::group_by(sex2, ag) %>%
  dplyr::summarise(
    pop = n(),
    n = pop / tot
  )

# 2.2. 2008: EGM-derived data

rn08 <- census(final, 2008) %>%  dplyr::filter(origin_local == T)

tot <- nrow(rn08)

dist2 <- 
  rn08 %>%
  dplyr::mutate(
    ag = cut(age_at_census, c(cuts2, Inf), include.lowest= T, right = F, labels = ag_names)
  ) %>%
  dplyr::group_by(sex2, ag) %>%
  dplyr::summarise(
    pop = n(),
    n = pop / tot
  )

# 3. Get data from local censuses to compare against EGM data

# 3.1. 1981 Rio Negro census

ind81 <- individuals81 %>% dplyr::filter(!is.na(age81) & !is.na(sex) & sex != 3)

tot <- length(ind81$age81)

dist3 <- 
  ind81  %>% 
  dplyr::mutate(
    sex = plyr::mapvalues(
      sex, from = c(2,1), to = c("Women", "Men")
    ),
    sex = factor(sex, levels = c("Women", "Men")),
    ag = cut(age81, c(cuts2, Inf), include.lowest= T, right = F, labels = ag_names)
  ) %>%
  dplyr::group_by(sex, ag) %>%
  dplyr::summarise(
    pop = n(),
    n = pop / tot
  )

# 3.2. 2008 Rio Negro census

ind08 <- individuals08 %>% dplyr::filter(!is.na(age08) & !is.na(sex) & sex != 3)

tot <- length(ind08$age08)

dist4 <- 
  ind08  %>% 
  dplyr::mutate(
    sex = plyr::mapvalues(
      sex, from = c(2,1), to = c("Women", "Men")
    ),
    sex = factor(sex, levels = c("Women", "Men")),
    ag = cut(age08, c(cuts2, Inf), include.lowest= T, right = F, labels = ag_names)
  ) %>%
  dplyr::group_by(sex, ag) %>%
  dplyr::summarise(
    pop = n(),
    n = pop / tot
  )
```

All the datasets were then consolidated in a long format:

```{r}
cuts_y <- seq(-0.20,0.20,0.10)

rn81_df <- dist1 %>%
  dplyr::rename(Sex = sex2) %>%
  ungroup() %>% 
  dplyr::mutate(
    Sex = plyr::mapvalues(
      Sex, from = c("Female", "Male"), to = c("Women", "Men")),
    Sex = factor(Sex, levels = c("Women", "Men"))
  ) %>% 
  dplyr::mutate(
    source = "EGM-reconstructed data",
    year = "1978"
  )%>%
  dplyr::mutate(
    pop = ifelse(Sex=="Women", pop*(-1), pop*1),
    n = ifelse(Sex=="Women", n*(-1), n*1),
    share = paste0(abs(round(n*100,1)), "%")
  )

rn08_df <- dist2 %>%
  dplyr::rename(Sex = sex2) %>%
  ungroup() %>% 
  dplyr::mutate(
    Sex = plyr::mapvalues(
      Sex, from = c("Female", "Male"), to = c("Women", "Men")),
    Sex = factor(Sex, levels = c("Women", "Men"))
  ) %>% 
  dplyr::mutate(source = "EGM-reconstructed data",
                year = "2008"
  )%>%
  dplyr::mutate(
    pop = ifelse(Sex=="Women", pop*(-1), pop*1),
    n = ifelse(Sex=="Women", n*(-1), n*1),
    share = paste0(abs(round(n*100,1)), "%")
  ) %>% 
  na.omit

inde81_df  <- dist3 %>%
  dplyr::rename(Sex = sex) %>%
  dplyr::mutate(source = "Local census data",
                year = "1978"
  )%>%
  dplyr::mutate(
    pop = ifelse(Sex=="Women", pop*(-1), pop*1),
    n = ifelse(Sex=="Women", n*(-1), n*1),
    share = paste0(abs(round(n*100,1)), "%")
  ) 

inde08_df  <- dist4 %>%
  dplyr::rename(Sex = sex) %>%
  dplyr::mutate(source = "Local census data",
                year = "2008"
  )%>%
  dplyr::mutate(
    pop = ifelse(Sex=="Women", pop*(-1), pop*1),
    n = ifelse(Sex=="Women", n*(-1), n*1),
    share = paste0(abs(round(n*100,1)), "%")
  ) 

both <-  
  bind_rows(
    rn81_df
    ,rn08_df
    ,inde81_df
    ,inde08_df
  )
```

The text also makes reference to dependency ratios, which were computed as follows:

```{r}

# Get df of all age distributions
all_df <- lapply(mget(ls(pattern = "^dist[1-4]")), function(df){ 
  df %>% 
    ungroup() %>% 
    na.omit %>% 
    dplyr::rename(sex = dplyr::starts_with("sex")) %>% 
    select(-n) %>% 
    dplyr::mutate(sex = as.character(sex), ag = as.character(ag)) %>% 
    # Remove all sex
    group_by(ag) %>% 
  dplyr::summarise_if(is.numeric, list(~sum)) %>% 
    dplyr::mutate(share = pop/sum(pop))
}) %>% 
  reduce(left_join, by = c("ag"))

new_cols <- c( "ag", 
   paste(
     c("pop","share"),
     c(rep("rn81",2),rep("rn08",2), rep("inde81",2), rep("inde08", 2)),
     sep = "_"
   )
  
)

colnames(all_df) <- new_cols

dep <- c(1,7,8)
indep <- c(2,3,4,5,6)

dep_df <- all_df %>% 
  select(dplyr::starts_with("pop_"))

ratios <- colSums( dep_df[dep,] ) / colSums( dep_df[indep,] )
ratios <- round(ratios,2)
```

At this point, it was possible to compute the Whipple index for the four populations of interest, which was later included in the graph:

```{r}
   
ages <- list(
  rn81=rn81 %>% pull(age_at_census)
  , rn08=rn08 %>% pull(age_at_census)
  , ind81=ind81 %>% pull(age81)
  , ind08=ind08 %>% pull(age08)
)

w_index <- lapply(ages, function(x) {
  w <- whipple(round(x), printme = F)
  }) %>% 
  unlist %>% 
  as.numeric

w_index <-  ifelse(w_index <105, "<105", 
                   ifelse(w_index >125, ">125", round(w_index, 0)))

# Create labels df

w_df <- data.frame(
  x = rep(levels(both$ag)[7], 4)
  , y = - 150
  , year = rep(unique(both$year, 2)) 
  , source = sort(rep(unique(both$source), 2))
  , label = paste("Whipple:", unlist(w_index, use.names = T), " \n", "Dep. ratio:", ratios)
  , Sex = c("Women")
  )
```

Finally, this was all put together to generate Figure 3: 

```{r, fig.height = 5, fig.width = 9, fig.align = "center", warning=F}
  
both %>%
    ggplot(aes(x=ag, y = pop, label = share)) +
    geom_col(aes(fill=Sex)) +
    geom_label(data = w_df, aes(x = x, y = y, label = label)) +
    geom_text(aes(label = share),
              position = position_nudge(y = ifelse(both$Sex == "Women", -30, 30)),
              size = 4
    ) +
    scale_fill_manual("",labels = c("Women", "Men"), values = suf_col) +
    coord_flip() +
    scale_y_continuous(
      "", breaks = scales::pretty_breaks(n = 6),
      labels = function(br) ifelse(abs(br)>=1000,
                                   paste0(abs(br)/1000, "k"),
                                   abs(br))
    ) +
    xlab("") +
    facet_grid(year~source) +
    # theme_dag_big(size = 20) +
      theme_dag() +
    theme(
      legend.position = 'bottom'
      ,axis.title.x=element_blank()
      )

```

The text also references other ways in which the EGM-generated data was compared to the local census data as a way of checking for data quality. These included comparing the population distributions for all sexes combined. This was done for the INDE census of 1981:

```{r}
chisq.test(x = all_df$pop_rn81, p = all_df$share_inde81)
```

and for the COCAHICH census of 2008:

```{r}
chisq.test(x = all_df$pop_rn08, p = all_df$share_inde08)
```

## Figure 4 Age-sex distribution of massacre-derived mortality in Rio Negro reconstructed from EGM data

This figure shows the distribution of excess mortality in the village after the 1982 massacres, bot in absolute and relative terms. We initially define the parameters for the analysis

```{r}

start <- 0
y_max <- 50

by <- 10
sep <- "-"

cuts2 <- c(seq(start,y_max,by))

ag_names <- c( paste0(start, sep, by-1), 
               paste( seq(by,y_max - by, by), seq(by + by - 1,y_max, by), sep =  sep), 
               paste0(y_max, "+"))
```

Later on, we estimate the denominator: total population alive in 1981, before the killings, by age and sex:

```{r}
all <- 
  final %>%
  census(1981) %>%
  dplyr::mutate(
    age_conflict = ifelse(DoB <= 1982, 1982-DoB, NA),
    ag_conflict = cut(age_conflict, c(cuts2, Inf), right = F, labels = ag_names)
  ) %>% 
  dplyr::count(sex2, ag_conflict) %>% 
  na.omit
```

The following code estimates the number killed in the 1982 massacres and what share of the population this represented:

```{r}
died <-
  final %>%
  census(1981) %>%
  dplyr::filter(died_conflict) %>% 
  dplyr::mutate(
    age_conflict = ifelse(DoB <= 1982, 1982-DoB, NA),
    ag_conflict = cut(age_conflict, c(cuts2, Inf), right = F, labels = ag_names)
  ) %>% 
  dplyr::count(sex2, ag_conflict) %>% 
  na.omit %>%
  dplyr::mutate(
    share = n/all$n * 100
  ) %>%
  data.frame

died_df <- 
  died %>% 
  dplyr::rename(sex = sex2) %>% 
  reshape2::melt(id = c("sex", "ag_conflict")) %>% 
  dplyr::mutate(
    variable = mapvalues(variable, 
                         from = c("n", "share"), 
                         to = c("All direct conflict deaths (count)", 
                                "Deaths as share of total population (%)")
    )
  ) %>% 
  dplyr::mutate(
    sex = plyr::mapvalues(
      sex, from = c("Female", "Male"), to = c("Women", "Men")),
    sex = factor(sex, levels = c("Women", "Men"))
  ) 
```

Finally, we plot the figure:

```{r, fig.height = 5, fig.width = 8, fig.align = "center"}  
died_df %>% 
  ggplot(aes(x = ag_conflict, y = value, group = sex,
             colour = sex, shape = sex)) +
  geom_line() +
  geom_point(size = 4) +
  labs(x="",y="") +
  scale_shape_discrete("Age at conflict") +
  scale_colour_manual("Age at conflict",values = suf_col) +
  facet_grid(.~variable) +
  theme_dag() +
  theme(legend.position = "bottom")
```

## Figure 5. Size of egocentric kinship network in Rio Negro over time by sex and age group (1980-2015)

This figure represents the average number of relatives alive by birth cohort for inhabitants of Rio Negro on a yearly basis. The analysis is relatively straightforward as the yearly number of relatives was estimated beforehand from the `final` dataset and saved to the `yearly_nets` data frame, which is used in the code below. The code to create the latter is not included in this vignette but is available upon request. 

```{r}
# Define parameters for analysis

min_year <- 1980

custom_interval <- c(0,15,45)
levels_ordered <- c("0 to 14", "15 to 44", "45+")

nets <-
  merge(
    yearly_nets,
    dplyr::select(final, ego = ind_id, DoB, sex = sex2),
    by = "ego",
    all.x = T
  ) %>%
  dplyr::filter(!is.na(sex)) %>%
  dplyr::filter(year >= min_year) %>%
  dplyr::mutate(
    age_now = year - DoB,
    ag_now = cut(age_now, c(custom_interval, Inf), right = F, labels = levels_ordered),
    ext2 = relatives_alive_ext- relatives_alive_nuc
  ) %>%
  dplyr::select(year, ag_now, sex,nuc = relatives_alive_nuc, ext = ext2) %>%
  group_by(year, ag_now, sex) %>%
  dplyr::summarise(
    nuc = mean(nuc),
    ext = mean(ext)
  ) %>%
  data.frame
```

Finally, we transform the data to a long format in order to plot it using `ggplot2`:

```{r, fig.height = 5, fig.width = 6, fig.align = "center"}
fam_size_df <- 
    nets %>%
    dplyr::mutate(
      sex = plyr::mapvalues(
        sex, from = c("Female", "Male"), to = c("Women", "Men")),
      sex = factor(sex, levels = c("Women", "Men"))
    ) %>%
    reshape2::melt(.,id = c("year", "ag_now", "sex")) 
  
fam_size_df    %>%
    ggplot(aes(x = year, y = value, fill = variable)) +
    geom_col(position=position_stack(reverse = T)) +
    scale_fill_manual("", labels = c("Nuclear kin", "Extended kin"), values = suf_col) +
    geom_vline(xintercept = 1982, linetype = "dashed") +
    scale_x_continuous("", breaks = seq(1980, 2010, 10), 
                       labels = c(1980, 90, 2000, 10)) +
    scale_y_continuous("Relatives alive (mean)") +
    facet_grid(sex~ag_now) +
    # theme_dag_big(size = 26) +
      theme_dag() +
    theme(legend.position = "bottom")
```

## Table 1. Data quality and completeness of EGM-generated data: An evaluation of four potential sources of bias

Table 1 shows different measures of data quality for the EGM data. The columns of the table refer to:

a. The degree of age heaping; 
b. the share of individuals with no reported parents; 
c. the share of individuals for whom the date of birth is unknown

The table is constructed in several steps. The first chunk looks at social distance from the respondent:

```{r}
rm(list=ls(pattern = "^res[0-9]"))

min_birth_year <- 1960

# 7.1. Social distance 

# 7.1.1. Get social distance from ego to respondent

# Interviews as networks

# split ind.q depending on the household of the interview
inds <- split(ind.q,ind.q$h_id)

# create nodes df

nodes <- lapply(inds, function (x) {
  data.frame(id=unique(x$ind_id), stringsAsFactors = F)
})

# create links  df 

# Transform relational fields to network format
# WEIGHTS: if a parent-son relationship, weight is 1
#          if affinal relationship, weight = 0.5

# get 100 differents dfs with network links
links <- lapply(inds,kin_as_net)

# convert to list of network object 

# d describes the edges of the network. Its first two columns are the IDs of the source and the target 
# node for each edge. The following columns are edge attributes (weight, type, label, or anything else).
# vertices starts with a column of node IDs. Any following columns are interpreted as node attributes.

net <- lapply(1:100, function(x) {
  l <- links[[x]]
  n <- nodes[[x]]
  igraph::graph_from_data_frame(d = l, vertices = n, directed = T)
})

# 7.1.2. Establish kinship between respondent and individuals

ind.q2 <- 
  merge(
    ind.q %>%
      dplyr::mutate(
        id = as.numeric(str_extract(ind_id, "[0-9]{1,3}$") )
      )
    ,
    paradata %>% 
      select(h_id, respondent = respondent1),
    by = "h_id",
    all.x = T
  ) %>% 
  dplyr::mutate(
    respondent = ifelse(is.na(respondent), 1, respondent)
  )

ind.q_l <- split(ind.q2, ind.q2$h_id)

distance_to_respondent <- 
  lapply(1:100, function(n) {
    gr <- net[[n]]
    df <- ind.q_l[[n]]
    
    respondent <- df$respondent[1]
    egos <- df$id
    
    suppressWarnings(
      paths <- shortest_paths(gr, 
                              from = respondent, 
                              to = V(gr)
                              , mode = "all"
      )
    )
    
    sapply(paths$vpath, length)
    
  })

ind.q_temp <- 
  ind.q %>%
  dplyr::mutate(
    distance_to_respondent =  unlist(distance_to_respondent),
    distance_ntile = ntile(distance_to_respondent, 4)
    , DoB_y = year(dmy(DoB))
  ) %>% 
  dplyr::filter(DoB_y >= min_birth_year)

# 1. a heaping ####

heaping <- 
  ind.q_temp %>% 
  dplyr::filter(!is.na(DoB)) %>% 
  dplyr::mutate(
    DoB_y = year(dmy(DoB)),
    age_fieldwork = 2016-DoB_y
  ) %>%
  group_by(distance_ntile) %>%
  dplyr::summarise(
    w_reproductive = 
      round(whipple(age_fieldwork,min_age = 18, max_age = 47, printme = F), 1),
    w_all = 
      round(whipple(age_fieldwork,min_age = 23, max_age = 62, printme = F), 1)
  ) %>%
  data.frame

# 1. b-c missing ####

missing <-
  merge(
    ind.q_temp %>% select(ind_id, idall, dplyr::starts_with("distance")),
    final %>% 
      dplyr::mutate(
        mother_NA = ifelse(is.na(mother), T, F),
        DoB_NA = ifelse(is.na(DoB), T, F),
        DoD_NA = ifelse(is.na(DoD_with_alive), T, F)
      ) %>% 
      select(ind_id, dplyr::ends_with("_NA")),
    by.x = "idall",
    by.y = "ind_id",
    all.x = T
  ) %>% 
  group_by(distance_ntile) %>%
  dplyr::summarise(
    mother_NA = round(sum(mother_NA, na.rm = T)/n()*100, 1),
    DoB_NA = round(sum(DoB_NA, na.rm = T)/n()*100, 1),
    DoD_NA = round(sum(DoD_NA, na.rm = T)/n()*100, 1)
  ) %>%
  data.frame

# 1. results ####

res1 <- 
  merge(
    heaping,
    missing 
  ) %>% 
  dplyr::filter(distance_ntile %in% c(1,4))
```

The second chunk focuses on comparing outcomes for individuals who were alive or death at the time of the data collection:

```{r}
# 7.2. Alive vs. death ####

ind.q_temp <- 
  ind.q_temp %>%
  dplyr::mutate(
    alive = ifelse(alive==1, T, F)
  )

# 2. a heaping ####

heaping <- 
  ind.q_temp %>% 
  dplyr::filter(!is.na(DoB) & !is.na(alive)) %>% 
  dplyr::mutate(
    DoB_y = year(dmy(DoB)),
    age_fieldwork = 2016-DoB_y
  ) %>%
  group_by(alive) %>%
  dplyr::summarise(
    w_reproductive = 
      round(whipple(age_fieldwork,min_age = 18, max_age = 47, printme = F), 1),
    w_all = 
      round(whipple(age_fieldwork,min_age = 23, max_age = 62, printme = F), 1)
  ) %>%
  data.frame

# 2. b-c missing ####

missing <-
  merge(
    ind.q_temp %>% 
      dplyr::filter(!is.na(alive)) %>% 
      select(ind_id, idall, alive),
    final %>% 
      dplyr::mutate(
        mother_NA = ifelse(is.na(mother), T, F),
        DoB_NA = ifelse(is.na(DoB), T, F),
        DoD_NA = ifelse(is.na(DoD_with_alive), T, F)
      ) %>% 
      select(ind_id, dplyr::ends_with("_NA")),
    by.x = "idall",
    by.y = "ind_id",
    all.x = T
  ) %>% 
  group_by(alive) %>%
  dplyr::summarise(
    mother_NA = round(sum(mother_NA, na.rm = T)/n()*100, 1),
    DoB_NA = round(sum(DoB_NA, na.rm = T)/n()*100, 1),
    DoD_NA = NA
  ) %>%
  data.frame

# 2. results ####

res2 <- 
  merge(
    heaping,
    missing 
  ) %>% 
  arrange(desc(alive))
```

The third chunk looks at locals and migrants separately:

```{r}
# 7.3. Local vs. migrant ####

ind.q_temp$origin_local <- 
  ifelse(ind.q_temp$origin_comunidad == "pacux" | ind.q_temp$origin_comunidad == "rio negro", T, F)

# 3. a heaping ####

heaping <- 
  ind.q_temp %>% 
  dplyr::filter(!is.na(DoB) & !is.na(origin_local)) %>% 
  dplyr::mutate(
    DoB_y = year(dmy(DoB)),
    age_fieldwork = 2016-DoB_y
  ) %>%
  # group_by(distance_to_respondent) %>% 
  group_by(origin_local) %>%
  dplyr::summarise(
    w_reproductive = 
      round(whipple(age_fieldwork,min_age = 18, max_age = 47, printme = F), 1),
    w_all = 
      round(whipple(age_fieldwork,min_age = 23, max_age = 62, printme = F), 1)
  ) %>%
  data.frame

# 3. b-c missing ####

missing <-
  merge(
    ind.q_temp %>% 
      dplyr::filter(!is.na(origin_local)) %>% 
      select(ind_id, idall, origin_local),
    final %>% 
      dplyr::mutate(
        mother_NA = ifelse(is.na(mother), T, F),
        DoB_NA = ifelse(is.na(DoB), T, F),
        DoD_NA = ifelse(is.na(DoD_with_alive), T, F)
      ) %>% 
      select(ind_id, dplyr::ends_with("_NA")),
    by.x = "idall",
    by.y = "ind_id",
    all.x = T
  ) %>% 
  group_by(origin_local) %>%
  dplyr::summarise(
    mother_NA = round(sum(mother_NA, na.rm = T)/n()*100, 1),
    DoB_NA = round(sum(DoB_NA, na.rm = T)/n()*100, 1),
    DoD_NA = round(sum(DoD_NA, na.rm = T)/n()*100, 1)
  ) %>%
  data.frame

# 3. results ####

res3 <- 
  merge(
    heaping,
    missing 
  ) %>%
  arrange(desc(origin_local))
```

The fourth chunk focuses on the role of the interview length, measured as the number of records produced in each interview:

```{r}
# 7.4. By interview length ####

int_paradata <- 
  paradata %>% 
  select(h_id, n = no_u_members) %>% 
  dplyr::filter(h_id %in% unique(ind.q_temp$h_id))

int_paradata$n[is.na(int_paradata$n)] <- 
  ind.q_temp %>% 
  dplyr::count(h_id) %>% 
  dplyr::filter(h_id %in% int_paradata$h_id[is.na(int_paradata$n)]) %>%
  pull(n)

int_paradata <- 
  int_paradata %>%
  dplyr::mutate(
    ntile = ntile(n, 4)
  )

# 4.(a) heaping ####

heaping <- 
  ind.q_temp %>% 
  dplyr::filter(!is.na(DoB)) %>% 
  dplyr::mutate(
    DoB_y = year(dmy(DoB)),
    age_fieldwork = 2016-DoB_y
  ) %>%
  group_by(h_id) %>% 
  dplyr::summarise(
    w_reproductive = 
      round(whipple(age_fieldwork,min_age = 18, max_age = 47, printme = F), 1),
    w_all = 
      round(whipple(age_fieldwork,min_age = 23, max_age = 62, printme = F), 1)
  ) %>%
  data.frame

# 4.b-c missing ####
# For this, you need the idall ids

missing <-
  merge(
    ind.q_temp %>% select(ind_id, idall, h_id),
    final %>% 
      dplyr::mutate(
        mother_NA = ifelse(is.na(mother), T, F),
        DoB_NA = ifelse(is.na(DoB), T, F),
        DoD_NA = ifelse(is.na(DoD_with_alive), T, F)
      ) %>% 
      select(ind_id, dplyr::ends_with("_NA")),
    by.x = "idall",
    by.y = "ind_id",
    all.x = T
  ) %>% 
  group_by(h_id) %>%
  dplyr::summarise(
    mother_NA = round(sum(mother_NA, na.rm = T)/n()*100, 1),
    DoB_NA = round(sum(DoB_NA, na.rm = T)/n()*100, 1),
    DoD_NA = round(sum(DoD_NA, na.rm = T)/n()*100, 1)
  ) %>%
  data.frame

# 4. results ####

res4 <- 
  merge(
    merge(
      int_paradata,
      heaping
    ), 
    missing
  ) %>%
  # dplyr::filter(ntile %in% c(1,4)) %>%
  group_by(ntile) %>% 
  dplyr::summarise(
    w_all = mean(w_all),
    w_reproductive = mean(w_reproductive),
    mother_NA = mean(mother_NA),
    DoB_NA = mean(DoB_NA),
    DoD_NA = mean(DoD_NA)
  ) %>% 
  dplyr::rename(length_ntile = ntile) %>%
  dplyr::filter(length_ntile %in% c(1,4))
```

Putting all these results together, it is possible to generate the main components of Table 1, which were then assembled manually into a tabular format.

```{r} 
dfs <- lapply(mget(ls(pattern = "^res[0-9]")), function(df) {
  df %>% 
    mutate_if(is.numeric, list(~round(., digits = 1)))
})
  
print(dfs)
  
```

## Figure 1. Hypothetical genealogy demonstrating the logic of the EGM chain-referral sampling

Finally, although not a result of the study per se, the genealogical diagram presented as Figure 1 in the paper can be reproduced using the `kinship2` package in R:

```{r, fig.height = 5, fig.width = 6, fig.align = "center", message=F, warning=F}
library(kinship2)

ped <- data.frame(
  ped = 1
  , id = 1:11
  , dadid = c(3, NA, NA, 5, NA, NA, 5, 10, 7, NA, NA)
  , momid = c(4, NA, NA, 6, NA, NA, 6, 11, 8, NA, NA)
  , sex = c(1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2)
  , affected = c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0)
  , famid = 1 
)

relation <- matrix(c(1,2,4, 1), nrow = 1)

p <- pedigree(
  id = ped$id, 
  dadid = ped$dadid, 
  momid = ped$momid,
  sex = ped$sex, 
  affected = ped$affected,
  famid = ped$ped,
  relation = relation
  ) 

plot.pedigree(p['1']
              , id = LETTERS[1:11]
              , symbolsize = 1
              , mar = c(0,0,0,0)
              , cex = 1
              , pconnect = T
)
```

# Session information

Report by Diego Alburez-Gutierrez - alburezgutierrez[at]demogr.mpg.de; [www.alburez.me](http://alburez.me)

```{r, echo=FALSE, warning=FALSE}
print(paste("Report created:", Sys.time()))
sessionInfo()
```