colocation.Rmd

---
title: "Colocation data"
output:
  html_document:
    toc: yes
editor_options: 
  chunk_output_type: console
---

```{r setup, include = FALSE}
knitr::knit_hooks$set(
  margin1 = function(before, options, envir) {
    if (before) par(mgp = c(1.5, .5, 0), bty = "n", plt = c(.105, .97, .13, .97))
    else NULL
  },
  margin2 = function(before, options, envir) {
    if (before) par(mgp = c(2, .5, 0), bty = "n", plt = c(.105, .97, .13, .97))
    else NULL
  },
  margin3 = function(before, options, envir) {
    if (before) par(mgp = c(1.5, .5, 0), bty = "n", mai = rep(.1, 4))
    else NULL
  }
)

knitr::opts_chunk$set(echo       = TRUE,
                      cache      = TRUE,
                      margin1    = TRUE,
                      fig.retina = 2,
                      fig.align  = "center")

l <- "en_US.UTF-8"
Sys.setenv(LANGAGE = l)
Sys.setlocale(locale = l)
Sys.setlocale("LC_MESSAGES", l)
```

## Links to local data sets

Here we use the following links to data.

```{r}
census_path <- "~/Dropbox/aaa/studies/oucomo/clean data/census2019.rds"
colocation_path <- "~/Dropbox/aaa/studies/data for good/data/geoinsights/colocation maps/"
```

Change them accordingly if you want to run the script locally on your computer.

## Preambule

This document shows how to combine 3 data sets:

* the colocation dataset from the Facebook's
[Data for Good](https://dataforgood.fb.com) initiative;
* the geographical polygons from [GADM](https://gadm.org) (UC Davis);
* the 2019 population census data of Vietnam from [GSO](https://www.gso.gov.vn).

These three data sets provide information by district (roughly 700 in Vietnam).
The difficulties come from the fact that these 3 datasets listed above do not
have exactly the same districts definitions. Indeed, as the population grows
with time, districts tend to split. The problems we had to deal with here are

* one province and a few other districts are missing from the 2019 census data.
Missing value were completed by hand with data from
[Wikipedia](https://en.wikipedia.org/wiki/Vĩnh_Long_Province#Administrative_divisions).
* the polygon of one district is missing from GADM. This was completed by hand
with data from [OpenStreetMap](https://www.openstreetmap.org).
* the colocation dataset uses the district definition from
[Bing](https://www.bing.com/maps) that dates from at least 10 years back,
meaning that we have to merge a number of districts in both GADM and the census
data in order to make them consistent with the definition used in the colocation
data. In some cases, we had to split a district before merging the 2 parts to 2
different districts. In order to split the population accurately, we used also
population density raster data from [WorldPop](https://www.worldpop.org)
(University of Southampton).

## Packages

The needed packages:

```{r message = FALSE}
library(sf)
library(stars)
library(osmdata)
library(magrittr)
library(stringr)
library(lubridate)
library(tidyr)
library(purrr)
library(dplyr) # best to load last
```

Redefining the `hist()` function:

```{r}
hist2 <- function(...) hist(..., main = NA)
```

## Population density raster data

Downloading the population density data from [WorldPop](https://www.worldpop.org):

```{r eval = FALSE}
download.file("ftp://ftp.worldpop.org.uk/GIS/Population/Individual_countries/VNM/Viet_Nam_100m_Population/VNM_ppp_v2b_2020_UNadj.tif",
              "VNM_ppp_v2b_2020_UNadj.tif")
```

Loading the data:

```{r}
worldpop <- read_stars("VNM_ppp_v2b_2020_UNadj.tif")
```

## Google and Apple mobility data

Downloading the population mobility data:

```{r eval = FALSE}
download.file("https://www.dropbox.com/s/6fl62gcuma9890f/google.rds?raw=1", "google.rds")
download.file("https://www.dropbox.com/s/uuxxjm3cgs0a4gw/apple.rds?raw=1", "apple.rds")
```

Loading the data:

```{r eval = FALSE}
google <- readRDS("google.rds")
apple <- readRDS("apple.rds") %>% 
  mutate_if(is.numeric, subtract, 100)
```

## Polygons from GADM

Downloading the polygons from [GADM](https://gadm.org):

```{r eval = FALSE}
download.file("https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_VNM_0_sf.rds", "gadm36_VNM_0_sf.rds")
download.file("https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_VNM_1_sf.rds", "gadm36_VNM_1_sf.rds")
download.file("https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_VNM_2_sf.rds", "gadm36_VNM_2_sf.rds")
download.file("https://biogeo.ucdavis.edu/data/gadm2.8/rds/VNM_adm2.rds"       , "VNM_adm2.rds")
download.file("https://biogeo.ucdavis.edu/data/gadm2.8/rds/VNM_adm3.rds"       , "VNM_adm3.rds")
```

Loading the polygons:

```{r}
vn0 <- readRDS("gadm36_VNM_0_sf.rds")     # country polygon
vn1 <- readRDS("gadm36_VNM_1_sf.rds")     # provinces polygons

vn2 <- readRDS("gadm36_VNM_2_sf.rds") %>% # districts polygons
  transmute(province = str_squish(NAME_1),
            district = str_squish(NAME_2) %>%
              str_remove_all("Thành Phố |Thị Xã |Quận "))

vn2_old <- readRDS("VNM_adm2.rds") %>%    # old version of the districts polygons
  st_as_sf() %>% 
  transmute(province = str_squish(NAME_1),
            district = str_squish(NAME_2) %>% 
              str_remove_all("Thành Phố |Thị Xã |Quận ") %>% 
              str_replace("Chiêm Hoá", "Chiêm Hóa"))

vn3_old <- readRDS("VNM_adm3.rds") %>%    # old version of the communes polygons
  st_as_sf() %>% 
  transmute(province = str_squish(NAME_1),
            district = str_squish(NAME_2) %>% 
              str_remove_all("Thành Phố |Thị Xã |Quận ") %>% 
              str_replace("Chiêm Hoá", "Chiêm Hóa"),
            commune = str_squish(NAME_3)) %>% 
  arrange(province, district, commune)
```

Removing the commune Huæi Luông from the district Sìn Hồ:

```{r}
vn2_old %<>% 
  filter(district == "Sìn Hồ") %>% 
  st_set_geometry(st_union(filter(vn3_old, district == "Sìn Hồ", commune != "Huæi Luông"))) %>% 
  rbind(filter(vn2_old, district != "Sìn Hồ")) %>% 
  arrange(province, district)
```

## Adding the polygon for Côn Đảo from OpenStreetMap

Downloading the administrative boundaries from [OpenStreetMap](https://www.openstreetmap.org):

```{r eval = FALSE}
con_dao <- c(106.523384, 8.622214, 106.748218, 8.782639) %>%
  opq() %>% 
  add_osm_feature(key = "boundary", value = "administrative") %>% 
  osmdata_sf()
```

```{r include = FALSE, eval = FALSE}
saveRDS(con_dao, "con_dao.rds")
```

```{r include = FALSE}
con_dao <- readRDS("con_dao.rds")
```

The main island is made of lines. Let's transform them into a polygon:

```{r}
main_island <- con_dao$osm_lines %>%
  st_combine() %>%
  st_polygonize() %>%
  st_geometry() %>%
  st_collection_extract("POLYGON")
```

The other islands are already polygons. Let's extract and merge them with the
newly created polygon of the main island:

```{r}
con_dao <- con_dao$osm_polygons %>%
  st_geometry() %>% 
  c(main_island) %>% 
  st_multipolygon() %>% 
  st_sfc(crs = st_crs(vn2))
```

Here are the polygons for Côn Đảo:

```{r}
con_dao %>% 
  st_geometry() %>% 
  plot(col = "grey")
```

Let's add the polygon of Côn Đảo to the GADM data:

```{r}
vn2 %<>%
  filter(province == "Bà Rịa - Vũng Tàu") %>% 
  head(1) %>% 
  mutate(district = "Côn Đảo") %>% 
  st_set_geometry(con_dao) %>% 
  rbind(vn2)
```

## Adding the census 2019 data

For some reason the province of `Vĩnh Long` is missing... We add information
form [Wikipedia](https://en.wikipedia.org/wiki/Vĩnh_Long_Province#Administrative_divisions):

```{r eval = FALSE}
census <- census_path %>%
  readRDS() %>% 
  group_by(province, district) %>% 
  summarise(n = sum(n)) %>% 
  ungroup() %>% 
  mutate(province = str_squish(province) %>%
                    str_remove_all("Thành phố |Tỉnh ") %>% 
                    str_replace("Đăk Lăk"             , "Đắk Lắk") %>% 
                    str_replace("Đăk Nông"            , "Đắk Nông") %>% 
                    str_replace("Khánh Hoà"           , "Khánh Hòa") %>% 
                    str_replace("Thanh Hoá"           , "Thanh Hóa"),
         district = str_squish(district) %>%
                    str_replace("Thành phố Cao Lãnh"  , "Cao Lãnh (Thành phố)") %>% 
                    str_replace("Thị xã Hồng Ngự"     , "Hồng Ngự (Thị xã)") %>% 
                    str_replace("Thị xã Kỳ Anh"       , "Kỳ Anh (Thị xã)") %>% 
                    str_replace("Thị xã Long Mỹ"      , "Long Mỹ (Thị xã)") %>% 
                    str_replace("Thị xã Cai Lậy"      , "Cai Lậy (Thị xã)") %>% 
                    str_replace("Thị xã Duyên Hải"    , "Duyên Hải (Thị xã)") %>% 
                    str_remove_all("Huyện |Huỵên |Quận |Thành phố |Thành Phô |Thành Phố |Thị xã |Thị Xã ") %>% 
                    str_replace("Hoà Vang"            , "Hòa Vang") %>% 
                    str_replace("Ứng Hoà"             , "Ứng Hòa") %>% 
                    str_replace("Đồng Phù"            , "Đồng Phú") %>% 
                    str_replace("Đắk Glong"           , "Đăk Glong") %>% 
                    str_replace("Ia H’Drai"           , "Ia H' Drai") %>% 
                    str_replace("ý Yên"               , "Ý Yên") %>% 
                    str_replace("Bác ái"              , "Bác Ái") %>% 
                    str_replace("Phan Rang- Tháp Chàm", "Phan Rang-Tháp Chàm") %>% 
                    str_replace("Đông Hoà"            , "Đông Hòa") %>% 
                    str_replace("Tuy Hòa"             , "Tuy Hoà") %>% 
                    str_replace("Thiệu Hoá"           , "Thiệu Hóa")) %>% 
  bind_rows(
    bind_cols(
      data.frame(province = rep("Vĩnh Long", 8)),
      tribble(
        ~district  ,     ~n,
        "Bình Tân" ,  93758, # 2009
        "Long Hồ"  , 160537, # 2018
        "Mang Thít", 103573, # 2018
        "Tam Bình" , 162191, # 2003
        "Trà Ôn"   , 149983, # 2003
        "Vũng Liêm", 176233, # 2003
        "Bình Minh",  95282, # 2003
        "Vĩnh Long", 200120  # 2018
      )
    ),
    tribble(
      ~province  , ~district     , ~n, 
      "Điện Biên", "Mường Ảng"   ,  37077, # 2006
      "Hải Phòng", "Bạch Long Vĩ",    912, # 2018
      "Phú Thọ"  , "Thanh Sơn"   , 187700, # 2003
      "Quảng Trị", "Cồn Cỏ"      ,    400  # 2003
    )
  )
```

```{r include = FALSE, eval = FALSE}
saveRDS(census, "census.rds")
```

```{r include = FALSE}
census <- readRDS("census.rds")
```

Let's check that the names of the provinces in the GADM and census data sets are
the same:

```{r}
identical(sort(unique(census$province)), sort(unique(vn2$province)))
```

Let's check that the districts in the GADM and the census data are the same:

```{r}
nrow(anti_join(census, vn2, c("province", "district"))) < 1
nrow(anti_join(vn2, census, c("province", "district"))) < 1
```

Let's merge the census and GADM data sets:

```{r}
vn2 %<>% left_join(census, c("province", "district"))
```

Let's check that no district is duplicated:

```{r}
vn2 %>% 
  st_drop_geometry() %>% 
  group_by(province, district) %>% 
  tally() %>% 
  filter(n > 1) %>% 
  nrow() %>% 
  is_less_than(1)
```

## Merging some districts

The colocation data use the [Bing](https://www.bing.com/maps) polygons, which do
not seem to be very much up to date. In order to adjust for that, we need to
merge the following districts in the GADM data:

* Hà Tĩnh: merge Kỳ Anh (Thị xã) into Kỳ Anh that split in 2015
* Hậu Giang: merge Long Mỹ (Thị xã) into Long Mỹ that split in 2015
* Tiền Giang: merge Cai Lậy (Thị xã) into Cai Lậy that split in 2013
* Trà Vinh: merge Duyên Hải (Thị xã) into Duyên Hải that split in 2015
* Bình Dương: merge Bắc Tân Uyên into Tân Uyên that split in 2013
* Bình Dương: merge Bàu Bàng into Bến Cát that split in 2013
* Hà Nội: merge Bắc Từ Liêm and Nam Từ Liêm into Từ Liêm that split in 2013
* Long An: merge Kiến Tường into Mộc Hóa that split in 2013
* Nghệ An: merge Hoàng Mai into Quỳnh Lưu that split in 2013
* Quảng Bình: merge Ba Đồn into Quảng Trạch that split in 2013

Furthermore, some of the districts need to be split in two, with each part
merged to a different district. That's the case for these 3 districts:

* Điện Biên: split Nậm Pồ into 2 parts and merge them with Mường Chà and Mường Nhé (2012)
* Lai Châu: split Nậm Nhùn into 2 parts and merge them with Mường Tè and Sìn Hồ (2012)
* Tuyên Quang: split Lâm Bình into 2 parts and merge them with Nà Hang and Chiêm Hóa (2011)

As illustrated below:

```{r}
plot_districts <- function(d1, d2, d3) {
  tmp <- vn2 %>% 
    filter(district %in% c(d1, d2, d3)) %>% 
    st_geometry()
  
  plot(tmp)
  plot(worldpop, add = TRUE, main = NA)
  plot(tmp, add = TRUE, col = adjustcolor("green", .2))
  
  vn2 %>% 
    filter(district == d1) %>% 
    st_geometry() %>% 
    plot(add = TRUE, col = adjustcolor("red", .2))
  
  vn2_old %>% 
    filter(district %in% c(d2, d3)) %>% 
    st_geometry() %>% 
    plot(add = TRUE, border = "blue", lwd = 2)
}
```

```{r}
plot_districts("Nậm Pồ"  , "Mường Chà", "Mường Nhé")
plot_districts("Lâm Bình", "Nà Hang"  , "Chiêm Hóa")
plot_districts("Nậm Nhùn", "Mường Tè" , "Sìn Hồ")
```

The raster data shows the population density from
[WorldPop](https://www.worldpop.org) that we'll use to split the population of
the red district into the 2 blue districts. Here are two functions to fix these
2 above-mentioned problems. Here is the first function:

```{r}
merge_districts <- function(vn, d1, d2, p) {
  dst <- c(d1, d2)
  
  tmp <- vn %>% 
    filter(district %in% dst, province == p) %>% 
    mutate(n = sum(n))
  
  tmp %<>% 
    filter(district %in% d1) %>% 
    st_set_geometry(st_union(tmp))
  
  vn %>% 
    filter(! (district %in% dst & province == p)) %>% 
    rbind(tmp) %>% 
    arrange(province, district)
}
```

The second function needs this function:

```{r}
proportion <- function(to_cut, one_district, new_vn = vn2, old_vn = vn2_old, rstr = worldpop) {
  to_cut <- filter(new_vn, district == to_cut)
  one_part <- st_intersection(to_cut, filter(old_vn, district == one_district))
  
  wp0 <- rstr %>%
    st_crop(to_cut) %>% 
    st_as_stars() %>% 
    unlist() %>% 
    sum(na.rm = TRUE)
  
  rstr %>% 
    st_crop(one_part) %>% 
    st_as_stars() %>% 
    unlist() %>% 
    sum(na.rm = TRUE) %>% 
    divide_by(wp0)
}
```

Let's try it:

```{r warning = FALSE, message = FALSE}
proportion("Nậm Nhùn", "Mường Tè" , vn2, vn2_old, worldpop)
proportion("Nậm Pồ"  , "Mường Chà", vn2, vn2_old, worldpop)
proportion("Lâm Bình", "Nà Hang"  , vn2, vn2_old, worldpop)
```

And here is the second function we needed:

```{r}
merge_back_districts <- function(c2, d1, d2, d3, c1 = vn2_old, rstr = worldpop) {
  dsts <- c(d1, d2, d3)
  
  tmp <- c2 %>% 
    filter(district %in% dsts) %$%
    setNames(n, district)

  half1 <- round(proportion(d1, d2, c2, c1, rstr) * tmp[d1])

  half2 <- tmp[d1] - half1
  tmp[d2] <- tmp[d2] + half1
  tmp[d3] <- tmp[d3] + half2
  tmp <- tmp[dsts[-1]]
  
  c1 %>% 
    filter(district %in% dsts[-1]) %>% 
    mutate(n = tmp) %>% 
    select(everything(), geometry) %>% 
    rbind(filter(c2, ! district %in% dsts)) %>% 
    arrange(province, district)
}
```

Let's now call these 2 functions to do the mergings:

```{r warning = FALSE, message = FALSE}
vn2 %<>%
  merge_districts("Kỳ Anh"     , "Kỳ Anh (Thị xã)"   , "Hà Tĩnh") %>% 
  merge_districts("Long Mỹ"    , "Long Mỹ (Thị xã)"  , "Hậu Giang") %>% 
  merge_districts("Cai Lậy"    , "Cai Lậy (Thị xã)"  , "Tiền Giang") %>% 
  merge_districts("Duyên Hải"  , "Duyên Hải (Thị xã)", "Trà Vinh") %>% 
  merge_districts("Tân Uyên"   , "Bắc Tân Uyên"      , "Bình Dương") %>%
  merge_districts("Bến Cát"    , "Bàu Bàng"          , "Bình Dương") %>% 
  merge_districts("Bắc Từ Liêm", "Nam Từ Liêm"       , "Hà Nội") %>% # then rename to Từ Liêm
  merge_districts("Mộc Hóa"    , "Kiến Tường"        , "Long An") %>% 
  merge_districts("Quỳnh Lưu"  , "Hoàng Mai"         , "Nghệ An") %>% 
  merge_districts("Quảng Trạch", "Ba Đồn"            , "Quảng Bình") %>% 
  merge_back_districts("Nậm Pồ"  , "Mường Chà", "Mường Nhé") %>% 
  merge_back_districts("Nậm Nhùn", "Mường Tè" , "Sìn Hồ") %>% 
  merge_back_districts("Lâm Bình", "Nà Hang"  , "Chiêm Hóa") %>% 
  mutate(district = str_replace(district, "Bắc Từ Liêm", "Từ Liêm")) # here the renaming
```

Let's calculate and add the areas:

```{r}
vn2 %<>% 
  mutate(area_km2 = vn2 %>%
           st_geometry() %>% 
           st_area() %>% 
           as.numeric() %>% 
           divide_by(1e6)) %>% 
  select(everything(), geometry)
```

## Visualizations of the GADM / census data

### Population sizes

The distribution of the districts' population sizes:

```{r}
hist2(vn2$n, n = 50, xlab = "population size", ylab = "number of districts", axes = FALSE)
axis(1, seq(0, 1e6, 2e5), format(seq(0, 10, 2) * 1e5, big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2)
```

Let's define a palette of colors:

```{r}
cb <- RColorBrewer::brewer.pal(9, "YlOrBr")
color_generator <- colorRampPalette(cb)
pal <- color_generator(10)
```

The distribution of the districts' population sizes where all the bars are of
the same area and represent one decile of the data:

```{r}
hist2(vn2$n, quantile(vn2$n, seq(0, 1, le = 11)), col = pal, axes = FALSE,
      xlab = "population size", ylab = "density of probability")
axis(1, seq(0, 1e6, 2e5), format(seq(0, 10, 2) * 1e5, big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2)
```

Let's map the population sizes of the districts:

```{r}
vn2 %>% 
  st_geometry() %>% 
  plot(lwd = .1, col = pal[cut(vn2$n, quantile(vn2$n, seq(0, 1, le = 11)))], main = NA)

vn0 %>% 
  st_geometry() %>% 
  plot(add = TRUE)
```

The mean and variance of the district population sizes:

```{r}
mean(vn2$n)
median(vn2$n)
```

### Areas

The distribution of the districts' areas:

```{r margin1 = FALSE, margin2 = TRUE}
hist2(vn2$area_km2, n = 50,
      xlab = expression(paste("area ", (km^2))), ylab = "number of districts")
```

Mean and variance of the districts's areas:

```{r}
mean(vn2$area_km2)
median(vn2$area_km2)
```

### Densities

Some quantiles of the districts' densities:

```{r}
(quants <- quantile(vn2$n / vn2$area_km2, c(.025, .25, .5, .75, .975)))
```

The distribution of the districts' densities, on a log scale, with quantiles:

```{r margin1 = FALSE, margin2 = TRUE}
hist2(log10(vn2$n / vn2$area_km2), n = 50, axes = FALSE,
      xlab = expression(paste("density (/", km^2, ")")), ylab = "number of districts")
axis(1, 1:4, c("10", "100", "1000", "10000"))
axis(2)
abline(v = log10(quants), lty = c(3, 2, 1, 2, 3), col = "blue", lwd = 2)
```

Same distribution as above where all the bars have the same area representing
10% of the data:

```{r margin1 = FALSE, margin2 = TRUE}
xs <- log10(vn2$n / vn2$area_km2)
hist2(xs, quantile(xs, seq(0, 1, le = 11)), col = pal, axes = FALSE,
      xlab = expression(paste("density (/", km^2, ")")), ylab = "density of probability")
axis(1, 1:4, c("10", "100", "1000", "10000"))
axis(2)
```

Mapping the districts' populations densities:

```{r}
vn2 %>% 
  transmute(dens = n / area_km2) %>% 
  st_geometry() %>% 
  plot(lwd = .1, col = pal[cut(vn2$n, quantile(vn2$n, seq(0, 1, le = 11)))], main = NA)

vn0 %>% 
  st_geometry() %>% 
  plot(add = TRUE)
```

The relationship between the population size and the population density:

```{r}
vn2 %>% 
  mutate(dens = n / area_km2) %$%
  plot(log10(n), log10(dens), col = "blue", axes = FALSE, xlab = "population size",
       ylab = expression(paste("population density (/", km^2, ")")))

axis(1, 3:6, format(10^(3:6), big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2, 1:4, format(10^(1:4), big.mark = ",", scientific = FALSE, trim = TRUE))
```

meaning that the density is increasing with some power of the population size.

## Colocation data

The list of colocation data files:

```{r}
files <- dir(colocation_path)
```

There are `r length(files)` of them:

```{r}
length(files)
```

Making names from files names:

```{r}
weeks <- str_remove_all(files, "^.*__|.csv")
```

Loading the colocation data into a list (one slot per week):

```{r message = FALSE, eval = FALSE}
colocation <- paste0(colocation_path, dir(colocation_path)) %>%
  map(readr::read_csv) %>%
  setNames(weeks) %>% 
  map(select, -country, -ds) # remove the country code (useless) and the ds (which is the name of the slot)
```

```{r include = FALSE, eval = FALSE}
saveRDS(colocation, "colocation1.rds")
```

```{r include = FALSE}
colocation <- readRDS("colocation1.rds")
```

The colocation data looks like this:

```{r}
head(colocation, 1)
```

The slot names are the last day of the 7-day period over which the data are
collected:

```{r}
names(colocation)
```

## Getting rid off whatever is not linked to Vietnam

### The problem

Here we show what the problem is (i.e. some of the data are outside Vietnam).
The following function plots the colocation data:

```{r}
plot_fb <- function(df, xlim, ylim, col) {
  plot(xlim, ylim, asp = 1, xlab = "longitude", ylab = "latitude", type = "n")
  maps::map(col = "grey", fill = TRUE, add = TRUE)
  points(df[[1]], df[[2]], pch = ".", col = col)
  axis(1)
  axis(2)
  box(bty = "o")
}
```

Let's consider one week:

```{r}
june23 <- colocation$`2020-06-23`
```

The locations in this week are within these boundaries:

```{r}
(xlim <- range(range(june23$lon_1), range(june23$lon_2)))
(ylim <- range(range(june23$lat_1), range(june23$lat_2)))
```

Let's plot the polygon 1:

```{r}
june23 %>%
  select(lon_1, lat_1) %>% 
  distinct() %>% 
  plot_fb(xlim, ylim, "blue")
```

And the polygon 2:

```{r}
june23 %>%
  select(lon_2, lat_2) %>% 
  distinct() %>% 
  plot_fb(xlim, ylim, "red")
```

### The solution

In the following function, `df` is a data frame with the same column names as
a "colocation map" data frame. `pl` is an `sf` non-projected polygon. `type` is
either `1` or `2`.

```{r}
pts_in_pol <- function(type, df, pl, project = FALSE) {
  # assumes that sf is not projected.
  # 4326: non-projected
  # 3857: pseudo-Mercator (e.g. Google Maps)
  df %<>% st_as_sf(coords = paste0(c("lon_", "lat_"), type), crs = 4326)
  if (project) {
    df %<>% st_transform(3857)
    pl %<>% st_transform(3857)
  }
  df %>%
    st_intersects(pl) %>% 
    map_int(length)
}
```

The function returns a vector of length equal to the number of rows of `df` with
`1` if the points is inside the polygon `pl` and `0` otherwise. Let's try it:

```{r}
tmp <- pts_in_pol(1, june23, vn0)
```

It takes 3.5' and it's about 3 times slower if we project the points and polygon.
The arguments of the following function are the same as the previous one. It
uses the previous one to delete the records from `df` that have start and end
coordinates that are outside `pl`:

```{r}
rcd_in_pol <- function(df, pl, project = FALSE) {
  require(magrittr)
  1:2 %>%
    parallel::mclapply(pts_in_pol, df, pl, project, mc.cores = 2) %>% 
    as.data.frame() %>%
    rowSums() %>% 
    is_greater_than(0) %>% 
    magrittr::extract(df, ., ) # there is an extract() function in tidyr too
}
```

Let's try it:

```{r eval = FALSE}
june23 %<>% rcd_in_pol(vn0)
```

Let's process all the data (takes about 1 minute):

```{r}
colocation %<>% map(rcd_in_pol, vn0)
```

```{r include = FALSE, eval = FALSE}
saveRDS(colocation, "colocation2.rds")
```

```{r}
colocation <- readRDS("colocation2.rds")
```

## Working out a district ID common to GADM + census and colocation data

The list of district in the colocation data:

```{r}
col_names <- c("polygon_id", "polygon_name", "lon", "lat", "name_stack")

districts1 <- map_dfr(colocation, select, polygon1_id, polygon1_name, lon_1, lat_1, name_stack_1) %>% 
  setNames(col_names)

districts2 <- map_dfr(colocation, select, polygon2_id, polygon2_name, lon_2, lat_2, name_stack_2) %>% 
  setNames(col_names)

districts <- bind_rows(districts1, districts2) %>%
  distinct()
```

This is what it looks like:

```{r}
districts
```

### Province name missing for some districts of Hanoi

In the colocation data, the `name_stack_*` variables contains the names of the
province and the district separated by ` // `. The problem is that there are a
number of districts that do not have ` // ` in their `name_stack` variable and
all of them seem to be in the province of Hanoi:

```{r}
plot(st_geometry(vn0), col = "grey")

vn1 %>%
  filter(NAME_1 == "Hà Nội") %>%
  st_geometry() %>%
  plot(add = TRUE, col = "yellow")

districts %>% 
  filter(! grepl(" // ", name_stack)) %>% 
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>% 
  st_geometry() %>% 
  plot(add = TRUE, col = "red")
```

### Separating province name from district name in the colocation data

Plus a number of names fixes:

```{r}
districts %<>% 
  separate(name_stack, c("province", "district"), " // ") %>% 
  mutate(indicate = is.na(district),
         district = ifelse(indicate, province, district),
         province = ifelse(indicate, "Hanoi City", province) %>% 
           str_squish() %>% 
           str_remove(" Province| City") %>% 
           str_replace("-", " - ") %>% 
           str_replace("Da Nang"     , "Đà Nẵng") %>%
           str_replace("Hanoi"       , "Hà Nội") %>% 
           str_replace("Hai Phong"   , "Hải Phòng") %>%
           str_replace("Ho Chi Minh" , "Hồ Chí Minh") %>%
           str_replace("Hòa Bình"    , "Hoà Bình"),
         polygon_name = str_squish(polygon_name) %>% 
           str_replace("Thành Phố Cao Lãnh", "Cao Lãnh (Thành phố)") %>% 
           str_replace("Thị Xã Hồng Ngự", "Hồng Ngự (Thị xã)") %>% 
           str_remove("Huyện |Thành phố |Thị xã |Quận |Thành Phố |Thị Xã ") %>%
           str_replace("Quy Nhơn"    , "Qui Nhơn") %>% 
           str_replace("Đảo Phú Quý" , "Phú Quí") %>% 
           str_replace("Bình Thủy"   , "Bình Thuỷ") %>% 
           str_replace("Hòa An"      , "Hoà An") %>% 
           str_replace("Phục Hòa"    , "Phục Hoà") %>% 
           str_replace("Thái Hòa"    , "Thái Hoà") %>% 
           str_replace("Hạ Hòa"      , "Hạ Hoà") %>% 
           str_replace("Phú Hòa"     , "Phú Hoà") %>% 
           str_replace("Tây Hòa"     , "Tây Hoà") %>% 
           str_replace("Tuy Hòa"     , "Tuy Hoà") %>% 
           str_replace("Krông Ana"   , "Krông A Na") %>% 
           str_replace("Krông A Na"  , "Krông A Na") %>% ##
           str_replace("Krông Păk"   , "Krông Pắc") %>% 
           str_replace("Krông Pắc"   , "Krông Pắc") %>% ##
           str_replace("Đắk Glong"   , "Đăk Glong") %>% 
           str_replace("Đắk Rlấp"    , "Đắk R'Lấp") %>% 
           str_replace("A Yun Pa"    , "Ayun Pa") %>% 
           str_replace("Từ Liêm"     , "Nam Từ Liêm") %>% 
           str_replace("Kiến Thụy"   , "Kiến Thuỵ") %>% 
           str_replace("Thủy Nguyên" , "Thuỷ Nguyên") %>% 
           str_replace("Vị Thủy"     , "Vị Thuỷ") %>% 
           str_replace("Bác Ai"      , "Bác Ái") %>% 
           str_replace("Thanh Thủy"  , "Thanh Thuỷ") %>% 
           str_replace("Yên Hưng"    , "Quảng Yên") %>% 
           str_replace("Na Hang"     , "Nà Hang") %>% 
           str_replace("Mù Cang Chải", "Mù Căng Chải") %>%
           str_replace("M`Đrắk"      , "M'Đrắk") %>% 
           str_replace("Cư M`Gar"    , "Cư M'gar") %>% 
           str_replace("Ea H`Leo"    , "Ea H'leo") %>% 
           str_replace("Nam Từ Liêm" , "Từ Liêm") %>% 
           str_replace("Buôn Hồ"     , "Buôn Hồ"),
         polygon_name = ifelse(province == "Bạc Liêu" & polygon_name == "Hòa Bình",
                               "Hoà Bình", polygon_name)) %>% 
  select(-indicate, -district) %>% 
  rename(district = polygon_name)
```

The warnings are produced when the provinces names are missing for some of the
districts of Hanoi. Some districts do not have any information in the colocation
data:

```{r}
anti_join(districts, vn2, c("province", "district"))
anti_join(vn2, districts, c("province", "district"))
```

Let's map these districts that never have any data:

```{r warning = FALSE}
vn0 %>% 
  st_geometry() %>% 
  plot(col = "grey")

tmp <- vn2 %>% 
  mutate(pd = paste(province, district)) %>% 
  filter(! pd %in% with(districts, paste(province, district)))

tmp %>% 
  st_geometry() %>% 
  plot(add = TRUE, col = "red")

tmp %>%
  st_centroid() %>% 
  st_coordinates() %>% 
  as_tibble() %>% 
  filter(between(Y, 15, 20.5)) %>%
  st_as_sf(coords = c("X", "Y")) %>% 
  plot(add = TRUE, col = "red")
```

Let's add these 5 districts to `districts`:

```{r warning = FALSE}
tmp <- vn2 %>% 
  mutate(pd = paste(province, district)) %>% 
  filter(! pd %in% with(districts, paste(province, district)))

districts <- tmp %>%
  st_centroid() %>% 
  st_coordinates() %>% 
  as_tibble() %>% 
  transmute(lon = X, lat = Y) %>% 
  bind_cols(tmp) %>% 
  mutate(polygon_id = 850001:850005) %>% 
  select(polygon_id, district, lon, lat, province) %>% 
  bind_rows(districts)
```

This is what it looks like now:

```{r}
districts
```

Let's merge this information with the polygon / census data:

```{r}
vn2 %<>% 
  left_join(districts, c("province", "district")) %>% 
  select(polygon_id, province, district, n, area_km2, lon, lat, geometry)
```

and this what it looks like now:

```{r}
vn2
```

## Exploring the colocation data

### Coverage: comparing facebook population with census population

The following function combines the facebook data with the GADM / census data
for each week:

```{r}
dist_fb_populations <- function(x) {
  x %>%
    select(polygon1_id, fb_population_1) %>% 
    distinct() %>% 
    right_join(select(st_drop_geometry(vn2), -area_km2), c("polygon1_id" = "polygon_id"))
}
```

Let's compute it:

```{r}
tmp <- map(colocation, dist_fb_populations)
```

The facebook population of each district does not seem to change as a function
of time:

```{r}
xs <- ymd(names(colocation)) - 6
plot(xs, seq_along(xs), ylim = c(0, 5), type = "n")

tmp %>% 
  map_dfc(pull, fb_population_1) %>% 
  t() %>% 
  as.data.frame() %>% 
  map(log10) %>% 
  walk(lines, x = xs, col = adjustcolor("black", .25))
```

And the distribution accross district looks like this:

```{r}
tmp %>% 
  map(mutate, prop = fb_population_1 / n) %>% 
  first() %>%
  pull(prop) %>%
  hist2(50, xlab = "facebook coverage", ylab = "number of districts")
```

Let's look at the facebook coverage as a function of the population size for the
first week of the colocation data:

```{r}
tmp <- colocation$`2020-03-03` %>%
  select(polygon1_id, fb_population_1) %>% 
  distinct() %>% 
  right_join(select(st_drop_geometry(vn2), -area_km2), c("polygon1_id" = "polygon_id"))
  
summary(lm(log10(fb_population_1) ~ log10(n), tmp))

with(tmp, plot(log10(n), log10(fb_population_1), col = "blue", axes = FALSE,
               ylim = c(1, 4.5), xlim = c(3.8, 6),
               xlab = "district population", ylab = "facebook population"))
axis(1, 4:6, format(10000 * c(1, 10, 100), big.mark = ",", scientific = FALSE, trim = TRUE))
axis(2, 1:4, format(10 * c(1, 10, 100, 1000), big.mark = ",", scientific = FALSE, trim = TRUE))
```

Meaning that the facebook coverage increases with population size.

### The number of non-null links per district

There is no missing value and no zeros in the link variable:

```{r}
link_val <- map(colocation, pull, link_value)
link_val2 <- unlist(link_val)
any(is.na(link_val))
range(link_val)
```

A function that computes the number of non-null links per district:

```{r}
nb_links <- function(x) {
  x %>%
    group_by(polygon1_id) %>%
    tally() %>% 
    right_join(select(districts, c("polygon1_id" = "polygon_id"))) %>% 
    mutate(n = replace_na(n, 0)) %>% 
    pull(n)
}
```

Let's look at the distribution of the number of non-null links per district as
a function of time:

```{r}
cols <- RColorBrewer::brewer.pal(9, "Blues")

ld <- ymd(c(20200401, 20200423)) # dates of the lockdown
xlim <- ymd(c(20200229, 20200701))
ys <- c(0, 700)

tmp <- colocation %>% 
  map(nb_links) %>%
  map_dfc(quantile, 0:10 / 10) %>% 
  t() %>% 
  as.data.frame()

xs_tr <- c(xs[1], rep(xs[-1], each = 2), last(xs) + mean(diff(xs)))
xs2 <- c(xs_tr, rev(xs_tr))

plot(xs, seq_along(xs), ylim = c(0, 700), type = "n", xlim = xlim,
     xlab = NA, ylab = "number of non-null links per district")

for (i in 1:5)
  polygon(xs2, c(rep(tmp[[i]], each = 2), rev(rep(tmp[[12 - i]], each = 2))),
          col = cols[i], border = NA)

lines(xs_tr, rep(tmp[[6]], each = 2), col = cols[6], lwd = 2)
abline(h = nrow(districts), col = "grey", lty = 2)
polygon(c(ld, rev(ld)), rep(ys, each = 2), col = adjustcolor("red", .1), border = NA)
```

where the red area materializes the lockdown.

### Probability per links

Let's look at the sum of the probability per week:

```{r}
plot(xs, map_dbl(link_val, sum), type = "s", col = "blue", xlim = xlim,
     xlab = NA, ylab = "sums of probabilities", lwd = 2)
polygon(c(ld, rev(ld)), rep(ys, each = 2), col = adjustcolor("red", .25), border = NA)
```

The 10th week looks suspicious.

## Rearranging the colocation data into a matrix

Let's generate a template with all the combinations of districts:

```{r}
template <- districts %>% 
  arrange(polygon_id) %$%
  expand.grid(polygon_id, polygon_id) %>% 
  as_tibble() %>% 
  setNames(c("polygon1_id", "polygon2_id"))
```

The function that transforms the colocation data into a matrix:

```{r}
to_matrix <- function(df, template) {
  dim_names <- sort(unique(template$polygon1_id))
  df %>% 
    select(polygon1_id, polygon2_id, link_value) %>% 
    left_join(template, ., c("polygon1_id", "polygon2_id")) %>%
    mutate(link_value = replace_na(link_value, 0)) %>% 
    pull(link_value) %>% 
    matrix(nrow(districts)) %>%
    `colnames<-`(dim_names) %>% 
    `rownames<-`(dim_names)
}
```

Let's do it for all the weeks and sum them:

```{r eval = FALSE}
coloc_mat <- colocation %>% 
  map(to_matrix, template) %>% 
  reduce(`+`)
```

```{r include = FALSE, eval = FALSE}
saveRDS(coloc_mat, "coloc_mat.rds")
```

```{r include = FALSE}
coloc_mat <- readRDS("coloc_mat.rds")
```

Let's have a look at this matrix. Let's first order the district from south to
north and from west to east:

```{r}
hash <- setNames(seq_along(colnames(coloc_mat)),
                           colnames(coloc_mat))
ind <- districts %>% 
  arrange(lat, lon) %>% 
  pull(polygon_id) %>% 
  as.character() %>% 
  magrittr::extract(hash, .)

coloc_mat <- coloc_mat[ind, ind]
```

Let's now plot the matrix:

```{r margin1 = FALSE, margin3 = TRUE}
opar <- par(pty = "s")
image(log10(apply(t(coloc_mat), 2, rev)), axes = FALSE)
par(opar)
box(bty = "o")
```

We can verify that the matrix is symmetric and that both Hanoi and Saigon are
well connected to everywhere in the country. We can see also the district that
are not connected at all.

### Subsetting according to coordinates

Let's say we want to select all the provinces from the Northern EPI, the
southernmost province of which is Nghệ An. Let's retrieve the latitude of the
centroids of all the provinces:

```{r}
tmp <- vn1 %>% 
  st_centroid() %>% 
  st_coordinates() %>% 
  as_tibble() %>% 
  pull(Y) %>% 
  mutate(vn1, lat_cent = .) %>% 
  select(NAME_1, lat_cent)
```

The province south of Nghệ An is Hà Tĩnh, the latitude of centroid of which is:

```{r}
threshold <- tmp %>% 
  filter(NAME_1 == "Hà Tĩnh") %>% 
  pull(lat_cent)
```

Now, we retrieve all the names of all the provinces that are north of this
threshold:

```{r}
northernEPI <- tmp %>% 
  filter(lat_cent > threshold) %>% 
  pull(NAME_1)
```

Now, we retrieve the corresponding districts' ID:

```{r}
sel <- districts %>% 
  filter(province %in% northernEPI) %>% 
  pull(polygon_id)
```

And now we can subset our matrix:

```{r}
sel <- colnames(coloc_mat) %in% sel
coloc_mat_northernEPI <- coloc_mat[sel, sel]
```

Which gives:

```{r margin1 = FALSE, margin3 = TRUE}
opar <- par(pty = "s")
image(log10(apply(t(coloc_mat_northernEPI), 2, rev)), axes = FALSE)
par(opar)
box(bty = "o")
```

## Exploring the colocation data


```{r}
colocation$`2020-03-03` %>% 
  select(polygon1_id, link_value, fb_population_1) %>% 
  group_by(polygon1_id) %>% 
  summarise(link   = sum(link_value),
            fb_pop = unique(fb_population_1)) %>%
  map(log) %$% 
  plot(fb_pop, link)
```


### Gravity model

One model fairly used in geography is the so-called gravity model that says that
the connection between any 2 locations $i$ and $j$ is:

$$
C_{ij} = \theta\frac{P_i^{\tau_1}P_j^{\tau_2}}{d_{ij}^\rho}
$$
See for example
[10.1126/science.1125237](https://science.sciencemag.org/content/312/5772/447.abstract).

*TO DO:* test this model on the facebook data.

### Effective distance

See [10.1126/science.1245200](https://science.sciencemag.org/content/342/6164/1337)