RGIS2_MergingSpatialData_part2_GeometricManipulations.Rmd

---
title: "Merging Spatial Data"
author: "By Nick Eubank, building off excellent tutorials by Claudia Engel"
output:
  html_document:
    toc: true
    toc_depth: 4
    theme: spacelab
    mathjax: default
    fig_width: 6
    fig_height: 6
---


```{r knitr_init, echo=FALSE, cache=FALSE, message=FALSE,results="hide", warning=FALSE}
library(knitr)
library(rmdformats)
## Global options
options(max.print="75")
opts_chunk$set(echo=TRUE,
	             cache=TRUE,
               prompt=FALSE,
               tidy=TRUE,
               comment=NA,
               message=FALSE,
               warning=FALSE)
opts_knit$set(width=75)

library(rgdal)
library(sp)
library(rgeos)
library(plyr)
library(raster)


setwd("~/dropbox/gis_in_r")

pk.dist <- readOGR(dsn = "RGIS2_data/shapefiles", layer="pk_districts")
strikes <- readOGR(dsn = "RGIS2_data/shapefiles", layer="pk_drone_strikes")
dist.crs <-CRS(proj4string(pk.dist))
strikes.projected <- spTransform(strikes, dist.crs)
pk.pop <- raster("RGIS2_Data/pakp00g")
new.crs <- CRS(projection(pk.pop))
pk.dist.rasterproj <- spTransform(pk.dist, new.crs)


#setwd("E:/Users/neubank/dropbox/gis_in_r")
districts<-readOGR("RGIS2_Data/shapefiles",'congressional_districts')
grants<-readOGR("RGIS2_Data/shapefiles",'federal_grants')
newcrs<-CRS(proj4string(districts))
grants.newproj <- spTransform(grants, newcrs)


```

***

When working with spatial data, one is rarely interested in working with only one source of data. This tutorial will introduce a set of tools for linking vector data with other data sources. It begins by introducing how to link spatial vector data with non-spatial data from in table format, then turns to the problem of linking multiple sources of spatial data through spatial joins and intersects. 

This tutorial uses the `sp`, `rgdal`, and `raster` libraries from the `RGIS1` tutorial. If you have not yet installed those, please revisit that tutorial for directions. In addition, this tutorial will also make use of the `rgeos` library, installation of which is discussed in `part0_setup`. 

***

# 1. RGEOS Overview

The `over` command will cover a remarkable number of situations, but it is really only interesting if two shapes intersect exactly. In practice, however, we are often interested in sligthly more flexible questions: how many cities are *within 10km* of a drone strike? Or how many people live close to a government project? 

When we want to do fancer geometric operations, we use the `rgeos` library. `rgeos` (which stands for "R interface to the *Geometry Engine - Open Source* library") is a set of tools for geometric operations. `GEOS` a huge library, and one that underlies many spatial tools (not just in R). Whenever you're thinking about a geometric operation, `rgeos` is the first place to look.


`rgeos` tools can be broadly divided into three camps. Here some of the most commonly used ones. 

### Calculating Properties

* **gArea**: calculate area of a shape
* **gLength**: calculates length of line / circumference of polygons
* **gDistance**: distance between items

### Making New Shapes

* **gBuffer**: Expand points into circles of given radius
* **gCentroid**: Collapse polygons to their centroids
* **gDifference**, **gUnion**, and **gIntersection**: execute set operations on polygons
* **gUnionCascaded**: dissolves a collection of shapes into a single shape. 
* **gSimplify**: If your polyon is too high a resolution (higher than needed for analysis, and too high for fast processing), reduces the number of vertices while maintaining shape to the best of it's ability. 


### Testing Geometric Relationships

* **gIntersects**: test if shapes intersect. Primarily useful for testing whether two polygons intersect, since this is not something`over` can do. 
* **gContains**: Is one spatial object entirely within another?
* **gIsValid**: Very useful -- make sure your geometries aren't corrupt!


# 2. Key RGEOS Concepts

The syntax for these tools varies both across tools and depending on the input types, there are three concepts that come up constantly.

## 2.1: Units
`rgeos` isn't a spatial library -- it just does geometry. It takes x-y coordinates and applies geometric formula. Thus the units will always be the units of the x-y coordinates, which come from your projection. So here, we can check the projection to find our units:

```{r}
proj4string(districts)
```
The units (`+units=m`) are meters. 


## 2.2: byid
`byid` is an option on most `rgeos` commands. If `byid=TRUE`, each observation in a Spatial* object is handled separately; if `byid=FALSE`, a Spatial* object is treated as one big geometry. So if one were working with a `SpatialPolygons` object of US states, when `byid=TRUE`, the analysis would be conducted for each state; if `byid=FALSE`, it would essentially run the analysis against the United States as a whole. 

Intuitively, the distinction can often be thought of as the difference between asking whether about a spatial relationship between *each* observation in the Spatial* object on which the tool is being executed, or *any* observation in the Spatial* polygon. 

To see this illustrated, let's consider our data we used before on the location of government grants. Suppose we want to identify regions that are within 7km of government grants. We could either ask for the area within 7km of *each* government project (which would yield a different answer for each grant), or the area within 7km of *any* government grant (which would yield one large area). You can see this distinction in the following plots:

```{r}

buffered.grants.byidTRUE <- gBuffer(grants.newproj, width=7000, byid=TRUE)
buffered.grants.byidFALSE <- gBuffer(grants.newproj, width=7000, byid=FALSE)

par(mfcol=c(1,2))
plot(districts, main="byid TRUE")
plot(buffered.grants.byidTRUE, col='blue', add=T)

plot(districts, main="byid FALSE")
plot(buffered.grants.byidFALSE, col='blue', add=T)

```

As seen in the figures, when `byid=TRUE`, a separate buffer polygon is created around each grant, creating a large number of (overlapping) polygons. When `byid=FALSE`, there are no discrete polygons, just a single feature that covers all points within 7km of **any** government grant.  Indeed, if we look at the summary of these buffers, we see this as well -- in the first case, you see that under `features`, the buffers created with `byid=TRUE` have 6 distinct features (observations), while the buffers created with `byid=FALSE` have only 1. 

```{r}
buffered.grants.byidTRUE
buffered.grants.byidFALSE
```

## 2.3: id

The output of almost all `rgeos` commands will be organized by `id`. For example, with `gArea`, each area has a label: 0, 1, and 2. When working with Spatial*DataFrame objects, `id` will be the rowname for a given observation, just like with tools we've worked with before.

To see an object's `id`s, you can use the combination of the `names` command and the `geometry` command (if you don't use the `geometry` command you'll get column names for objects with associated DataFrames:

```{r}
names(geometry(districts))
```


Things get a little more complicated when `rgeos` creates new polygons using tools like `gBuffer` or `gIntersect`. `rgeos` tends to try and do smart things, but the behavior does vary across tools, so always be careful. 

Here's an example -- let's create buffers (polygons of fixed radius) around our grants the buffers 7km, so we get back polygons for all points within 7km of each project. 

```{r}
proj4string(grants.newproj) # check units!
buffered.grants <- gBuffer(grants.newproj, width=7000, byid=TRUE)
names(geometry(buffered.grants))

plot(districts)
plot(buffered.grants, col='blue', add=T)
```

Note that points has given rise to a new polygon, many of them overlapping, and they were given simple names based on the names of the points that generated them. However, if we want a polygon of points within 7km of *any* grant, we can pass `byid=FALSE`. Now we have one polygon (named `buffer`) instead of 5. 

```{r}
buffered.grants <- gBuffer(grants.newproj, width=7000, byid=FALSE)
names(geometry(buffered.grants))

plot(districts)
plot(buffered.grants, col='blue', add=T)
```


***

## Exercise 1

**Answers below, but no cheating!**

Let's try and figure out what percentage of residents of Pakistans Federally Administered Tribal Areas (FATA, the regions with the most drone strikes) live within 7km of a drone strike. 

#. Subset the pk.dist data for the Province of Fata.

#. Create a buffer of 7km with the `gBuffer` command to help us estimate the total area that is within 7km of *any* drone strike (what value of `byid` should you use?).

#. Plot your buffers over the districts so you can see them.  

***

# 3. Set operations

`rgeos` can also do set operations on polygons, like returning areas in which different polygons coincide, don't coincide.

This can be useful for lots of things. For example, the figure of grant buffers above looks a little odd because the buffered polygons spread into the ocean. We can fix this by "clipping" the buffered polygons using `gIntersection` -- essentially keeping only the part of the buffer that also coincides with districts.

```{r}
intersection <- gIntersection(buffered.grants, districts, byid=TRUE)
plot(districts)
plot(intersection, col='blue', add=T)
```

Interestingly, note that while `buffered.grants` was previously one feature, because we passed `byid=TRUE`, the intersection with the districts executed the intersection for each congressional district, creating three distinct polygons.

```{r}
intersection
```

## 3.1 Managing Output from Set Operations

One challenge with set operations is that the number of features generated by these functions has no mechanical relationship to the number of features that were passed to the set function as inputs. As a result, we can't just `cbind` results back into the original data frame as we've been able to do in past tutorials because outputs may be of entirely different lengths than the inputs! 

`rgeos` tries to address this problem by generating new `id`s for the output of set operations that are somewhat intuitive. In general, new `id`s are concatenations of the `id`s of the objects that gave rise to a new observation. For example, if you ran `gIntersection` and one output polygon was created from the overlap of a polygon with the `id` of `8` from the first Spatial* object passed and a polygon with the `id` `z` from the second Spatial* object, that new polygon would have the `id`: `8 z`. 


We can see this behavior in the `id`s of polygons generated from the intersection we just executed between `buffered.grants` and `districts`:
```{r}
names(geometry(buffered.grants))
names(geometry(districts))
names(geometry(intersection))
```

The challenge with this is that these concatenated `id` variables aren't always easy to merge with the DataFrames of your original Spatial* objects. For example, say we wanted to measure the share of each electoral district that is within 7km of a government grant. To do so, we would want to divide the area within 7km of each grant in each district by the total area of each district. The problem is that if we measure the area within 7km of each grant -- the area of the shapes in the `intersection` object -- we get back a list organized by these concatenated `id`s. 

```{r}
gArea(intersection, byid=TRUE)
```

The following code snippet will convert lists like this into a DataFrame, and will split these concatenated `id` variables into two columns -- one for each component of the `id` -- to make merging easier. This is something you chould be able to copy-and-paste after almost any set operation. 

```{r}
# Run calculation
result <- gArea(intersection, byid=TRUE)

# Convert from list to DataFrame and name result
result <- as.data.frame(result)
colnames(result) <- c("area_near_grants")

# ids are stored as row-names, so make into a column.
ids <- as.character(rownames(result))

# Split the id into two columns based on the space in the middle. 
# Note the one place this could break is if the original names had spaces...
new.id.columns <- t(as.data.frame(strsplit(ids, " ")))
colnames(new.id.columns) <- c("id1", "id2")

# Now your results are a data.frame with separate columns for each id component!
result <- cbind(result, new.id.columns)
result

```

Now you can use this reformated data to merge with the original data on districts:
```{r}

# Move the rownames (which form the ids) into a column for merging
districts$id2 <- rownames(as.data.frame(districts))
districts <- merge(districts, result, by="id2", type="left")
as.data.frame(districts)
```

Now we can calculate the share of each district within 7km!

```{r}
districts$area_near_grants / districts$Shape_Area
```


## 3.2 `byid` Issues

One thing to be aware of is that `rgeos` sometimes struggles when `byid` is set to `FALSE`. For example, if we want to clip our buffered polygon by the extent of all electoral districts -- but not add cuts where the polygon crosses the lines between districts -- one might try the following code:

The following code:
```{r, eval=FALSE}
gIntersection(buffered.grants, districts, byid=FALSE)
```

But oddly, this generates the error:
`Error in RGEOSBinTopoFunc(spgeom1, spgeom2, byid, id, drop_lower_td, "rgeos_intersection") :` 
`  TopologyException: no outgoing dirEdge found at 570113.57942913717 4148852.3204373871`

This kind of problem will come up from time to time -- my general solution is to first dissolve the layer we want to consider as one polygon into a single polygon using `gUnionCascaded`, *then* execute the second command:

```{r}
merged.districts <- gUnionCascaded(districts)
plot(merged.districts)

intersection2 <- gIntersection(buffered.grants, merged.districts, byid=FALSE)
intersection2
plot(districts)
plot(intersection2, col='blue', add=T)
```

Now the intersection is only one part!

# 4. Last Word on `rgeos`

`rgeos` is a very large and powerful library, and as a result you may be feeling a little overwhelmed at the moment. This is normal -- because `rgeos` has so many tools and each one behaves somewhat differently, it is not possible to provide comprehensive training for everything in a single tutorial. 
With that in mind, the aim of this tutorial has been to give you a sense of what (a) what the `rgeos` library has to offer so you know where to turn when you run into a problem, and (b) to highlight aspects of `rgeos` that are common to most tools so you're prepared to figure out new tools within the least difficulty possible. But if you don't feel quite as comfortable as you have after other tutorials, that's prefectly normal. 

***

## Exercise 2. 

**Answers below, but no cheating!**

Previously we created polygons for areas within 7km of a drone strike. Now let's try and estimate the share of each district in FATA that is within 7km of a drone strike. (Note: districts are the administrative level below provinces. There are 13 districts within the province of FATA.)

#. Intersect the buffers with Districts using `gIntersection`. 

#. Measure the share of each district within 7km of *any* drone strike using the `gArea` command for each district, and for the buffered strikes in each district. **When you merge the areas of your intersections back in with districts, be sure to look at that table -- why are there missing values? Do you need to worry about them?**

#. Run this analysis just for strikes in 2012. If you've used ArcGIS before, how would you re-run this analysis in ArcGIS? Would it be easier or harder?


***

**Answers to Exercise 1**


```{r, eval=FALSE}
pk.dist <- readOGR(dsn = "RGIS2_data/shapefiles", layer="pk_districts")
strikes.wrongproj <- readOGR(dsn = "RGIS2_data/shapefiles", layer="pk_drone_strikes")
dist.crs <-CRS(proj4string(pk.dist))
strikes <- spTransform(strikes.wrongproj, dist.crs)

# Subset to Fata
fata <- pk.dist[pk.dist$PROVINCE=='Fata',]
fata$OBJECTID <- NULL # Drop column called objectid -- came from shapefile. Just confuses things. 

# Buffer strikes
proj4string(strikes) # Check projection units
strike.buffer <- gBuffer(strikes, width=7000, byid=FALSE)
strike.buffer.inter <- gIntersection(strike.buffer, fata, byid=TRUE)
```

**Answers to Exercise 2**

```{r, eval=FALSE}

#######
# Merge intersection areas with original areas
#######

# Now to merge back in with Districts based on row numbers
result <- gArea(strike.buffer.inter, byid=TRUE)

# Convert from list to DataFrame
result <- as.data.frame(result)
colnames(result) <- c("area_near_strikes")

# ids are stored as row-names, so make into a column.
ids <- as.character(rownames(result))

# Split the id into two columns based on the space in the middle. 
# Note the one place this could break is if the original names had spaces...
new.id.columns <- t(as.data.frame(strsplit(ids, " ")))
colnames(new.id.columns) <- c("id1", "id2")

# Now your results are a data.frame with separate columns for each id component!
result <- cbind(result, new.id.columns)

fata$id2 <- rownames(as.data.frame(fata))

fata <- merge(fata, result, by="id2", type="left")

#######
# Calculate share of area within strike buffer
#######

# Note first that in places where no part of a district was within 7km
# of a strike, there were no `intersection` polygons. Thus when we 
# merge our data, we get missing for area near strikes. These need to be zeros. 

fata[is.na(fata$area_near_strikes),"area_near_strikes"]<-0
fata$share_covered <- fata$area_near_strikes / fata$Shape_Area
fata$share_covered

# Only for 2012: add: 
#       strikes <- strikes[strikes$year == 2012,]
# To start of code and just rerun!

```

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.