/
RGIS4_webAPIs_part1.Rmd
282 lines (169 loc) · 14.9 KB
/
RGIS4_webAPIs_part1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
---
title: "Web APIs and the Basics of Getting Internet Data"
author: "Nick Eubank"
output:
html_document:
toc: true
toc_depth: 4
theme: spacelab
mathjax: default
fig_width: 6
fig_height: 6
---
```{r knitr_init, echo=FALSE, cache=FALSE, message=FALSE,results="hide"}
library(knitr)
library(rmdformats)
## Global options
options(max.print="75")
opts_chunk$set(echo=TRUE,
cache=TRUE,
prompt=FALSE,
tidy=TRUE,
comment=NA,
message=FALSE,
warning=FALSE)
opts_knit$set(width=75)
```
***
This tutorial introduces the idea of a Web API -- a special way of getting information from services like google maps or twitter. Web APIs are often useful in GIS analysis for a number of tasks, like converting street addresses or town names into latitudes and longitudes, or downloading census data from R.
This tutorial also provides some background in the hopes of demystifying the API, whose name suggests a sophisticated tool that those with limited programming experience may find daunting, but which are often just stripped down versions of the web-pages you are used to using.
***
# 1. Demystification
It's no wonder most people think of APIs as a little mysterious and daunting. Ask most people what an API is and they will either give you a blank stare or say "why an Application Programming Interface, of course." And you can usually stump the person who said "an Application Programming Interface" by following-up your question with "but what's an Application Programming Interface?"
This section provides an overview of APIs in general and Web APIs in particular. This is probably a little more background than is necessary for most purposes, but I have found that people tend to shy away from using APIs because they seem mysterious, and I'd like to help fight that conception. The idea of an API is actually to make your life simpler, and the concept of an API can be very powerful for understanding the modern world.
## 1.1 What is an API?
API is an (uninformative) acronym for "Application Programming Interface".
It is often the case that a user working in one programming environment (like R) would like to take advantage of tools written in a different language. The goal of an API is to facilitate this by acting like as a cross between a translator and a messenger between two different programs.
Let's us an example of an API that you are already quite familiar with, even if you didn't realize it. The `rgeos` library from the RGIS2 tutorial is actually just an API for a program called GEOS written in a language called C++. When people started realizing that they wanted to do GIS analysis in R, the realized that it didn't make sense to write a whole new set of tools in R for geometric operations when there was already a really sophisticated program written for this exact purpose. At the same time, however, they didn't want to require R users to export their data to the harddrive, open a different program in a language they might not know, use that program to execute a calculation, then re-import the data.
In steps the API, `rgeos`. When you use `rgeos`, here's what's really happening:
* You tell `rgeos` -- in R -- what you would like to happen and give it the data you want to manipulate,
* `rgeos` translates those commands into commands that `GEOS` can understand,
* `rgeos` converts your data to data that `GEOS` can use.
* `GEOS` then does its analysis, and gives the results to `rgeos`.
* `rgeos` then converts those results back to R data, and gives it back to you!
That's it. It's just a middle man who "speaks" both R and GEOS, and who's willing to run back and forth between these programs to make your life easier!
(Wanna know a secret? Most good libraries in R are actually just APIs for libraries written in C++!)
## 1.2 What is a Web API?
A web API is just a special kind of API that stands between an internet browser and the servers of a company like Google or Twitter. It's job is to accept requests for data written in HTML, convert them into whatever language a company's servers use, get the result requested, and return it to the user in a format that's easy to work with.
This last point is key -- most of the time, what a Web API does is take a web-page you are familiar with (like google maps) and strip away everything that is distracting to a computer program, like nice pictures and fancy formatting.
### The Request
The way your computer accesses information on the internet is by composing a request in the form of a web address, known formally as a URL. Most of the time, however, this happens without you know it. For example, if you type "42" into google and click search, the result just appears. But look in the address bar above, and you can see the exact code used to get you those search results, which is sometimes as simple as `https://www.google.com/#q=42` (though, depending on your computer, it may be much longer). Basically, when you click buttons in your browser, the browser converts your clicks into a URL and sends that request out to the internet.
With a Web API, we skip the step where a user clicks a button in the browser with their mouse, and instead just create customized URLs to ask for the data we need. This is a little less intuitive for humans, but is much easier for computers.
**URL Components**
There are a couple specific components of most URLs, and understanding these components will be helpful for both web APIs, and in other situations like web-scraping. For example, consider the following link to a youtube video: [https://www.youtube.com/watch?v=dQw4w9WgXcQ](https://www.youtube.com/watch?v=dQw4w9WgXcQ).
* **https** : This is the protocol you're saying you want to use to talk to the server.
* **www.youtube.com** : This is the server you want to pass your message to.
* **/watch** : is how you tell the computer the general type of request you're making (in this case, you want to go directly to a video).
* **?v=dQw4w9WgXcQ** : this is what's called a "Query String".
Query Strings are how you send extra information to an internet server. They generally start with a `?` or a `#`, and are followed by a number of variable-value pairs. In this case, for example, you're saying that you want youtube to know that in your request, the variable `v` has the value `dQw4w9WgXcQ`" (the internal name for this particular video).
So basically, this youtube link says: "Hey, Youtube -- I would like to access the program you keep in your `/watch` directory, and when you call that program, please tell it that the variable `v` should be set to `dQw4w9WgXcQ`."
In this way, a URL is a lot like a function call in R where the query string contains the arguments you are passing to the function!
### The Response
The main value of a Web API is that the response it provides to a query is written in a stripped-down format that's easy for a computer to understand. Consider the two following results provided by google maps for the search term "Kalamazoo, MI", one through google maps and one through the google maps API:
![maps](images/maps_noapi.png) ![API](images/maps_api.png)
The figure on the left -- the "human-readable" result -- is easy to look at, but consider how hard it would be to tell a computer to zero in on the one piece of data you want and to ignore the map background, the Google Earth button, the photos from Kalamazoo, the various options for modifying the map, etc.
The figure on the right, by contrast, is just text, and is organized in a format (called JSON) that's very easy to convert into an R list or DataFrame.
## 1.3 Recap of APIs
So, to recap:
* **What is an API?** A tool that allows a user of one program to easily use a different program potentially in a different language.
* **What is a Web API?** A tool that allows users to make requests for data over the internet through carefully formatted URLs and to get responses that are easy to use, regardless of the system being used to actually manage that data on the server.
***
# 2. Querying an API
As noted above, querying an API is the same as openning a website with a specific URL. For example, you can see the results of querying the google geocoding API by just typing the following link into your browser: `https://maps.googleapis.com/maps/api/geocode/json?address=1600+pennsylvania+avenue,+washington+dc`
This can actually very helpful for debugging, but in practice we want to query URLs from inside R. To do this, we use the `RCurl` library.
```{r}
library(RCurl)
url = "https://maps.googleapis.com/maps/api/geocode/json?address=1600+pennsylvania+avenue,+washington+dc"
result <- getURL(url)
# Get first 100 characters so don't print too much
short.results <- substring(result, 1, 400)
cat(short.results)
```
Note the use of `cat` to print the result -- `getURL` will encode things like new-lines as `\n` which the normal print command won't recognize.
***
# 3. Parsing the Results
Congratulations! You know how to load a webpage in R! Now the only problem is you have no idea how to read the result.
## 3.1 JSON
Increasingly, the most common format for web APIs is what is called JSON. JSON's popularity is due in large part to it's flexibility, although as we will see this is a double edged sword.
There are two main data structures in JSON: named lists and unnamed lists (technically called Arrays).
Here's a **named list** with two items named "school name" and "school mascot":
```
{"school name": "stanford", "school mascot": "cardinal"}
```
Note the curly brackets and colons between names and values.
Here's an **unnamed list** with two items:
```
["stanford", "cardinal"]
```
Note the square brackets.
The most important thing about these structures is that they can be nested in one another. For example, either of the following are valid JSON objects:
```
{"school names":["stanford", "berkeley"], "school mascots":["cardinal", "bears"]}
# Or
[
{"school name": "stanford", "school mascot": "cardinal"},
{"school name": "berkeley", "school mascot": "bears"}
]
```
That makes JSON format what is called a **recursive** data structure, meaning that an entry in a JSON list can itself be a JSON list. This is fundamentally different from almost every data structure in R, and likely most data structures you've thought about. One cannot place a vector in a vector in R, a matrix in a matrix, or a DataFrame in a DataFrame. One also can't put an excel table inside the cell of an excel table, or a .csv inside an entry in a .csv. Indeed, the only structure in R that is analogous to this is a `list`, which most R users tend to not use very much.
### Converting JSON formats
Since we are most used to working with data frames and vectors, there's a library that tries to wrangle JSON files into traditional R formats called `jsonlite`. However, it is important to know that it can only do so much -- if you have a JSON with lists nested in lists forever, it can be very hard to convert that structure into a DataFrame!
jsonlite's goal is basically to:
* convert unnamed lists of named lists into DataFrames,
* and unnamed arrays into vectors,
defaulting to using lists when it has no choice.
```{r}
library(jsonlite)
# A simple example. Note that "state" only exists in second item, so becomes NA in output.
example.json <- '[
{"school name": "stanford", "school mascot": "cardinal"},
{"school name": "berkeley", "school mascot": "bears", "State":"CA"}
]'
fromJSON(example.json)
```
```{r}
# Not a very successful conversion...
example.json2 <- '{"school names":["stanford", "berkeley"], "school mascots":["cardinal", "bears"]}'
fromJSON(example.json2)
```
To make this kind of query in R, we'll use the `fromJSON` command in the `jsonlite` library to make the query and convert the result into a more straightforward R format. We'll talk more about JSON file formats below, but here we see that fromJSON converts the results to a DataFrame (actually, a list with one DataFrame called "results" and a simple vector called "status", but we only really care about the former).
In addition, we can use these tools to geocode a huge number of places from a DataFrame. Considering the following toy-data:
```{r}
library(dplyr)
toy.data <- data.frame(addresses=c("denver, CO","881 7th Ave, New York, NY 10019", "Encina Hall, 616 Serra St, Stanford, CA 94305"))
my.geocoder <- function(address){
address.no.spaces <- gsub(" ", "+", address)
url = paste0("https://maps.googleapis.com/maps/api/geocode/json?address=", address.no.spaces)
print(url)
fromJSON(url)$results
}
apply(toy.data, 1, my.geocoder)
```
## 2.2 GeoJSON
## 2.3 XML
# 3. Specific APIs
## 3.1 Google Maps Geo-Coder
One of the most commonly used APIs for GIS work is the google maps geocoding API. Google maps takes any type of query you would type into Google Maps and returns information on Google's best guess for the latitude and longitude associated with your query.
```{r}
library(ggmap)
addresses <- c("1600 Pennsylvania NW, Washington, DC", "denver, co")
locations <- geocode(addresses, source="google", output = "more")
locations
```
### Quality
One of the most important fields in the results you get from the google geo-coder is called "geometry.location_type" -- this tells you how precise google's guess is. In order of precision, [the possible values for this field are](https://developers.google.com/maps/documentation/geocoding/intro#results):
* "ROOFTOP" indicates that the returned result is a precise geocode for which we have location information accurate down to street address precision.
* "RANGE_INTERPOLATED" indicates that the returned result reflects an approximation (usually on a road) interpolated between two precise points (such as intersections). Interpolated results are generally returned when rooftop geocodes are unavailable for a street address.
* "GEOMETRIC_CENTER" indicates that the returned result is the geometric center of a result such as a polyline (for example, a street) or polygon (region).
* "APPROXIMATE" indicates that the returned result is approximate.
## 2.2 US Census Shapefile API
Kyle Walker and Bob Rudis have recently published an API that makes downloading shapefiles from the census agency incredibly easy. [A full description of their library is here](http://rpubs.com/walkerke/tigris01), but here's a short overview.
```{r, warning=FALSE, message=FALSE, eval=FALSE}
library(tigris)
library(sp)
# Get census tracts with tracts
dfw <- tracts(state = 'TX', county = c('Dallas', 'Tarrant'))
# Counties with counties. While not always available, they can be found for many places at either 5 meter resolution 20 meter resolution.
texas <- counties(state = 'TX', cb = TRUE, resolution = '20m')
```
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.