100-dbi-and-sql.Rmd
# DBI package and SQL {#chapter_dbi-package-sql}
> This chapter:
>
> * Introduces more DBI functions and demonstrates techniques for submitting SQL to the dbms
> * Illustrates some of the differences between writing `dplyr` commands and SQL
> * Suggests some strategies for dividing the work between your local R session and the dbms
> **This still points to `dvdrental`**
## Setup
The following packages are used in this chapter:
```{r chapter package list, echo=TRUE, message=FALSE, warning=FALSE}
library(tidyverse)
library(DBI)
library(RPostgres)
library(dbplyr)
library(knitr)
library(bookdown)
library(sqlpetr)
```
If it is not already running, start your `sql-pet` container:
```{r check on sql-pet}
sqlpetr::sp_docker_start("sql-pet")
```
Connect to the database:
```{r connect to postgresql}
# con <- connection_open( # use in an interactive session
con <- dbConnect( # use in other settings
RPostgres::Postgres(),
# without the following (and preceding) lines,
# bigint become int64 which is a problem for ggplot
bigint = "integer",
user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"),
dbname = "dvdrental"
)
```
## SQL in R Markdown
When you create a report that you will run repeatedly, you may want to put the query into R Markdown. See the discussion of [multiple language engines in R Markdown](https://bookdown.org/yihui/rmarkdown/language-engines.html#sql). You can then execute the SQL code in a chunk with the following header:
{`sql, connection=con, output.var = "query_results"`}
```{sql, connection=con, output.var = "query_results"}
SELECT "staff_id", COUNT(*) AS "n"
FROM "rental"
GROUP BY "staff_id";
```
R Markdown stores the query result in the data frame named by `output.var`:
```{r}
query_results
```
## DBI Package
In this chapter we touch on a number of functions from the DBI package. The table in file 96b lists other functions in the package; the Chapter column references the section of the book where each function is used.
```{r}
film_table <- tbl(con, "film")
```
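As a quick illustration of the package's metadata functions, the following sketch explores what the connection can see (it assumes the `dvdrental` database connected to above):

```{r}
# list the tables available through this connection
DBI::dbListTables(con)

# list the columns of the "film" table
DBI::dbListFields(con, "film")

# check that a table exists before querying it
DBI::dbExistsTable(con, "film")
```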
### Retrieve the whole table
SQL code that is submitted to a database is evaluated all at once^[From R's perspective. Actually there are 4 steps behind the scenes.]. To think through an SQL query, you can use `dplyr` to build it up step by step and then convert it to SQL code. Or you can use an IDE such as [pgAdmin](https://www.pgadmin.org/) to develop your SQL code. Once you have the SQL code, the following R code demonstrates how to use `dbSendQuery` to submit SQL from your R environment.
```{r}
result_set <- DBI::dbSendQuery(con, 'SELECT "title", "rental_duration", "length"
FROM "film"
WHERE ("rental_duration" > 5.0 AND "length" > 117.0)')
long_rental_films <- DBI::dbFetch(result_set)
str(long_rental_films)
DBI::dbClearResult(result_set)
```
The `dbFetch` function returns a `data.frame`, so you don't have `dplyr`'s guardrails that manage the amount of data returned to your workspace. You need to manage the amount of data yourself, using the `n = ` argument of `dbFetch` to specify the maximum number of records to retrieve per fetch. In the code above, we did not specify `n`, so `dbFetch` returned _all_ pending records as the default behavior.
When you are finished using the result set object, remember to free all of the associated resources with `dbClearResult`, as shown in the code above for the `result_set` variable.
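When you simply want the entire result in one step, `DBI::dbGetQuery` combines the send, fetch, and clear operations into a single call. The following sketch is equivalent to the three-call sequence above:

```{r}
long_rental_films <- DBI::dbGetQuery(
  con,
  'SELECT "title", "rental_duration", "length"
     FROM "film"
     WHERE ("rental_duration" > 5.0 AND "length" > 117.0)'
)
str(long_rental_films)
```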
### Or a chunk at a time
The following code demonstrates using the `n` argument to `dbFetch` to specify the maximum number of rows to return. Normally, you would pick some fixed number of records to return each time, but this code shows that you can vary the number of records returned by each call to `dbFetch`.
```{r}
result_set <- dbSendQuery(con, 'SELECT "title", "rental_duration", "length"
FROM "film"
WHERE ("rental_duration" > 5.0 AND "length" > 117.0)')
set.seed(5439)
chunk_num <- 0
while (!dbHasCompleted(result_set)) {
chunk_num <- chunk_num + 1
  chunk <- dbFetch(result_set, n = sample(7:13, 1)) # fetch a random-sized chunk
  chunk$chunk_num <- chunk_num
  if (!chunk_num %% 9) {print(chunk)} # print every ninth chunk
}
dbClearResult(result_set)
```
## Dividing the work between R on your machine and the DBMS
R and the DBMS work together; the challenge is deciding how to divide the work between them.
### Make the server do as much work as you can
* Use `dplyr::show_query` to generate a first draft of the SQL, which you may then refine and submit directly, or not.
### Criteria for choosing between `dplyr` and native SQL
This probably belongs later in the book.
* performance considerations: first get the right data, then worry about performance
* Trade-offs between leaving the data in PostgreSQL and bringing it into R:
+ browsing the data
+ larger samples and complete tables
+ using what you know to write efficient queries that do most of the work on the server
Where you place the `collect` function matters.
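For example, placing `collect` after the filter lets the server do the subsetting, while placing it before the filter pulls the whole table across the connection first. A minimal sketch, assuming the `rental` table from the `dvdrental` database (the `staff_id` condition is purely illustrative):

```{r}
# filter on the server, then retrieve only the matching rows
dplyr::tbl(con, "rental") %>%
  dplyr::filter(staff_id == 1) %>%
  dplyr::collect() %>%
  nrow()

# retrieve the whole table, then filter locally -- usually wasteful
dplyr::tbl(con, "rental") %>%
  dplyr::collect() %>%
  dplyr::filter(staff_id == 1) %>%
  nrow()
```

Both pipelines return the same row count, but the first ships far less data from the server to your R session.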
Here is a typical string of `dplyr` verbs strung together with the magrittr `%>%` pipe that we will use to tease out the different behaviors a lazy query exhibits when passed to different R functions. The query joins three lazy table references into a query we'll call `Q`:
```{r}
rental_table <- dplyr::tbl(con, "rental")
staff_table <- dplyr::tbl(con, "staff")
# the 'staff' table has 2 rows
customer_table <- dplyr::tbl(con, "customer")
# the 'customer' table has 599 rows
Q <- rental_table %>%
dplyr::left_join(staff_table, by = c("staff_id" = "staff_id")) %>%
dplyr::rename(staff_email = email) %>%
dplyr::left_join(customer_table, by = c("customer_id" = "customer_id")) %>%
dplyr::rename(customer_email = email) %>%
dplyr::select(rental_date, staff_email, customer_email)
```
```{r}
Q %>% dplyr::show_query()
```
Here is the SQL query formatted for readability:
```
SELECT "rental_date",
"staff_email",
"customer_email"
FROM (SELECT "rental_id",
"rental_date",
"inventory_id",
"customer_id",
"return_date",
"staff_id",
"last_update.x",
"first_name.x",
"last_name.x",
"address_id.x",
"staff_email",
"store_id.x",
"active.x",
"username",
"password",
"last_update.y",
"picture",
"store_id.y",
"first_name.y",
"last_name.y",
"email" AS "customer_email",
"address_id.y",
"activebool",
"create_date",
"last_update",
"active.y"
FROM (SELECT "TBL_LEFT"."rental_id" AS "rental_id",
"TBL_LEFT"."rental_date" AS "rental_date",
"TBL_LEFT"."inventory_id" AS "inventory_id",
"TBL_LEFT"."customer_id" AS "customer_id",
"TBL_LEFT"."return_date" AS "return_date",
"TBL_LEFT"."staff_id" AS "staff_id",
"TBL_LEFT"."last_update.x" AS "last_update.x",
"TBL_LEFT"."first_name" AS "first_name.x",
"TBL_LEFT"."last_name" AS "last_name.x",
"TBL_LEFT"."address_id" AS "address_id.x",
"TBL_LEFT"."staff_email" AS "staff_email",
"TBL_LEFT"."store_id" AS "store_id.x",
"TBL_LEFT"."active" AS "active.x",
"TBL_LEFT"."username" AS "username",
"TBL_LEFT"."password" AS "password",
"TBL_LEFT"."last_update.y" AS "last_update.y",
"TBL_LEFT"."picture" AS "picture",
"TBL_RIGHT"."store_id" AS "store_id.y",
"TBL_RIGHT"."first_name" AS "first_name.y",
"TBL_RIGHT"."last_name" AS "last_name.y",
"TBL_RIGHT"."email" AS "email",
"TBL_RIGHT"."address_id" AS "address_id.y",
"TBL_RIGHT"."activebool" AS "activebool",
"TBL_RIGHT"."create_date" AS "create_date",
"TBL_RIGHT"."last_update" AS "last_update",
"TBL_RIGHT"."active" AS "active.y"
FROM (SELECT "rental_id",
"rental_date",
"inventory_id",
"customer_id",
"return_date",
"staff_id",
"last_update.x",
"first_name",
"last_name",
"address_id",
"email" AS "staff_email",
"store_id",
"active",
"username",
"password",
"last_update.y",
"picture"
FROM (SELECT "TBL_LEFT"."rental_id" AS "rental_id",
"TBL_LEFT"."rental_date" AS
"rental_date",
"TBL_LEFT"."inventory_id" AS
"inventory_id",
"TBL_LEFT"."customer_id" AS
"customer_id",
"TBL_LEFT"."return_date" AS
"return_date",
"TBL_LEFT"."staff_id" AS "staff_id",
"TBL_LEFT"."last_update" AS
"last_update.x",
                      "TBL_RIGHT"."first_name" AS "first_name",
"TBL_RIGHT"."last_name" AS "last_name",
"TBL_RIGHT"."address_id" AS "address_id",
"TBL_RIGHT"."email" AS "email",
"TBL_RIGHT"."store_id" AS "store_id",
"TBL_RIGHT"."active" AS "active",
"TBL_RIGHT"."username" AS "username",
"TBL_RIGHT"."password" AS "password",
"TBL_RIGHT"."last_update" AS "last_update.y",
"TBL_RIGHT"."picture" AS "picture"
FROM "rental" AS "TBL_LEFT"
LEFT JOIN "staff" AS "TBL_RIGHT"
ON ( "TBL_LEFT"."staff_id" =
"TBL_RIGHT"."staff_id" ))
"ymdofxkiex") "TBL_LEFT"
LEFT JOIN "customer" AS "TBL_RIGHT"
ON ( "TBL_LEFT"."customer_id" =
"TBL_RIGHT"."customer_id" ))
"exddcnhait") "aohfdiedlb"
```
Hand-written SQL code to do the same job will probably look a lot nicer and could be more efficient, but functionally `dplyr` does the job.
```{r}
GQ <- dbGetQuery(
con,
"select r.rental_date, s.email staff_email,c.email customer_email
from rental r
left outer join staff s on r.staff_id = s.staff_id
left outer join customer c on r.customer_id = c.customer_id
"
)
```
But because `Q` hasn't been executed, we can still add to it. This behavior is the basis for a useful debugging and development process, in which queries are built up incrementally. Again, where you place the `collect` function matters.
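For instance, we can extend `Q` with further verbs before any data crosses the connection; a sketch (the aggregation chosen is illustrative):

```{r}
Q %>%
  dplyr::count(staff_email) %>% # still lazy: adds an aggregation to the query
  dplyr::collect()              # only now is the query executed on the server
```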
## Disconnect from the database and stop Docker
```{r}
dbDisconnect(con)
sp_docker_stop("sql-pet")
```