/
practice-day3-wrangling.qmd
618 lines (391 loc) · 21.8 KB
/
practice-day3-wrangling.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
---
title: "Wrangling data practice"
---
```{r setup, include=FALSE}
packages <- c("tidyverse", "lubridate", "rvest", "httr", "remotes", "readxl", "janitor", "learnr")
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
install.packages(setdiff(packages, rownames(installed.packages())), repos = "https://cran.us.r-project.org")
}
library(tidyverse)
library(learnr)
library(lubridate)
library(readxl)
library(janitor)
#myurl <- "https://www.cdc.gov/nchs/data/data_acces_files/NCHSURCodes2013.xlsx"
#download.file(myurl, (tf1 <- tempfile(fileext = ".xlsx")), mode = "wb")
#designations <- readxl::read_excel(tf1)
df <- read_csv("https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries.csv")
#county_pop <- read_csv("data/county_population.csv")
county_pop <- read_csv("https://www.andrewbatran.com/data/county_population.csv")
df_new <- df %>%
mutate(GEOID=str_c(fipsStateCode, fipsCountyCode))
joined_new <- left_join(df_new, county_pop, by="GEOID")
```
## Intro
Let's say you get some FEMA disaster declaration data.
```{r import_og, warning=F, message=F}
df <- read_csv("https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries.csv")
glimpse(df)
```
But there are only so many different ways to slice the data based on the variables in the original data set.
So we have to be creative and think of what other variables we can add. That's the whole perk of being a journalist. Not accepting what you've got but thinking what you can bring to it.
Let's bring in county population data from the U.S. Census so we can estimate how many people were affected by different disasters.
For this exercise, the data's already pre-loaded but these are the commands to bring it in yourself using the excellent package called [**tidycensus**](https://learn.r-journalism.com/en/mapping/static_maps/static-maps/) (you'll need to sub in your own Census [API key](https://api.census.gov/data/key_signup.html)). Usually, you'd have to search for and download this from data.census.gov and do some manual cleaning up before using. But using this package that interfaces with the Census API using the lines of code below gives you the data cleaned up and in a **tidy** format.
```{r tidycensus, eval=FALSE}
# this is the code to bring in the data but it's already preloaded in this tutorial
library(tidycensus)
# census_api_key("API_KEY_GOES_HERE")
census_api_key("627062a9e52e5258145a638f744f779bd0ebb60f")
county_pop <- get_acs(geography="county", variables="B01003_001", year=2020)
```
Let's take a moment to talk about the significance of **tidy** data. It's the prefix to a lot of the packages and functions we're using in this class. But it's also a philosophy when approaching the structure of data.
There's an ideal structure for how to stack your data.
And that's with
1. Each **variable** is in its own **column**
2. Each **case** is in its own **row**
3. Each **value** is in its own **cell**
Let's take a look at the new `county_pop` data frame we imported from the Census API.
```{r preview_county_pop, eval=T}
# use the function on the object you just imported from the Census API
```
```{r preview_county_pop-hint, eval=F}
The function starts with a *g*
```
Pay attention to the column names and what kind of data is in each column.
Next, let's take a look at our original FEMA data set (`df`).
```{r preview2, eval=T}
______(df)
```
```{r preview2-hint, eval=F}
The function starts with a *g*
```
### In each data set, which columns share the most similarity for locations?
* GEOID in df and placeCode in county_pop
* designatedArea in df and NAME in county_pop
* variable in county_pop and id in df
* estimate in county_pop and disasterNumber in county_pop
## Joins
A join combines two data sets by adding the columns of one data set alongside the columns of the other, usually some of the rows of the second data set along with some rows of the first data set.
A successful join requires something consistent between two data sets to match on: keys.
The function that's used most often is `left_join()` because you have one main data set you'd like to supplement with additional columns.
Here's how that looks in action:
```{r left-join-image, out.width = "400px", echo=F}
knitr::include_graphics("images/left-join.gif")
```
The function works like this:
**If the two data frames you want to join have the same name:**
```
left_join(data_frame1, data_frame2, by="shared_column_name")
```
**If the two data frames you want to join have *different* names:**
```
left_join(data_frame1, data_frame2, by=c("df1_column"="df_2_column"))
```
Notice that the `c()` argument in the second example is different from how we've used it before as combine. The `=` column matching operator is specific to `_join()` functions. Type `?left_join()` in the R console to see all the other arguments you can use.
Now there are a few other joins that have their uses.
* [right_join()](https://github.com/gadenbuie/tidyexplain/blob/main/images/right-join.gif)
* [full_join()](https://github.com/gadenbuie/tidyexplain/blob/main/images/full-join.gif)
* [semi_join()](https://github.com/gadenbuie/tidyexplain/raw/main/images/semi-join.gif)
* [anti_join()](https://github.com/gadenbuie/tidyexplain/raw/main/images/anti-join.gif)
So let's try to create a new dataframe object starting with the disaster declarations of `df`.
If you looked at the two dataframes in the last exercise, you saw that there were similarities in the county names.
```{r glimpse_again}
glimpse(df)
glimpse(county_pop)
```
Try the `left_join()` function below using the correct syntax and columns you identified.
```{r left_join, eval=T}
joined <- left_join(df, __________, by=_____________)
glimpse(joined)
```
```{r left_join-hint, eval=F}
Look at the example on how to join data frames with different column names above.
```
Alright, did this work?
We started out with 24 columns in `df` and now have 28 in the newly created `joined` data frame.
So columns were added. But did the data come with it?
When you scroll to the bottom of the `glimpse()` output you see a bunch of `NA`s.
```{r fail-image, out.width = "400px", echo=F}
knitr::include_graphics("images/fail.png")
```
### So what happened?
let's take a closer look at the first five data points in the two columns we joined on:
```{r closer, eval=TRUE}
df %>%
select(designatedArea) %>%
slice(___) %>%
pull(designatedArea)
county_pop %>%
select(NAME) %>%
slice(___) %>%
pull(NAME)
```
```{r closer-hint, eval=F}
Don't forget the operator that represents "through"
```
Alright, so even though they both contain county names the syntax is completely different.
The `df` data frame has parentheses around "County" and the `county_pop` data frame has a comma followed by the state names.
This is why the join ultimately failed.
It's quite deceptive. You ran the code and didn't get an error.
This is why it's so important to get into the habit of checking for `NA`s after a join or inspecting the new data frame.
Failed joins have thrown off [many](https://www.thekeycuts.com/dear-analyst-a-spreadsheet-error-that-potentially-led-to-incorrect-economic-and-austerity-policies-after-2008-recession/) data analyses and will continue to do so.
**How to join these data sets**
The best way to join data is using a uniform identification number.
For the Census, they have standardized county numbers called GEOIDS. These geographical entities [also exist](https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html) for census tracts and states and other Census boundaries.
So the `county_pop` data frame has a column called `GEOID` -- that's perfect!
It looks like `df` has a column called `fipsCountyCode` but if you joined on those two columns, you'd still fail.
That's because `GEOID` in `county_pop` is 5 characters wide and `fipsCountyCode` in `df` is 3 characters wide.
Data is rarely ready to join straight out the box.
It will take some more wrangling to get these data sets to join.
If you've had some experience with working with Census data then you know a county GEOID has 5 characters.
Therefore we need to transform `fipsCountyCode` in `df` by adding `fipsStateCode` in front of it.
To do that, we'll use a new function from a new package, [**stringr**](https://stringr.tidyverse.org/).
## stringr intro
There are many wonderful functions in the [**stringr**](https://stringr.tidyverse.org/) package that you do things like detect patterns, see if strings start with with a pattern, or split or join or substitute strings.
In this instance, we need to combine strings.
We'll use the `str_c()` function. *Get it? It's short for String Combine.*
Using `mutate()` we'll also name the new column the same one in the `county_pop` so it's easier to join.
```{r str_c, eval=T}
#library(stringr)
#if you've loaded tidyverse, you've already loaded stringr
df_new <- df %>%
mutate(GEOID=____(fipsStateCode, fipsCountyCode))
df_new %>%
select(fipsStateCode, fipsCountyCode, GEOID) %>%
glimpse()
```
```{r str_c-hint, eval=F}
function starts with an *s*.
```
Alright! Now let's join `df_new` and `county_pop` like before.
```{r left_join2, eval=T}
joined_new <- _____join(df_new, __________, by=_______)
glimpse(joined_new)
```
```{r left_join2-hint, eval=F}
Look at the example on how to join data frames with different column names above.
```
Aha! We did it!
Let's do some quick analysis on it.
## Summarize
Now that we have population data with every declared disaster, let's see which 5 disaster type affected the most people in 2021 (sorted high to low).
Fill in the missing code.
```{r summarize, eval=T}
joined_new %>%
mutate(year=____(incidentBeginDate)) %>%
filter(______) %>%
group_by(______) %>%
summarize(population=sum(estimate, na.rm=T)) %>%
arrange(______) %>%
slice(___)
```
```{r summarize-hint, eval=F}
This is a summation of all the functions you've used so far!
```
### Types of data
* **Categorical variables** are descriptive labels given to individual records, assigning them to different groups. The simplest categorical data is dichotomous, meaning that there are just two possible groups — in an election, for instance, people either voted, or they did not. More commonly, there are multiple categories. When analyzing traffic accidents, for example, you might consider the day of the week on which each incident occurred, giving seven possible categories.
* **Continuous data** is richer, consisting of numbers that can have a range of values on a sliding scale. When working with weather data, for instance, continuous variables might include temperature and amount of rainfall.
We also often need to consider date and time, which can be treated as continuous, like a sequence of years; or categorical, like the days of the week. A common task in data stories is to consider how the values for a variable or variables have changed over time.
### Interviewing data
The goal is to get used to asking questions of data by performing the following basic operations with the functions you've learned:
* **Sort**: Largest to smallest, oldest to newest, alphabetical etc.
* **Filter**: Select a defined subset of the data.
* **Summarize**: Derive one value from a series of other values to produce a summary statistic. Examples include:
* **Count**. The number of records.
* **Sum**. Add the values of a continuous variable.
* **Mean** (aka average). The sum of values for a continuous variable divided by the count.
* **Median**. The value in the middle, if the values for a continuous variable are sorted in ascending or descending order of magnitude.
* **Max**, **Min**. The largest and smallest value for a continuous value, respectively.
* Math: Move the summarized data into a new sheet for additional analysis
Often you will **group by** a *categorical* variable first, and then summarize a *continuous* variable for each category.
Let's try to summarize a different way. We added up the population for all incident types in 2021.
But this time, let's find the average and median population affected by all incident types in the entire data set.
Arrange it high to low (on avg_pop) and slice out the top 5 rows.
```{r summarize_again, eval=T}
joined_new %>%
group_by(___________) %>%
summarize(declarations=___,
avg_pop=____(estimate, na.rm=T),
median_pop=______(estimate, na.rm=T)) %>%
arrange(___________) %>%
slice(___)
```
```{r summarize_again-hint, eval=F}
the function to find average is mean()
```
Pretty interesting, right?
I don't know if this could lead to a story because the top three incident types that affected the highest average amount of people occurred so rarely.
## Newsroom math
Nearly every news story that involves data analysis can be derived from one these formulas.
* Difference
* x - y
* Percent
* x / (x + y) * 100
* Percent change
* (new - old)/old * 100
* Per Capita
* x / population * some multiplier to raise result to a round number
So let's say we want to write a story about Kentucky flooding.
One thing we can ask is what has changed? Have things gotten worse or have things improved?
Let's wrangle the data so we can easily answer that.
We'll need to only compare the current months of 2022 with the past months (otherwise we'd be comparing 12 months of data in 2021 to 8 in 2022 which would be misleading).
```{r flooding}
joined_new %>%
filter(state=="KY") %>%
filter(incidentType=="Flood") %>%
mutate(year=year(incidentBeginDate)) %>%
# extracting months
mutate(month=month(incidentBeginDate)) %>%
# only paying attention to months in current year of data set
filter(month %in% c(1:9)) %>%
filter(year==2020 | year==2021 | year==2022 | year==2023) %>%
group_by(year) %>%
summarize(declarations=n(),
avg_pop=mean(estimate, na.rm=T),
median_pop=median(estimate, na.rm=T))
```
How many more county Flood declarations were there in Kentucky in 2021 compared to 2022? Figure this out using your console.
____?
Great job so far.
Before we try out more math we'll need to learn more techniques to transform the data.
## tidyr
You need to understand the basics of math to tell a story.
Let's say you're looking at this data because some local disaster occurred and you want to answer the question:
* Are things worse now than they were before?
* Which place has it worst and most recently?
* Because you can go visit that place and find victims to anchor the story narratively
Being able to come up with types of questions and answer them yourself using raw data will help you stand apart from the competition.
Because you're working with raw data intended for use by an agency for one thing, you'll need to be able to reshape the data so you can do your own analysis, which will include math (such as difference, percents, percent change, and per capita).
One advanced technique for transforming data you'll learn in this section is from the **tidyr** package.
* `pivot_wider()`
* `pivot_longer()`
Now, these used to be called `gather()` and `spread()` but the language is a bit clearer now.
Pivots in R mean something else entirely than pivots in Excel.
In Excel, pivot tables are used to group and summarize data.
In R, you pivot data as in you reshape it. This way you can do math easier across all rows.
Here's how it works (pay attention to the colors):
```{r pivot-image1, out.width = "400px", echo=F, eval=F}
knitr::include_graphics("images/original-dfs-tidy.png")
```
```{r pivot-image2, out.width = "400px", echo=F}
knitr::include_graphics("images/tidyr-pivoting.gif")
```
Let's start with this data that we last ended up with.
But this time comment comment out the second line so we include all states with floods this time.
And in the 10th line, add "state" as a second argument in the `group_by()` option after "year".
```{r what, eval=TRUE}
long_flood <- joined_new %>%
filter(incidentType=="Flood") %>%
mutate(year=year(incidentBeginDate)) %>%
# extracting months
mutate(month=month(incidentBeginDate)) %>%
# only paying attention to months in current year of data set
filter(month %in% c(1:9)) %>%
filter(year==2020 | year==2021 | year==2022 | year==2023) %>%
group_by(year, _____) %>%
summarize(declarations=n(),
avg_pop=mean(estimate, na.rm=T),
median_pop=median(estimate, na.rm=T))
long_flood
```
```{r what-hint, eval=F}
Use a # to comment out.
Also, you don't need quotations around column names in
group_by() unless there's a space in the column names.
Multiple arguments in group_by() are separated by a comma.
```
Okay, we have tidy data! Each variable is in its own column. Each case is in its own row and each value is in its own cell.
This makes it easier to mutate and manipulate (and is also the preferable data structure for most data viz tools).
However, if we wanted to compare 2020 declaration counts to 2021 and 2022 in each state, that would be difficult.
## pivot_wider()
So we need to turn this long data into wide data using `pivot_wider()`
You need to identify what column you want to pull the new column names from and which column the values are stored in ("year" and "declarations" respectively. In this specific circumstance ("pivot_") you'll need to put the column names in quotation marks.
We'll also need to drop the `avg_pop` and `median_pop` columns or else the pivot will fail.
To drop columns, you use the `-` subtract sign in the `select()` function.
```{r pivot_wider, eval=TRUE}
wide_flood <- long_flood %>%
select(-avg_pop, -median_pop) %>%
pivot_wider(names_from="____",
values_from="____________")
wide_flood
```
```{r pivot_wider-hint, eval=F}
In this instance, you do need to put the column names in quotation marks.
```
Great job! We can clearly see most states had flooding in only one year.
Kentucky had 50 in 2021 and 20 in 2022 (Okay, the number may change in 2022 depending on when you pull this data and how unlucky Kentucky continues to be).
We knew that from our last section but now we can see that it's the only state that had floods in both years.
But we can now answer one of our questions:
* Which place has it worst and most recently?
If you paginate through the results you'll see one state in particular went from no floods in 2021 to 19 in 2022.
Next, we can use `summarize()` and `mutate()` to do some math to answer our first question:
* Are things worse now than they were before?
Note: Because the column names we care about starts with a number, we need to surround the column names ``like `this` in the code.`` The key to the left of the `1` on the keyboard. This is also what you'd have to do if the column names had spaces in them.
Add up all the floods by year with summarize and then add a column that calculates the percent change between 2022 and 2021.
Don't forget to add the argument that ignores any `NA` values in the `sum()` formula.
```{r wide_flood_summary, eval=TRUE}
flood_percent_change <- wide_flood %>%
summarize(`2020`=sum(______, _____=T),
`2021`=sum(______, _____=T),
`2022`=sum(______, _____=T)) %>%
______(percent_change=(round((`____`-`____`)/`____`*100,1)))
flood_percent_change
```
```{r wide_flood_summary-hint, eval=F}
The formula for percent change is (new-old)/old*100. Also remember the differences between aggregating a data frame and adding a column to an existing data frame.
```
## pivot_wider() multiple
What's really powerful about `pivot_wider()` is the option to be able to draw values from more than one column.
Adjust the code below:
1. Delete the `-avg_pop` argument in line 2 (and the comma)
2. in `values_from`, instead of "declarations" type in `c("declarations", "avg_pop")
See what happens. *Also, type it in, don't copy and paste otherwise you'll run into syntax issues.*
```{r pivot_wider_more, eval=TRUE}
wide_flood_more <- long_flood %>%
select(-avg_pop, -median_pop) %>%
pivot_wider(names_from="year",
values_from="declarations")
wide_flood_more
```
```{r pivot_wider_more-hint, eval=F}
Be careful about syntax.
Also, you still want to pull names from "year"
```
You'll have to paginate a bit but you can see that the declarations and average are now prefixes to the `_2021` etc years.
You could also bring in the median values this way if you want.
Okay, now that we know how to make long data to wide, let's convert wide data to long.
## pivot_longer()
Sometimes you'll get data that looks like this:
```{r fire_data, eval=T}
fires_wide <- joined_new %>%
filter(incidentType=="Fire") %>%
mutate(year=year(incidentBeginDate)) %>%
count(year, NAME) %>%
filter(!is.na(NAME)) %>%
pivot_wider(names_from="year", values_from=n)
```
These are the declared fire disasters in each county since 1956.
A state or column in one row and every column after is a different year of data.
This is not tidy data. Every year should be its own row. This is fine if we want to calculate changes between individual years.
But if we wanted to visualize this, we'd need to transform it long.
We want to increase the number of rows and decrease the number of columns.
We'll use the `pivot_longer()` function which needs:
1. What columns to focus on `cols=`
2. What to name the column with the names of the columns `names_to=`
3. What to name the column with the values `values_to=`
```{r pivot_longer, eval=TRUE}
# We could do cols=`1967`:`2023`
# or we could do the number of columns in the dataframe as in cols=2:42
# let's do the first option in this one
fires_wide %>%
pivot_______(cols=_____________,
_____to="year",
_______to="declarations")
```
```{r pivot_longer-hint, eval=F}
Make sure you name the arguments correctly.
```
Alright! We did it!
You now have a handle on all the biggest verbs used to wrangle and transform data.
There are many more functions that do more specific things, of course.
But this will hopefully get you started on your way. Everything else you may have questions on how to do has probably been asked and answered out in the R online community.