practice-day3-wrangling.qmd

---
title: "Wrangling data practice"
---


```{r setup, include=FALSE}
packages <- c("tidyverse", "lubridate", "rvest", "httr", "remotes", "readxl", "janitor", "learnr")
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())), repos = "https://cran.us.r-project.org")  
}

library(tidyverse)
library(learnr)
library(lubridate)
library(readxl)
library(janitor)

#myurl <- "https://www.cdc.gov/nchs/data/data_acces_files/NCHSURCodes2013.xlsx"
#download.file(myurl, (tf1 <- tempfile(fileext = ".xlsx")), mode = "wb")

#designations <- readxl::read_excel(tf1)
df <- read_csv("https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries.csv")


#county_pop <- read_csv("data/county_population.csv")
county_pop <- read_csv("https://www.andrewbatran.com/data/county_population.csv")

df_new <- df %>% 
  mutate(GEOID=str_c(fipsStateCode, fipsCountyCode))

joined_new <- left_join(df_new, county_pop, by="GEOID")
```


## Intro


Let's say you get some FEMA disaster declaration data. 


```{r import_og, warning=F, message=F}
df <- read_csv("https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries.csv")

glimpse(df)
```


But there are only so many different ways to slice the data based on the variables in the original data set.

So we have to be creative and think of what other variables we can add. That's the whole perk of being a journalist. Not accepting what you've got but thinking what you can bring to it.

Let's bring in county population data from the U.S. Census so we can estimate how many people were affected by different disasters.

For this exercise, the data's already pre-loaded but these are the commands to bring it in yourself using the excellent package called [**tidycensus**](https://learn.r-journalism.com/en/mapping/static_maps/static-maps/) (you'll need to sub in your own Census [API key](https://api.census.gov/data/key_signup.html)). Usually, you'd have to search for and download this from data.census.gov and do some manual cleaning up before using. But using this package that interfaces with the Census API using the lines of code below gives you the data cleaned up and in a **tidy** format.

```{r tidycensus, eval=FALSE}
# this is the code to bring in the data but it's already preloaded in this tutorial
library(tidycensus)
# census_api_key("API_KEY_GOES_HERE")
census_api_key("627062a9e52e5258145a638f744f779bd0ebb60f")


county_pop <- get_acs(geography="county", variables="B01003_001", year=2020)
```

Let's take a moment to talk about the significance of **tidy** data. It's the prefix to a lot of the packages and functions we're using in this class. But it's also a philosophy when approaching the structure of data.

There's an ideal structure for how to stack your data.

And that's with 

1. Each **variable** is in its own **column**
2. Each **case** is in its own **row**
3. Each **value** is in its own **cell**

Let's take a look at the new `county_pop` data frame we imported from the Census API.

```{r preview_county_pop, eval=T}
# use the function on the object you just imported from the Census API
```

```{r preview_county_pop-hint, eval=F}
The function starts with a *g*
```


Pay attention to the column names and what kind of data is in each column.

Next, let's take a look at our original FEMA data set (`df`).


```{r preview2, eval=T}
______(df)
```

```{r preview2-hint, eval=F}
The function starts with a *g*
```

### In each data set, which columns share the most similarity for locations? 

* GEOID in df and placeCode in county_pop
* designatedArea in df and NAME in county_pop
* variable in county_pop and id in df
* estimate in county_pop and disasterNumber in county_pop


## Joins

A join combines two data sets by adding the columns of one data set alongside the columns of the other, usually some of the rows of the second data set along with some rows of the first data set.

A successful join requires something consistent between two data sets to match on: keys.

The function that's used most often is `left_join()` because you have one main data set you'd like to supplement with additional columns.

Here's how that looks in action:

```{r left-join-image, out.width = "400px", echo=F}
knitr::include_graphics("images/left-join.gif")
```

The function works like this:


**If the two data frames you want to join have the same name:**

```
left_join(data_frame1, data_frame2, by="shared_column_name")
```

**If the two data frames you want to join have *different* names:**

```
left_join(data_frame1, data_frame2, by=c("df1_column"="df_2_column"))
```

Notice that the `c()` argument in the second example is different from how we've used it before as combine. The `=` column matching operator is specific to `_join()` functions. Type `?left_join()` in the R console to see all the other arguments you can use.

Now there are a few other joins that have their uses.

* [right_join()](https://github.com/gadenbuie/tidyexplain/blob/main/images/right-join.gif)
* [full_join()](https://github.com/gadenbuie/tidyexplain/blob/main/images/full-join.gif)
* [semi_join()](https://github.com/gadenbuie/tidyexplain/raw/main/images/semi-join.gif)
* [anti_join()](https://github.com/gadenbuie/tidyexplain/raw/main/images/anti-join.gif)

So let's try to create a new dataframe object starting with the disaster declarations of `df`.

If you looked at the two dataframes in the last exercise, you saw that there were similarities in the county names.

```{r glimpse_again}
glimpse(df)
glimpse(county_pop)
```

Try the `left_join()` function below using the correct syntax and columns you identified.

```{r left_join, eval=T}
joined <- left_join(df, __________, by=_____________)

glimpse(joined)              
```

```{r left_join-hint, eval=F}
Look at the example on how to join data frames with different column names above.
```

Alright, did this work?

We started out with 24 columns in `df` and now have 28 in the newly created `joined` data frame.

So columns were added. But did the data come with it?

When you scroll to the bottom of the `glimpse()` output you see a bunch of `NA`s.


```{r fail-image, out.width = "400px", echo=F}
knitr::include_graphics("images/fail.png")
```

### So what happened?

let's take a closer look at the first five data points in the two columns we joined on:

```{r closer, eval=TRUE}
df %>% 
  select(designatedArea) %>% 
  slice(___) %>% 
  pull(designatedArea)

county_pop %>% 
  select(NAME) %>% 
  slice(___) %>% 
  pull(NAME)
```

```{r closer-hint, eval=F}
Don't forget the operator that represents "through"
```


Alright, so even though they both contain county names the syntax is completely different.

The `df` data frame has parentheses around "County" and the `county_pop` data frame has a comma followed by the state names.

This is why the join ultimately failed. 

It's quite deceptive. You ran the code and didn't get an error.

This is why it's so important to get into the habit of checking for `NA`s after a join or inspecting the new data frame.

Failed joins have thrown off [many](https://www.thekeycuts.com/dear-analyst-a-spreadsheet-error-that-potentially-led-to-incorrect-economic-and-austerity-policies-after-2008-recession/) data analyses and will continue to do so.

**How to join these data sets**

The best way to join data is using a uniform identification number.

For the Census, they have standardized county numbers called GEOIDS. These geographical entities [also exist](https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html) for census tracts and states and other Census boundaries.

So the `county_pop` data frame has a column called `GEOID` -- that's perfect!

It looks like `df` has a column called `fipsCountyCode` but if you joined on those two columns, you'd still fail.

That's because `GEOID` in `county_pop` is 5 characters wide and `fipsCountyCode` in `df` is 3 characters wide.

Data is rarely ready to join straight out the box. 

It will take some more wrangling to get these data sets to join.

If you've had some experience with working with Census data then you know a county GEOID has 5 characters.

Therefore we need to transform `fipsCountyCode` in `df` by adding `fipsStateCode` in front of it.

To do that, we'll use a new function from a new package, [**stringr**](https://stringr.tidyverse.org/).

## stringr intro

There are many wonderful functions in the [**stringr**](https://stringr.tidyverse.org/) package that you do things like detect patterns, see if strings start with with a pattern, or split or join or substitute strings. 

In this instance, we need to combine strings.

We'll use the `str_c()` function. *Get it? It's short for String Combine.*

Using `mutate()` we'll also name the new column the same one in the `county_pop` so it's easier to join.

```{r str_c, eval=T}
#library(stringr)
#if you've loaded tidyverse, you've already loaded stringr

df_new <- df %>% 
  mutate(GEOID=____(fipsStateCode, fipsCountyCode))

df_new %>% 
  select(fipsStateCode, fipsCountyCode, GEOID) %>% 
  glimpse()
```

```{r str_c-hint, eval=F}
function starts with an *s*.
```

Alright! Now let's join `df_new` and `county_pop` like before.

```{r left_join2, eval=T}
joined_new <- _____join(df_new, __________, by=_______)

glimpse(joined_new)              
```

```{r left_join2-hint, eval=F}
Look at the example on how to join data frames with different column names above.
```


Aha! We did it! 

Let's do some quick analysis on it.

## Summarize

Now that we have population data with every declared disaster, let's see which 5 disaster type affected the most people in 2021 (sorted high to low).

Fill in the missing code.

```{r summarize, eval=T}
joined_new %>% 
  mutate(year=____(incidentBeginDate)) %>% 
  filter(______) %>% 
  group_by(______) %>% 
  summarize(population=sum(estimate, na.rm=T)) %>% 
  arrange(______) %>% 
  slice(___)
```

```{r summarize-hint, eval=F}
This is a summation of all the functions you've used so far!
```

### Types of data

* **Categorical variables** are descriptive labels given to individual records, assigning them to different groups. The simplest categorical data is dichotomous, meaning that there are just two possible groups — in an election, for instance, people either voted, or they did not. More commonly, there are multiple categories. When analyzing traffic accidents, for example, you might consider the day of the week on which each incident occurred, giving seven possible categories.
* **Continuous data** is richer, consisting of numbers that can have a range of values on a sliding scale. When working with weather data, for instance, continuous variables might include temperature and amount of rainfall.

We also often need to consider date and time, which can be treated as continuous, like a sequence of years; or categorical, like the days of the week. A common task in data stories is to consider how the values for a variable or variables have changed over time.

### Interviewing data

The goal is to get used to asking questions of data by performing the following basic operations with the functions you've learned:

* **Sort**: Largest to smallest, oldest to newest, alphabetical etc.
* **Filter**: Select a defined subset of the data.
* **Summarize**: Derive one value from a series of other values to produce a summary statistic. Examples include:
  * **Count**. The number of records.
  * **Sum**. Add the values of a continuous variable.
  * **Mean** (aka average). The sum of values for a continuous variable divided by the count.
  * **Median**. The value in the middle, if the values for a continuous variable are sorted in ascending or descending order of magnitude.
  * **Max**, **Min**. The largest and smallest value for a continuous value, respectively.
* Math: Move the summarized data into a new sheet for additional analysis

Often you will **group by** a *categorical* variable first, and then summarize a *continuous* variable for each category.

Let's try to summarize a different way. We added up the population for all incident types in 2021. 

But this time, let's find the average and median population affected by all incident types in the entire data set.

Arrange it high to low (on avg_pop) and slice out the top 5 rows.

```{r summarize_again, eval=T}
joined_new %>% 
  group_by(___________) %>% 
  summarize(declarations=___,
            avg_pop=____(estimate, na.rm=T),
            median_pop=______(estimate, na.rm=T)) %>% 
  arrange(___________) %>% 
  slice(___)
```

```{r summarize_again-hint, eval=F}
the function to find average is mean()
```

Pretty interesting, right? 

I don't know if this could lead to a story because the top three incident types that affected the highest average amount of people occurred so rarely. 

## Newsroom math

Nearly every news story that involves data analysis can be derived from one these formulas.

* Difference
  * x - y
* Percent
  * x / (x + y) * 100
* Percent change
  * (new - old)/old * 100
* Per Capita
  * x / population * some multiplier to raise result to a round number

So let's say we want to write a story about Kentucky flooding.

One thing we can ask is what has changed? Have things gotten worse or have things improved?

Let's wrangle the data so we can easily answer that.

We'll need to only compare the current months of 2022 with the past months (otherwise we'd be comparing 12 months of data in 2021 to 8 in 2022 which would be misleading).

```{r flooding}
joined_new %>% 
  filter(state=="KY") %>% 
  filter(incidentType=="Flood") %>% 
  mutate(year=year(incidentBeginDate)) %>% 
  # extracting months
  mutate(month=month(incidentBeginDate)) %>% 
  # only paying attention to months in current year of data set
  filter(month %in% c(1:9)) %>% 
  filter(year==2020 | year==2021 | year==2022 | year==2023) %>% 
  group_by(year) %>% 
  summarize(declarations=n(),
            avg_pop=mean(estimate, na.rm=T),
            median_pop=median(estimate, na.rm=T))
```

How many more county Flood declarations were there in Kentucky in 2021 compared to 2022? Figure this out using your console.

____?

Great job so far.

Before we try out more math we'll need to learn more techniques to transform the data.


## tidyr

You need to understand the basics of math to tell a story.

Let's say you're looking at this data because some local disaster occurred and you want to answer the question:

* Are things worse now than they were before? 
* Which place has it worst and most recently? 
  * Because you can go visit that place and find victims to anchor the story narratively
  
Being able to come up with types of questions and answer them yourself using raw data will help you stand apart from the competition. 

Because you're working with raw data intended for use by an agency for one thing, you'll need to be able to reshape the data so you can do your own analysis, which will include math (such as difference, percents, percent change, and per capita).

One advanced technique for transforming data you'll learn in this section is from the **tidyr** package.

* `pivot_wider()`
* `pivot_longer()`

Now, these used to be called `gather()` and `spread()` but the language is a bit clearer now.

Pivots in R mean something else entirely than pivots in Excel.

In Excel, pivot tables are used to group and summarize data.

In R, you pivot data as in you reshape it. This way you can do math easier across all rows.

Here's how it works (pay attention to the colors):

```{r pivot-image1, out.width = "400px", echo=F, eval=F}
knitr::include_graphics("images/original-dfs-tidy.png")
```


```{r pivot-image2, out.width = "400px", echo=F}
knitr::include_graphics("images/tidyr-pivoting.gif")
```

Let's start with this data that we last ended up with. 

But this time comment comment out the second line so we include all states with floods this time.

And in the 10th line, add "state" as a second argument in the `group_by()` option after "year".

```{r what, eval=TRUE}
long_flood <- joined_new %>% 
  filter(incidentType=="Flood") %>% 
  mutate(year=year(incidentBeginDate)) %>% 
  # extracting months
  mutate(month=month(incidentBeginDate)) %>% 
  # only paying attention to months in current year of data set
  filter(month %in% c(1:9)) %>% 
  filter(year==2020 | year==2021 | year==2022 | year==2023) %>% 
  group_by(year, _____) %>% 
  summarize(declarations=n(),
            avg_pop=mean(estimate, na.rm=T),
            median_pop=median(estimate, na.rm=T))

long_flood
```

```{r what-hint, eval=F}
Use a # to comment out.
Also, you don't need quotations around column names in
group_by() unless there's a space in the column names.
Multiple arguments in group_by() are separated by a comma.
```


Okay, we have tidy data! Each variable is in its own column. Each case is in its own row and each value is in its own cell.

This makes it easier to mutate and manipulate (and is also the preferable data structure for most data viz tools).

However, if we wanted to compare 2020 declaration counts to 2021 and 2022 in each state, that would be difficult.

## pivot_wider()

So we need to turn this long data into wide data using `pivot_wider()`

You need to identify what column you want to pull the new column names from and which column the values are stored in ("year" and "declarations" respectively. In this specific circumstance ("pivot_") you'll need to put the column names in quotation marks.

We'll also need to drop the `avg_pop` and `median_pop` columns or else the pivot will fail.

To drop columns, you use the `-` subtract sign in the `select()` function.

```{r pivot_wider, eval=TRUE}
wide_flood <- long_flood %>% 
  select(-avg_pop, -median_pop) %>% 
  pivot_wider(names_from="____",
              values_from="____________")

wide_flood
```


```{r pivot_wider-hint, eval=F}
In this instance, you do need to put the column names in quotation marks.
```

Great job! We can clearly see most states had flooding in only one year. 

Kentucky had 50 in 2021 and 20 in 2022 (Okay, the number may change in 2022 depending on when you pull this data and how unlucky Kentucky continues to be). 

We knew that from our last section but now we can see that it's the only state that had floods in both years.

But we can now answer one of our questions:

* Which place has it worst and most recently? 

If you paginate through the results you'll see one state in particular went from no floods in 2021 to 19 in 2022.

Next, we can use `summarize()` and `mutate()` to do some math to answer our first question:

* Are things worse now than they were before? 

Note: Because the column names we care about starts with a number, we need to surround the column names ``like `this` in the code.`` The key to the left of the `1` on the keyboard. This is also what you'd have to do if the column names had spaces in them.

Add up all the floods by year with summarize and then add a column that calculates the percent change between 2022 and 2021.

Don't forget to add the argument that ignores any `NA` values in the `sum()` formula.

```{r wide_flood_summary, eval=TRUE}
flood_percent_change <- wide_flood %>% 
  summarize(`2020`=sum(______, _____=T),
            `2021`=sum(______, _____=T),
            `2022`=sum(______, _____=T)) %>% 
  ______(percent_change=(round((`____`-`____`)/`____`*100,1)))

flood_percent_change
```


```{r wide_flood_summary-hint, eval=F}
The formula for percent change is (new-old)/old*100. Also remember the differences between aggregating a data frame and adding a column to an existing data frame.
```


## pivot_wider() multiple

What's really powerful about `pivot_wider()` is the option to be able to draw values from more than one column.

Adjust the code below:

1. Delete the `-avg_pop` argument in line 2 (and the comma)
2. in `values_from`, instead of "declarations" type in `c("declarations", "avg_pop")

See what happens. *Also, type it in, don't copy and paste otherwise you'll run into syntax issues.*

```{r pivot_wider_more, eval=TRUE}
wide_flood_more <- long_flood %>% 
  select(-avg_pop, -median_pop) %>% 
  pivot_wider(names_from="year",
              values_from="declarations")

wide_flood_more
```

```{r pivot_wider_more-hint, eval=F}
Be careful about syntax.
Also, you still want to pull names from "year"
```


You'll have to paginate a bit but you can see that the declarations and average are now prefixes to the `_2021` etc years.

You could also bring in the median values this way if you want.

Okay, now that we know how to make long data to wide, let's convert wide data to long.

## pivot_longer()

Sometimes you'll get data that looks like this:

```{r fire_data, eval=T}


fires_wide <- joined_new %>% 
  filter(incidentType=="Fire") %>% 
  mutate(year=year(incidentBeginDate)) %>% 
  count(year, NAME) %>% 
  filter(!is.na(NAME)) %>% 
  pivot_wider(names_from="year", values_from=n)
```

These are the declared fire disasters in each county since 1956.

A state or column in one row and every column after is a different year of data.

This is not tidy data. Every year should be its own row. This is fine if we want to calculate changes between individual years.

But if we wanted to visualize this, we'd need to transform it long.

We want to increase the number of rows and decrease the number of columns.

We'll use the `pivot_longer()` function which needs:

1. What columns to focus on `cols=`
2. What to name the column with the names of the columns `names_to=`
3. What to name the column with the values `values_to=`


```{r pivot_longer, eval=TRUE}
# We could do cols=`1967`:`2023`
# or we could do the number of columns in the dataframe as in cols=2:42
# let's do the first option in this one

fires_wide %>% 
  pivot_______(cols=_____________,
               _____to="year",
               _______to="declarations")
  
```


```{r pivot_longer-hint, eval=F}
Make sure you name the arguments correctly.
```


Alright! We did it!

You now have a handle on all the biggest verbs used to wrangle and transform data.

There are many more functions that do more specific things, of course.

But this will hopefully get you started on your way. Everything else you may have questions on how to do has probably been asked and answered out in the R online community.