lesson-day3-01-wrangling.qmd

---
title: "Tidying data"
author: "Andrew Ba Tran"
---


# Joins

Let's start out with two data frames: x and y

```{r dfs}
#| echo: TRUE

x <- data.frame(id=c(1,2,3), x=c("x1", "x2", "x3"))

x
```

:::{.fragment}
```{r}
#| echo: TRUE
y <- data.frame(id=c(1,2,4), y=c("y1", "y2", "y4"))

y
```
:::

**Two data frames**

Here are the two data frames we created color coded.

```{r, echo=F, fig.retina=TRUE, out.width=400}
knitr::include_graphics("images/original-dfs.png")
```


### left_join()

The most typical join is a left join.

The function requires two arguments: The original dataframe and a dataframe to join to it. 

Check out the results:

```{r left, warning=F, message=F}
#| echo: TRUE
library(dplyr)

left_join(x, y)
```

Did you notice the `NA` in the **y** column?

**left_join() illustrated**

Here's why the `NA` shows up.

![](https://ucd-cws.github.io/CABW2020_R_training/images/left-join.gif)

Because there is no 4 value to join with in the *original* dataframe.

Alright, that was easy because both dataframes had the same name to join on.

But what happens if the data you want to join on have different column names?

**Two data frames: x and y but with different column names**

```{r dfs_again}
#| echo: TRUE
x <- data.frame(id=c(1,2,3), x=c("x1", "x2", "x3"))

x
```

::: {.fragment}
```{r}
#| echo: TRUE
y <- data.frame(new_id=c(1,2,4), y=c("y1", "y2", "y4"))

y
```
:::

So the two columns are no longer both `id` but instead one is `id` and the other one is `new_id`.

Add the argument `by=c()`...

:::{.fragment}
```{r left2}
#| echo: TRUE
left_join(x, y, by=c("id"="new_id"))
```
:::

Easy!

But...

**Watch out for repeated data**

Dataframe 1:

```{r left3}
#| echo: FALSE
x <- data.frame(id=c(1,2,3), 
                x=c("x1", "x2", "x3"))

x
```

Dataframe 2: There are multiple 2s in id.

```{r left4}
#| echo: FALSE
y <- data.frame(id=c(1,2,4,2), 
                y=c("y1", "y2", "y4", "y5"))

y
```

So when you join...

:::{.fragment}
```{r}
#| echo: TRUE
left_join(x, y)
```
:::

This could be bad if you were expecting only 1 row of each!!

**Extra rows illustrated**

![](https://github.com/gadenbuie/tidyexplain/raw/main/images/left-join-extra.gif)

So be careful. Always look at the results of your join to make sure it didn't mess up.

Now, here's another version of a join, but this time in the other direction.

### right_join()


![](https://github.com/gadenbuie/tidyexplain/raw/main/images/right-join.gif)

And here are a bunch of other joins for reference:

### full_join()

![](https://github.com/gadenbuie/tidyexplain/raw/main/images/full-join.gif)

### inner_join()

![](https://github.com/gadenbuie/tidyexplain/raw/main/images/inner-join.gif)

### anti_join()

![](https://github.com/gadenbuie/tidyexplain/raw/main/images/anti-join.gif)

## stringr package

Let's go over some strategies on cleaning up text using the [**stringr**](https://stringr.tidyverse.org/) package.

## stringr functions

Key `stringr` functions:

In this section, we will learn the following `stringr` functions:

:::{.incremental}
* `str_to_upper()` `str_to_lower()` `str_to_title()`
* `str_trim()` `str_squish()`
* `str_c()`
* `str_detect()`
* `str_subset()`
* `str_sub()`
:::

## stringr in action

```{r str_to}
#| echo: TRUE

library(stringr)

test_text <- "tHiS iS A rANsOM noTE!"
```

:::{.fragment}
```{r}
#| echo: TRUE
str_to_upper(test_text)
```
:::

:::{.fragment}
```{r}
#| echo: TRUE
str_to_lower(test_text)
```
:::

:::{.fragment}
```{r}
#| echo: TRUE
str_to_title(test_text)
```
:::

## Trimming strings

```{r trim}
#| echo: TRUE
test_text <- "  trim both   "

test_text 
```

:::{.fragment}
```{r}
#| echo: TRUE
str_trim(test_text, side="both")
```
:::

:::{.fragment}
```{r}
#| echo: TRUE
str_trim(test_text, side="left")
```
:::

:::{.fragment}
```{r}
#| echo: TRUE
str_trim(test_text, side="right")
```
:::

:::{.fragment}
```{r}
#| echo: TRUE
messy_text <- "  sometimes  you get   this "
```
:::

:::{.fragment}
```{r}
#| echo: TRUE
str_squish(messy_text)
```
:::

## Combining strings

```{r str_c}
#| echo: TRUE
text_a <- "one"

text_b <- "two"

text_a

text_b
```

:::{.fragment}
```{r}
#| echo: TRUE
str_c(text_a, text_b)
```
:::

:::{.fragment}
```{r}
#| echo: TRUE
str_c(text_a, text_b, sep="-")
```
:::

:::{.fragment}
```{r}
#| echo: TRUE
str_c(text_a, text_b, sep=" and a ")
```
:::

:::{.fragment}
```{r}
#| echo: TRUE
str_c(text_a, " and a ", text_b)
```
:::

## Extracting strings

```{r extract}
#| echo: TRUE
test_text <- "Hello world"

test_text 
```

:::{.fragment}
```{r}
#| echo: TRUE
str_sub(test_text, start = 6)
```
:::

:::{.fragment}
```{r}
#| echo: TRUE
str_sub(test_text, end = 5) <- "Howdy"

test_text
```
:::


:::{.fragment}
```{r}
#| echo: TRUE
cn <- "Kemp County, Georgia"

cn 

str_replace(cn, " County, .*", "")
```
:::

## More stringr functions
More functions in [stringr](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf) and more info on regular expressions [here](https://raw.githubusercontent.com/rstudio/cheatsheets/main/regex.pdf).

## parse_number()

A really important function that will help format your numbers.

(from the readr package)

```{r parse}
#| echo: TRUE

library(readr)
messy_numbers <- c("$5.00", "9,343,200", "6.0%")

messy_numbers
```

:::{.fragment}
```{r}
#| echo: TRUE
parse_number(messy_numbers)
```
:::


## parse_number()

![](images/parse_number.png)


## Your turn

**practice-day3-wrangling**

Get as far as you can in the time we have! 

## Tidying data

**Sample data**

(You don't have to type this out)

2 rows x 3 columns

```{r dfs2}
#| echo: TRUE
df <- data.frame(id=c(1,2), x=c("a", "b"),
                 y=c("c", "d"), z=c("e", "f"))

df
```

Sometimes we need to transform our data to analyze it more effectively.

Pay attention to the colors and the values below...

## wide vs long

![](images/original-dfs-tidy.png)

## pivot_longer() illustrated

![](https://github.com/gadenbuie/tidyexplain/raw/main/images/tidyr-pivoting.gif)

## pivot_longer()

Here's how to turn a wide dataframe into a long one.

(Why would you want to do this? Because it's easier to do group_bys and summarize and visualizations)

Version 1: With the names of the columns.

```{r left9}
#| echo: TRUE

library(tidyr)

df |> 
  pivot_longer(cols=x:z,
               names_to="key",
               values_to="val")
```

## pivot_longer()

Version 2: With the column counts (second through fourth columns, for example)

```{r left20}
#| echo: TRUE
df |> 
  pivot_longer(cols=2:4,
               names_to="key",
               values_to="val")
```

Setting up a new data frame for another example:

```{r left30}
df <- data.frame(state=c("TX", "NY", "FL"),
                 ducks=c(23, 39, 47),
                 fish=c(6,30,20),
                 birds=c(99,3,64))
```

## pivot_longer() again

Okay, let's consolidate the animals count in the columns.

```{r}
#| echo: TRUE
df
```

:::{.fragment}
```{r}
#| echo: TRUE
df |> 
  pivot_longer(cols=ducks:birds,
               names_to="animals",
               values_to="total")
```
:::

Here's why we turned the wide data to a long one: For easier math.

## pivot for math
```{r math1}
df <- data.frame(state=c("TX", "NY", "FL"),
                 ducks=c(23, 39, 47),
                 fish=c(6,30,20),
                 birds=c(99,3,64))
```

Let's look at the data again:

```{r}
#| echo: TRUE
df
```

But now I'm going to add some code to figure out the percents by state.

:::{.fragment}
```{r}
#| echo: TRUE
df |> 
  pivot_longer(cols=ducks:birds,
               names_to="animals",
               values_to="total") |> 
  group_by(state) |> 
  mutate(percent=
           round(total/sum(total)*100,1))
```
:::

Okay, but now this is hard to look at.

Let's make it go back to a wide data frame.

## pivot_wider()

```{r wider}
df_long <- df |> 
  pivot_longer(cols=ducks:birds,
               names_to="animals",
               values_to="total") |> 
  group_by(state) |> 
  mutate(percent=
           round(total/sum(total)*100,1))
```

```{r}
#| echo: TRUE
df_long
```

## pivot_wider()

:::{.fragment}
```{r}
#| echo: TRUE
df_long |> 
  pivot_wider(names_from="animals", 
              values_from="percent")
```
:::

Okay, this is ugly.

It's because the `total` column is throwing things off.

Let's get rid of it first.

:::{.fragment}
```{r}
#| echo: TRUE
df_long |> 
  select(-total) |> 
  pivot_wider(names_from="animals", 
              values_from="percent") |> 
  mutate(birds_fish_diff=
           birds-fish)
```
:::

Ok, but what if you did want to keep the total values?

## pivot_wider() more columns

```{r widera}
df_long <- df |> 
  pivot_longer(cols=ducks:birds,
               names_to="animals",
               values_to="total") |> 
  group_by(state) |> 
  mutate(percent=
           round(total/sum(total)*100,1))
```

```{r}
#| echo: TRUE

df_long
```

You can keep it with the added `c()` function.

:::{.fragment}
```{r}
#| echo: TRUE
df_long |> 
  pivot_wider(names_from="animals", 
              values_from=c("total", "percent")) 

```
:::

See how it appends the column names in front of the column name?

## Lubridate for dates {background-color="white" background-image="images/lubridate" background-size="100%" }


Let's set up a fake data set:

```{r dates2}
#| echo: TRUE
#| warning: FALSE
#| message: FALSE

library(lubridate)

df <- data.frame(First=c("Charlie", "Lucy", "Peppermint"),
                   Last=c("Brown", "van Pelt", "Patty"),
                   birthday=c("10-31-06", "2/4/2007", "June 1, 2005"))

df
```

All of these different date formats are a nightmare.

Fortunately, there is something consistent about all of them and so lubridate has a function to fix it.

:::{.fragment}
```{r}
#| echo: TRUE
df |> 
  mutate(birthday_clean=mdy(birthday))
```
:::

Other combinations:

## Reading dates

| **Order of elements in date-time**     | **Parse function** |
|----------------------------------------|----------------|
| year, month, day                       | `ymd()`          |
| year, day, month                       | `ydm()`          |
| month, day, year                       | `mdy()`          |
| day, month, year                       | `dmy()`          |
| hour, minute                           | `hm()`           |
| hour, minute, second                   | `hms()`          |
| year, month, day, hour, minute, second | `ymd_hms()`      |

If you wanted to pull out specific parts of a date or time:

## Accessing date parts

| **Date component** | **Function**  |
|----------------|-----------|
| Year           | `year()`    |
| Month          | `month()`   |
| Week           | `week()`    |
| Day of year    | `yday()`    |
| Day of month   | `mday()`    |
| Day of week    | `wday()`    |
| Hour           | `hour()`    |
| Minute         | `minute()`  |
| Second         | `ymd_hms()` |
| Time zone      | `ymd_hms()` |

## Lubridate in action

```{r dates3}
#| echo: TRUE
df
```

:::{.fragment}
```{r}
#| echo: TRUE
df |> 
  mutate(birthday_clean=mdy(birthday)) |> 
  mutate(month=month(birthday_clean)) |> 
  mutate(year=year(birthday_clean)) |> 
  mutate(week=week(birthday_clean))
```
:::


## Recognizing dates

![](images/lubridate_ymd.png)


# Your turn

**practice-day3-wrangling**