-
Notifications
You must be signed in to change notification settings - Fork 18
/
Chapter 01 - Practice - Data wrangling.Rmd
213 lines (146 loc) · 7.4 KB
/
Chapter 01 - Practice - Data wrangling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
---
title: "DataCamp - Introduction to the Tidyverse"
author: "[Luka Ignjatović](https://github.com/LukaIgnjatovic)"
output:
html_document:
highlight: tango
theme: united
toc: yes
toc_depth: 4
keep_md: true
md_document:
toc: true
toc_depth: 4
---
## Data wrangling
In this chapter, you'll learn to do three things with a table: filter for particular observations, arrange the observations in a desired order, and mutate to add or change a column. You'll see how each of these steps lets you answer questions about your data.
<div align="middle">
> **Document:** ["Slides - Data wrangling"](./Slides/Chapter 01 - Data wrangling.pdf)
</div>
<div align="middle">
<video width="80%" controls src="./Videos/Chapter 01 - Lecture 01 - The gapminder dataset.mp4" type="video/mp4"/>
</div>
### Loading the gapminder and dplyr packages
Before you can work with the gapminder dataset, you'll need to load two R packages that contain the tools for working with it, then display the `gapminder` dataset so that you can see what it contains.
To your right, you'll see two windows inside which you can enter code: The `script.R` window, and the R Console. All of your code to solve each exercise must go inside `script.R`.
If you hit *Submit Answer*, your R script is executed and the output is shown in the R Console. DataCamp checks whether your submission is correct and gives you feedback. You can hit *Submit Answer* as often as you want. If you're stuck, you can ask for a hint or a solution.
You can use the R Console interactively by simply typing R code and hitting Enter. When you work in the console directly, your code will not be checked for correctness so it is a great way to experiment and explore.
#### Instructions
* *Use the* `library()` *function to load the* `dplyr` *package, just like we've loaded the* `gapminder` *package for you.*
* *Type* `gapminder`*, on its own line, to look at the gapminder dataset.*
```{r}
# Load the gapminder package
library(gapminder)
# Load the dplyr package
library(dplyr)
# Look at the gapminder dataset
gapminder
```
**Great job! Notice that you can see the gapminder dataset in the console output on the lower right. This is called 'printing' a dataset.**
### Understanding a data frame
Now that you've loaded the `gapminder` dataset, you can start examining and understanding it.
We've already loaded the `gapminder` and `dplyr` packages. Type `gapminder` in your R terminal, to the right, to display the object.
How many observations (rows) are in the dataset?
*Possible answers:*
* **1704**
* *6*
* *1694*
* *1952*
**Correct!**
<div align="middle">
<video width="80%" controls src="./Videos/Chapter 01 - Lecture 02 - The filter verb.mp4" type="video/mp4"/>
</div>
### Filtering for one year
The `filter` verb extracts particular observations based on a condition. In this exercise you'll filter for observations from a particular year.
#### Instructions
*Add a* `filter()` *line after the pipe (*`%>%`*) to extract only the observations from the year 1957. Remember that you use* `==` *to compare two values.*
```{r}
library(gapminder)
library(dplyr)
# Filter the gapminder dataset for the year 1957
gapminder %>%
filter(year == 1957)
```
**That's right! Notice that all the observations in the output have the year 1957.**
### Filtering for one country and one year
You can also use the `filter()` verb to set two conditions, which could retrieve a single observation.
Just like in the last exercise, you can do this in two lines of code, starting with `gapminder %>%` and having the `filter()` on the second line. Keeping one verb on each line helps keep the code readable. Note that each time, you'll put the pipe `%>%` at the end of the first line (like `gapminder %>%`); putting the pipe at the beginning of the second line will throw an error.
#### Instructions
*Filter the* `gapminder` *data to retrieve only the observation from China in the year 2002.*
```{r}
library(gapminder)
library(dplyr)
# Filter for China in 2002
gapminder %>%
filter(country == "China", year == 2002)
```
**Good work! This is a useful way to grab a single observation you're interested in.**
<div align="middle">
<video width="80%" controls src="./Videos/Chapter 01 - Lecture 03 - The arrange verb.mp4" type="video/mp4"/>
</div>
### Arranging observations by life expectancy
You use `arrange()` to sort observations in ascending or descending order of a particular variable. In this case, you'll sort the dataset based on the `lifeExp` variable.
#### Instructions
* *Sort the* `gapminder` *dataset in ascending order of life expectancy (*`lifeExp`*).*
* *Sort the* `gapminder` *dataset in descending order of life expectancy.*
```{r}
library(gapminder)
library(dplyr)
# Sort in ascending order of lifeExp
gapminder %>%
arrange(lifeExp)
# Sort in descending order of lifeExp
gapminder %>%
arrange(desc(lifeExp))
```
**That's right! Take a look at the countries with the highest and lowest life expectancy- is it similar to what you expected?**
### Filtering and arranging
You'll often need to use the pipe operator (`%>%`) to combine multiple dplyr verbs in a row. In this case, you'll combine a `filter()` with an `arrange()` to find the highest population countries in a particular year.
#### Instructions
*Use* `filter()` *to extract observations from just the year 1957, then use* `arrange()` *to sort in descending order of population (*`pop`*).*
```{r}
library(gapminder)
library(dplyr)
# Filter for the year 1957, then arrange in descending order of population
gapminder %>%
filter(year == 1957) %>%
arrange(desc(pop))
```
**Great work! A lot of the exercises in this course will involve combining multiple steps with the** `%>%` **operator.**
<div align="middle">
<video width="80%" controls src="./Videos/Chapter 01 - Lecture 04 - The mutate verb.mp4" type="video/mp4"/>
</div>
### Using mutate to change or create a column
Suppose we want life expectancy to be measured in months instead of years: you'd have to multiply the existing value by 12. You can use the `mutate()` verb to change this column, or to create a new column that's calculated this way.
#### Instructions
* *Use* `mutate()` *to change the existing* `lifeExp` *column, by multiplying it by 12:* `12 * lifeExp`*.*
* *Use* `mutate()` *to add a new column, called* `lifeExpMonths`*, calculated as* `12 * lifeExp`*.*
```{r}
library(gapminder)
library(dplyr)
# Use mutate to change lifeExp to be in months
gapminder %>%
mutate(lifeExp = lifeExp * 12)
# Use mutate to create a new column called lifeExpMonths
gapminder %>%
mutate(lifeExpMonths = lifeExp * 12)
```
**That's right!**
### Combining filter, mutate, and arrange
In this exercise, you'll combine all three of the verbs you've learned in this chapter, to find the countries with the highest life expectancy, in months, in the year 2007.
#### Instructions
*In one sequence of pipes on the* `gapminder` *dataset:*
* `filter()` *for observations from the year 2007,*
* `mutate()` *to create a column* `lifeExpMonths`*, calculated as* `12 * lifeExp`*, and*
* `arrange()` *in descending order of that new column.*
```{r}
library(gapminder)
library(dplyr)
# Filter, mutate, and arrange the gapminder dataset
gapminder %>%
filter(year == 2007) %>%
mutate(lifeExpMonths = lifeExp * 12) %>%
arrange(desc(lifeExpMonths))
```
**Great work! Notice how you can combine several dplyr operations to answer a more complicated question like this.**
**You have finished the chapter "Data wrangling"!**