---
output: github_document
always_allow_html: true
author: Joachim Gassen
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE, warning = FALSE, message = FALSE,
  cache = TRUE, fig.path = "man/figures/", fig.align = "center"
)
library(tidyverse)
library(knitr)
```
# Download, Tidy and Visualize Covid-19 Related Data
## Disclaimer
I am an applied economist studying the economic effects of regulatory interventions on corporate transparency and leading the Open Science Data Center (OSDC) of the [TRR 266 Accounting for Transparency](https://accounting-for-transparency.de), which is funded by the German Science Foundation (DFG).
The OSDC aims to make research transparent so that others can contribute and collaborate.
This is the spirit that motivated me to set up this package. I am clearly no epidemiologist, so I will abstain from providing infrastructure for analyzing the spread of the disease or estimating the effects of non-pharmaceutical interventions. Instead, this package is meant to facilitate the use of various Covid-19 related data sources, with a special focus on non-pharmaceutical interventions.
In that way, I hope that it might be helpful for others who are interested in doing research on the Covid-19 pandemic, while promoting the benefits of open science.
## The Data
As of `r format(Sys.Date(), "%B %d, %Y")`, these are the included data sources:
```{r DataSources}
data(tidycovid19_data_sources)
df <- tidycovid19_data_sources %>% select(-id)
df$description[nrow(df)] <- paste(
  "The merged dataset provided by the tidycovid19 R package. Contains data",
  "from all sources mentioned above."
)
kable(df) %>% kableExtra::kable_styling()
```
## How to Use the Package
The idea is simple: load the data using the functions listed above and code away. For example:
```{r Example}
# Suggestion by AndreaPi (issue #19)
library(tidyverse)
library(tidycovid19)
library(zoo)
df <- download_merged_data(cached = TRUE, silent = TRUE)
df %>%
  filter(iso3c == "USA") %>%
  mutate(
    # Daily new cases from the cumulative count
    new_cases = confirmed - lag(confirmed),
    # Trailing 7-day average; `fill = NA` replaces the deprecated `na.pad = TRUE`
    ave_new_cases = rollmean(new_cases, 7, fill = NA, align = "right")
  ) %>%
  filter(!is.na(new_cases), !is.na(ave_new_cases)) %>%
  ggplot(aes(x = date)) +
  geom_bar(aes(y = new_cases), stat = "identity", fill = "lightblue") +
  geom_line(aes(y = ave_new_cases), color = "red") +
  theme_minimal()
```
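If you want the same rolling average for several countries at once, the differences and averages need to be computed within each country. Below is a minimal sketch of that idea. It uses a made-up toy data frame (`toy`, with the hypothetical country codes `AAA` and `BBB`) instead of the downloaded data, so the numbers are purely illustrative. Only the column names `iso3c`, `date` and `confirmed` are taken from the merged data.

```{r ExampleByCountry}
library(dplyr)
library(zoo)

# Toy country-day data; the values are made up and not real case counts
toy <- tibble(
  iso3c = rep(c("AAA", "BBB"), each = 8),
  date = rep(as.Date("2020-03-01") + 0:7, times = 2),
  confirmed = c(cumsum(rep(10, 8)), cumsum(rep(2, 8)))
)

toy_avg <- toy %>%
  group_by(iso3c) %>%
  arrange(date, .by_group = TRUE) %>%
  # Differences are taken within each country because of group_by()
  mutate(new_cases = confirmed - lag(confirmed)) %>%
  # Drop the first day of each country, which has no previous value
  filter(!is.na(new_cases)) %>%
  mutate(
    ave_new_cases = rollmean(new_cases, 7, fill = NA, align = "right")
  ) %>%
  ungroup()

toy_avg %>% filter(!is.na(ave_new_cases))
```

Without the `group_by()`, the first day of each country would be compared with the last day of the previous one. In the toy data each country has exactly seven daily differences, so only its last day gets a 7-day average.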
The data comes with two metadata sets that describe the data. The data frame `tidycovid19_data_sources` listed above provides short descriptions and links for each data source used by the package. The data frame `tidycovid19_variable_definitions` provides variable definitions for each variable included in the merged country-day data frame provided by `download_merged_data()`:
```{r VarDefs}
data(tidycovid19_variable_definitions)
df <- tidycovid19_variable_definitions %>%
  select(var_name, var_def)
kable(df) %>% kableExtra::kable_styling()
```
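As the definitions are stored in a regular data frame, you can also look up individual variables instead of printing the whole table. A small sketch, assuming that the variables used in the example above (`iso3c`, `date` and `confirmed`) are listed in `var_name`:

```{r VarDefLookup}
library(dplyr)
library(tidycovid19)

data(tidycovid19_variable_definitions)

# Definitions for the variables used in the rolling-average example
tidycovid19_variable_definitions %>%
  filter(var_name %in% c("iso3c", "date", "confirmed")) %>%
  select(var_name, var_def)
```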
More coding examples are in the file `code_examples.R` in the main directory. Explore and reuse!
## Visualization
The focus of the package lies on data collection and not on visualization as there are already many great tools floating around. Regardless, there are three functions that allow you to visualize some of the key data that the package provides.
### Plot Covid-19 Spread over Event Time
The function `plot_covid19_spread()` allows you to quickly visualize the spread of the virus in relation to governmental intervention measures. It is inspired by the insightful displays created by John Burn-Murdoch from the Financial Times and offers various customization options.
```{r DemoPlot}
#remotes::install_github("joachim-gassen/tidycovid19")
library(tidycovid19)
merged <- download_merged_data(cached = TRUE, silent = TRUE)
plot_covid19_spread(
  merged, highlight = c("ITA", "ESP", "GBR", "FRA", "DEU", "USA", "BRA", "MEX"),
  intervention = "lockdown", edate_cutoff = 330
)
```
### Plot Covid-19 Stripes
Another option to visualize the spread of Covid-19, in particular if you want to compare many countries, is to produce a stripes-based visualization. Meet the Covid-19 stripes:
```{r Covid19Stripes, fig.height=12}
plot_covid19_stripes()
```
Again, the function comes with many options. As an example, you can easily switch to a per capita display:
```{r Covid19StripesPerCapita, fig.height=12}
plot_covid19_stripes(
  per_capita = TRUE,
  population_cutoff = TRUE,
  sort_countries = "magnitude"
)
```
Or single out countries that you are interested in:
```{r Covid19StripesSelCountries}
plot_covid19_stripes(
  type = "confirmed",
  countries = c("ITA", "ESP", "FRA", "GBR", "DEU", "USA", "BRA", "MEX"),
  sort_countries = "countries"
)
```
### Map Covid-19
Finally, I also included a basic mapping function. `map_covid19()` allows you to map the spread of the virus at a certain date both world-wide ...
```{r MapWorldWide}
map_covid19(merged, cumulative = TRUE, per_capita = TRUE)
```
... or for certain regions.
```{r MapEurope}
map_covid19(merged, type = "deaths", cumulative = TRUE, per_capita = TRUE, region = "Europe")
```
If you have enough time (it takes several minutes), you can also create an animation to visualize the spread of the virus over time.
```{r AnimatedMapWorldWide, eval = FALSE}
df <- merged %>% filter(!is.na(confirmed))
map_covid19(
  df, type = "confirmed", per_capita = TRUE, dates = unique(df$date)
)
```
Again, you can customize the data that you want to plot and of course you can also modify the plot itself by using normal `ggplot` syntax.
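As a rough sketch of what modifying the plot could look like, the chunk below adds a caption and moves the legend. It assumes that `map_covid19()` returns a regular `ggplot` object when mapping a single date. The caption text is just a placeholder.

```{r MapCustomized}
library(ggplot2)
library(tidycovid19)

merged <- download_merged_data(cached = TRUE, silent = TRUE)

# Assumption: for a single date the function returns a ggplot object
p <- map_covid19(
  merged, type = "deaths", cumulative = TRUE, per_capita = TRUE,
  region = "Europe"
)

# Standard ggplot2 layers can then be added on top
p +
  labs(caption = "Data: merged data from the tidycovid19 package") +
  theme(legend.position = "bottom")
```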
## Shiny App
Sorry, I could not resist. The options of `plot_covid19_spread()` make the
implementation of a shiny app a little bit too tempting to pass up. The command
`shiny_covid19_spread()` starts the app. Click on the image to be taken to the
online app. You can use it to customize your `plot_covid19_spread()`
display as it allows copying the plot generating code to the clipboard,
thanks to the fine [{rclipboard}](https://github.com/sbihorel/rclipboard)
package. You can now also customize the app by providing `plot_covid19_spread()`
options as a list to the `plot_options` parameter.
<center>
[![Screenshot of `shiny_covid19_spread()` app](man/figures/shiny_covid19_spread.png)](https://jgassen.shinyapps.io/tidycovid19/)
</center>
As the shinyapps.io server has had some issues with exhausting connections, you can also
use this [alternative server](https://trr266.wiwi.hu-berlin.de/shiny/tidycovid19/).
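To start the app locally with your own defaults, something along the following lines should work. This is only a sketch: the option values simply reuse the `plot_covid19_spread()` arguments from the plotting example above, and `opts` is just an illustrative name.

```{r ShinyPlotOptions, eval = FALSE}
library(tidycovid19)

# Options that are handed over to plot_covid19_spread() by the app
opts <- list(
  highlight = c("ITA", "ESP", "GBR", "FRA", "DEU", "USA", "BRA", "MEX"),
  intervention = "lockdown",
  edate_cutoff = 330
)

# Starts the app; only sensible in an interactive R session
if (interactive()) shiny_covid19_spread(plot_options = opts)
```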
## Blog posts
The blog posts are mostly dated. I am leaving the links here for reference:
- [An intro blog post](https://joachim-gassen.github.io/2020/05/tidycovid19-new-data-and-doc/) providing a quick walk-through of the package.
- [A blog post](https://joachim-gassen.github.io/2021/01/vaccination-data-an-outlier-tale/) on the OWID vaccination data.
- [A blog post](https://joachim-gassen.github.io/2020/04/tidycovid19-new-viz-and-npi_lifting/) on the new visuals of the package.
- [A blog post](https://joachim-gassen.github.io/2020/04/covid19-explore-your-visualier-dof/) on the visualizer degrees of freedom that are inherent in a plot of the Covid-19 spread.
- [A blog post](https://joachim-gassen.github.io/2020/04/scrape-google-covid19-cmr-data/) on the PDF scraping of the new Google Covid-19 Community Movement Reports.
- [A somewhat dated blog post](https://joachim-gassen.github.io/2020/04/exploring-and-benchmarking-oxford-government-response-data/) comparing the ACAPS and Oxford data on governmental interventions.
- [An older blog post](https://joachim-gassen.github.io/2020/03/merge-covid-19-data-with-governmental-interventions-data/) that showcases some descriptive visuals to see what one can do with the data retrieved by this package.
## Why yet another package on Covid-19?
There are several packages that provide data and infrastructure related to Covid-19. Two prominent cases are:
- `{nCov2019}`: This [package](https://github.com/YuLab-SMU/nCov2019) has a focus on Chinese data but also contains data on other countries and regions. It contains a shiny dashboard.
- `{coronavirus}`: This [package](https://github.com/RamiKrispin/coronavirus) provides
the Johns Hopkins University CSSE dataset together with a dashboard.
Additional R related resources on Covid-19 can be found [here](https://www.statsandr.com/blog/top-r-resources-on-covid-19-coronavirus/) and [here](https://github.com/mine-cetinkaya-rundel/covid19-r).
Unlike the packages mentioned above, the key objective of the {tidycovid19} package is to provide *transparent* access to *various* data sources at the country-day level, including data on governmental interventions and on the behavioral response of the public. It does not contain any data per se. Instead, it provides functions to pull data from publicly available authoritative sources. The sources and the data are documented by additional data frames included in the package. While the combined data frame generated by `download_merged_data()` aggregates data at the country-day level, some functions also provide sub-country level data on request.
For those interested in speedy downloads, it alternatively provides the option to download the cached data from this repo (stored in the directory `cached_data`). The cached data is updated daily.
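A quick way to check what you got from the cache is to summarize the country and date coverage of the merged data. The sketch below only relies on the `iso3c` and `date` columns used earlier in this README. The summary column names are made up for illustration.

```{r CachedDownload}
library(dplyr)
library(tidycovid19)

# cached = TRUE reads the copy stored in this repo instead of pulling
# the data from the original sources
merged <- download_merged_data(cached = TRUE, silent = TRUE)

merged %>%
  summarise(
    countries = n_distinct(iso3c),
    first_day = min(date),
    last_day = max(date)
  )
```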