Workshop1.Rmd

---
title: "Scientific Data Analysis 6003ESC"
subtitle: "Tutorial for Workshop 1"
author: "Dr. Ido Bar"
date: "04/03/2021"
output: 
    html_document:
#      css: "style/style.css"
      toc: true
      toc_float: true
      toc_depth: 3
      highlight: pygments
      number_sections: false
      code_folding: hide
---

```{r setup, include=FALSE}
pacman::p_load(htmltools, knitr, tidyverse, captioner, fontawesome)
pacman::p_load_gh("gadenbuie/tweetrmd")
# pacman::p_load_gh("mitchelloharawild/icons", update = FALSE)
knitr::opts_chunk$set(echo = TRUE, fig.align='center')
# setup figure and table captions
figs <- captioner(prefix="Figure")
tbls <- captioner(prefix="Table")
figs(name="ecocloud_dash", "EcoCloud dashboard screenshot.")
figs(name="binder", "Binder launch screenshot.")
figs(name="binder_export", "Download files from Binder/RStudio screenshot.")
figs(name="jupyter_dash", "Jupyter dashboard screenshot.")
figs(name="rstudio_project", "Create a new project in RStudio screenshots.")
figs(name="tidyverse_workflow", "An example of a data analysis workflow using packages from the Tidyverse (credit to [The Centre for Statistics in Ecology, the Environment and Conservation, University of Cape Town](http://www.seec.uct.ac.za/r-tidyverse)).")
figs(name="asia_lifexpct1","Life expectancy by years in Asian countries (first try).")
figs(name="asia_lifexpct2","Life expectancy by years in Asian countries (added line graph).")
figs(name="asia_lifexpct3","Life expectancy by years in Asian countries (added line graph coloured by country).")
figs(name="asia_lifexpct4","Life expectancy by years in Asian countries (beautify the plot with themes, colour palettes and labels).") 
figs(name="gdp_life","Relationship between GDP per capita and life expectancy by continent")
figs(name="gdp_life_log","Relationship between GDP per capita and life expectancy by continent (log-scaled X-axis")
figs(name="ggplot2_layers","A visualisation of the layer concept in 'ggplot2' package (starting from bottom up, credit to [Coding Club](https://ourcodingclub.github.io/tutorials/dataviz-beautification-synthesis/#distributions)).")
figs(name="GOT_palette", "An example of a 'ggplot2' theme inspired by Game of Thrones ([tvthemes package](https://github.com/Ryo-N7/tvthemes))")
```

```{js logo-js, echo=FALSE}
$(document).ready(function() {
  $('#header').parent().prepend('<div id=\"Griffith logo\"><img src=\"https://www.griffith.edu.au/__data/assets/image/0018/653121/Griffith_Full_Logo_scaled.png\" style=\"position:absolute; top:0; right:0; padding:20px; height:120px\"></div>');
  $('#header').css('margin-right', '120px')
});
```

## Instructions

The following tutorial will let you reproduce the plots that we created at the lecture using R.  
Please read carefully and follow the steps. Wherever you see the <kbd>Code</kbd> icon on the right you can click on it to see the actual code used in that section (see a simple example below this paragraph). You are more than welcome to try it yourself before checking the code, but there's also the option to show or hide all code blocks at the very top right of this document.  
Good luck!

```{r code_example, eval=TRUE}
print("Hello World!")
```

## `r  fa("r-project", fill = "#384CB7")`

R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.

### RStudio

RStudio is a set of integrated tools designed to help you be more productive with R. It includes a console, syntax-highlighting editor that supports direct code execution, and a variety of robust tools for plotting, viewing history, debugging and managing your workspace. It requires R to be installed prior to be able to send commands to the interpreter.

### Using R and RStudio from Cloud services
If we want to keep things simple (for this course) or we would like to use R on shared computers, where we can't install software, we can run R and Rstudio through a web client that is hosted on a remote server.  
We will use the [Binder](https://mybinder.org/) service, which is free, easy to use and can be launched from a single GitHub repository (more about this in the workshop).  
See the appendix for more details how you can run R and RStudio on [EcoCloud](https://ecocloud.org.au/), another cloud-based service free for Griffith students.

#### Running R and RStudio on Binder
Using Binder is as simple as clicking on the Binder badge - [![Launch Rstudio Binder](http://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/IdoBar/6003ESC_Workshop1/main?urlpath=rstudio){target="_blank"}.  
Alternatively, you can navigate to the [Binder](https://mybinder.org/){target="_blank"} homepage and enter the URL of this tutorial [GitHub repository](https://github.com/IdoBar/6003ESC_Workshop1.git){target="_blank"} `https://github.com/IdoBar/6003ESC_Workshop1.git` and click on the **launch** buton (see screenshot in `r figs(name="binder",display="cite")` below).
 
```{r binder, echo=FALSE, fig.cap=figs(name="binder"), out.width = '100%'}
knitr::include_graphics("figs/Screenshot_The_Binder_Project.png")
```

Now be patient while the environment is loading...  
You should now see in your web browser an RStudio interface (if you got to the Jupyter page, click on new --> RStudio) and are ready to start working in R in "The Cloud"! `r  fa("cloud", fill ="#5599FF")``r  fa("cloud", fill ="#5599FF")``r  fa("cloud", fill ="#5599FF")`

##### Downloading files from Binder
After we've finished working on Binder we would like to download the R script that we wrote and any output files (summary tables and figures). We can access those files by using the `files` tab in RStudio (bottom right pane).  
Select the files/folders that you would like to download and click on `r  fa("cog", fill ="#4383A4")`More `r fa("long-arrow-alt-right")` Export... (see screenshot in `r figs(name="binder_export",display="cite")` below) to save the file on your computer.  

```{r binder_export, echo=FALSE, fig.cap=figs(name="binder_export"), out.width = '75%'}
knitr::include_graphics("figs/Rstudio_export_screenshot.png")
```


### Installing R and RStudio locally
Alternatively, both R and RStudio can be installed locally on any operating system (`r  fa("apple")`, `r  fa("windows",  fill ="#5599FF")`, or `r  fa("linux", )`, see a [detailed tutorial](https://www.datacamp.com/community/tutorials/installing-R-windows-mac-ubuntu%20){target="_blank"}), which provides complete control over the installation, added packages and can be used anywhere without requiring internet connection. This is recommended for anyone who is planning to do any future serious analysis in R (including the assignments in this course).

#### Project Management with RStudio

Regardless whether we installed R and RStudio locally or we use the Binder service, we interact with R through the RStudio integrated development environment (IDE), which let's us easily write our code, test it, see our files, objects in memory and plots that we produce. If we run the analysis locally, it is highly recommended to use RStudio's built-in Projects to contain our analysis in its own folder with all the files required. That will also help in reading data files and writing results and figures back to the hard drive.   

>1. Start RStudio by clicking on its icon.  
2. Start a new project by selecting "File --> New Project" or clicking on the "New Project" icon (under "Edit" in the taskbar).  
3. Select "New Directory --> New Project" and then enter "Workshop1" in the Directory name text box and browse to the "wrokspace" folder to create the project folder in (see screenshots A-D in `r figs(name="rstudio_project",display="cite")` below)
 
```{r rstudio_proj, echo=FALSE, fig.cap=figs(name="rstudio_project"), out.width = '100%'}
knitr::include_graphics("figs/RStudio_create_project.png")
```

>4. Create a new R script file by selecting "File --> New File --> R Script" or clicking on the "New File" icon (under the "File" in the taskbar)   
5. Save the script file by select "File --> Save" or pressing <kbd>Ctrl</kbd>+<kbd>s</kbd> or clicking on the floppy disk icon on the top bar  


### Install Packages
R can be extended with additional functionality by installing external packages (usually hosted at the Comprehensive R Archive Network repository -- [CRAN](https://cran.r-project.org/web/packages/index.html){target="_blank"}). To find which packages can be useful for your type analysis, use search engine (Google is your friend) and the available [Task Views on CRAN](https://cran.r-project.org/web/views/){target="_blank"}, which provide some guidance which packages on CRAN are relevant for tasks related to a certain topic.  
For our current analysis we will use some packages from the [tidyverse](https://www.tidyverse.org/){target="_blank"} -- a suite of packages designed to assist in data analysis, from reading data from multiple source (`readr`, `readxl` packages), through data wrangling and cleanup (such as `dplyr`, `tidyr`) and finally visualisation (`ggplot2`), as can be seen in `r figs(name="tidyverse_workflow", display="cite")`.  
_if tidyverse is failing to install then try [bplyr](https://github.com/yonicd/bplyr){target="_blank"}_

```{r tidyverse_wf, echo=FALSE, fig.cap=figs(name="tidyverse_workflow"), out.width = '80%'}
knitr::include_graphics("figs/tidy_workflow.png")
```

These packages are already pre-installed in Binder, but they will need to be installed if you chose to run the analysis locally.  
To install these packages, we use the `install.packages('package')` command, please note that the package name need to be quoted and that we only need to be perform it once, or when we want or need to update the package.  Once the package was installed, we can load its functions using the `library(package)` command. _Note that in this case we use the package name without quotes!_.

```{r install_packages, eval=FALSE}
# install required packages - needed only once! (comment with a # after first use)
install.packages("tidyverse")
install.packages("here")
# It seems like there is a probelm with some versions of the `tibble` package, which we can overcome by installing the most recent development version
install.packages("remotes")
remotes::install_github("tidyverse/tibble")
```

## Analyse Data in R
### Read Data
Now that we've got RStudio up and running and our packages installed, we can load them and read data into R from our local computer or from web locations using dedicated functions specific to the file type (`.csv`, `.txt`, `.xlsx`, etc.).  
We need to provide a variable name that will store the data and remember that R holds its variables in the computers RAM (which will be the limiting factor in terms of size of data that can be handled).   
We will use the `read_csv()` command/function from the `readr` package (part of the `tidyverse`) to load the data from a file hosted on the web into a variable of type **data frame** (table).

```{r read_data, message=FALSE, warning=FALSE}
# load required packages
library(tidyverse)
# Read data straight from the web
gapminder <- read_csv("https://tinyurl.com/gapdata")
```

You can also read the data from a local folder (in this case the file `gapminder.csv` in the `data` folder)

```{r read_local_data, message=FALSE, warning=FALSE, eval=FALSE}
# or from a local folder
gapminder <- read_csv("data/gapdata.csv")
```

### Data Exploration

we can explore the data by clicking on the table icon next to the variable name in RStudio "Environment" tab (top right pane), but that's not a good practice because it won't work with large data sets.
We can use built-in functions for a brief exploration (such as `head()` to show the first 10 rows of the data and `str()` for the type of data in each column):

```{r explore_data}
#explore the data frame
head(gapminder) # show first 10 rows of the data and typr of variab;es
summary(gapminder) # brief summary statistics
```

## Plotting Data

We will use the `ggplot2` package, which stands for "Grammar of Graphics" and breaks up graphs into semantic components such as scales and layers that can be added to the plotting area, as can be seen in `r figs(name="ggplot2_layers", display="cite")`.  
There are lots of online tutorials on the use of `ggplot2`, one of my favourites is the [Beautiful plotting in R: A ggplot2 cheatsheet](http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/){target="_blank"}.  

```{r ggplot_layers, echo=FALSE, fig.cap=figs(name="ggplot2_layers"), out.width = '80%'}
knitr::include_graphics("https://ourcodingclub.github.io/assets/img/tutorials/dataviz-beautification/DL_datavis1_layers.png")
```

### Line plot

In this exercise we will plot the life expectancy throughout the years for the first 12 Asian countries (alphabetically) as a line graph.  We use the `%>%` notation (can be easily entered with <kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>m</kbd> keyboard shortcut) to "pipe" our data from one processing step to another without having to save it as intermediate variables.  In this example we take `gapminder` data frame, filter it by `coontinent=="Asia"` (_notice the double `==` sign and that the match is case-sensitive!_), then select the `country` column, identify the unique country names with `distinct()`, select the first 12 countries with the `slice(start:end)` indexing notation and finally extract the country column as a vector. We then create a subset of the data using these countries as our reference to match against when we filter by country name (with the `%in%` notation).  

Once we have a subset of the data we can use it as input for plotting with `ggplot()` function. We need to specify to the function which data to operate on and how to map plotting features (such as X and Y axes). We can (and will) map other plotting features to variables (columns) in our data later on.  
Let's see what we're getting when running the following code:

```{r expct_per_year1, include=TRUE, echo=TRUE, fig.cap=figs(name="asia_lifexpct1"), out.width = '80%'}
# create a vector of the first 12 Asian countries
first12_Asian_countries <- gapminder %>% filter(continent=="Asia") %>% 
  select(country) %>% distinct() %>% 
  slice(1:12) %>% .$country 
# use the vector to subset the data to include only these countries
gap_asia <- gapminder %>% filter(continent=="Asia", country %in% first12_Asian_countries) 
# create the plot 
ggplot(data = gap_asia, mapping = aes(x = year, y = lifeExp)) 
```

We got an empty canvas in `r figs(name="asia_lifexpct1", display="cite")`, but it had been sized to fit our range of data on the X and Y axes, that will be our first layer of the plot (like the bottom one in `r figs(name="ggplot2_layers", display="cite")`).  

Now, let's use the `+` sign to add additional layers to the plotting canvas, starting with the type of graph we want to plot (in this case a line graph), which can be achieved by adding a `geom_line()` function, which stands for "line geometry" (in a similar way we can add other plotting geometries, such as `geom_point()`, `geom_bar()`, etc.).  

```{r expct_per_year2, include=TRUE, echo=TRUE, fig.cap=figs(name="asia_lifexpct2"), out.width = '80%'}
# create the plot
ggplot(data = gap_asia, mapping = aes(x = year, y = lifeExp)) + 
  geom_line(size=1) 
  
```

What just happened (`r figs(name="asia_lifexpct2", display="cite")`)? Can you guess?  

`ggplot2` is just doing what we asked it to do, it plots all the year-lifeExp combinations and connects them with a single line, regardless of which country the data came from.
What we actually want is to connect the dots of each country separately for the graph to make sense! 

Let's assign specific colour (of lines/bar/markers) to each country in our data, this is done through the `aes()` function, which can go either in the initial `aes()` function within the `ggplot()` function, or added later inside the particular geometry we're adding (inside `geom_line()` in this case). `ggplot2` is smart enough to know that if we map a colour to each country, then the data from each country should be grouped and plotted as a separate line.  

```{r expct_per_year3, include=TRUE, echo=TRUE, fig.cap=figs(name="asia_lifexpct3"), out.width = '80%'}
# create the plot
ggplot(data = gap_asia, mapping = aes(x = year, y = lifeExp, color=country)) + 
  geom_line(size=1) 
```

Hooray! this looks much better (`r figs(name="asia_lifexpct3", display="cite")`).  
We just need a few more final touches to make it "publication-ready" (in the same order their layers are added below):

* Using a custom theme (black and white) with increased font size
* Using a custom colour palette (requires the `RColorBrewer` package which is an integral part of the `tidverse`)
* Changing the resolution of the years in the X-axis
* Adding a title to the plot (though the caption below should suffice in scientific publications) and correcting the X and Y labels and legend title (note that we actually change the label of `color` aesthetic for that). 

The final result can be seen in `r figs(name="asia_lifexpct4", display="cite")`.  

```{r expct_per_year4, include=TRUE, echo=TRUE, fig.cap=figs(name="asia_lifexpct4"), out.width = '80%'}
ggplot(data = gap_asia, mapping = aes(x = year, y = lifeExp, color=country)) + 
  geom_line(size=1) + theme_bw(14) + # geom_smooth(method="lm", size=1.5) + 
  scale_color_brewer(palette = "Paired") + 
  scale_x_continuous(breaks = seq(min(gap_asia$year), max(gap_asia$year), by = 10)) +
  labs(title = "Life expectancy by years in Asian countries", x = "Year", y = "Life Expectancy (years)", color="Country")
```

We would like to save it to a sub-folder following our "best practice" rule of separating raw data from output files from analysis scripts. We can create a sub-folder named `output` using the file explorer in Windows or the one built-in in RStudio (first tab in the bottom right pane). We can of course do it with an R command `dir.create()` as demonstrated below.  
Once the output folder is created, we use the `ggsave()` function to save our current plot.

```{r save_plot}
# create output folder
dir.create("./output", showWarnings = FALSE)
# save plot to file
ggsave(here::here("output/Asia_12_countries_lifeExp_through_years.pdf"), width=10, height=8)
```

### X-Y plot

This time we'll look at the relationship between GDP per capita and life expectancy using `geom_point()` for X-Y scatter graph and we'll colour the points by continent. We'll already apply the theme, custom colour palette and labels.  

```{r gdp_per_expct1, include=TRUE, echo=TRUE, fig.cap=figs(name="gdp_life"), out.width = '80%'}
# create plot
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color=continent)) +
  geom_point(size=3.5) + theme_bw(14) + # geom_smooth(method="lm", size=1.5) + 
  scale_color_brewer(palette = "Set1") +
  labs(title = "GDP per capita by life expectancy", x = "GDP per capita", y = "life expectancy", color="Continent")
```

As we can see in `r figs(name="gdp_life", display="cite")`, the markers overlay each other and it's hard understanding where there's a high density of data points and if they cover others behind them. To fix it, we'll make the points semi-transparent using `alpha=0.5` argument in `geom_point()`.  
Another issue is that the graph seems exponential, which makes it hard to see a trend and interpret the results. We can log-transform the data (which variable, `gdpPercap` or `year`?) to see if it will linearise, or we can use a visualisation "trick" and log-transform just the axis.  
Check out `r figs(name="gdp_life_log", display="cite")`, the plot looks much better! 
We'll save this plot to the `output` folder as well.

```{r gdp_per_expct2, include=TRUE, echo=TRUE, fig.cap=figs(name="gdp_life_log"), out.width = '80%'}
# create plot
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color=continent)) +
  geom_point(alpha=0.5, size=3.5) + theme_bw(18) + # geom_smooth(method="lm", size=1.5) + 
  scale_x_log10() + scale_color_brewer(palette = "Set1") +
  labs(title = "GDP per capita by life expectancy by continent", x = "GDP per capita", y = "life expectancy", color="Continent")
# save plot to file
ggsave(here::here("output/gdpPercap_vs_lifeExp_by_continent.pdf"), width=10, height=8)
```

 
Now for the hard part, what can we learn from the data? What other plots can we generate to help us understand trends from the data and gaps between countries?  
We can discuss these in details in class...  

Please contact me at i.bar@griffith.edu.au for any questions or comments.

## Additional Resources

* R WTF: What They Forgot to tell you about R, tips and best practices for working in R ([link](https://rstats.wtf/){target="_blank"})  
* R for Reproducible Scientific Analysis: A Software Carpentry course -- https://swcarpentry.github.io/r-novice-gapminder/ (offered regularly here at GU, follow [\@hackyhourGU](https://twitter.com/hackyhourGU){target="_blank"} on Twitter)  
* From Data to Viz leads you to the most appropriate graph for your data -- https://www.data-to-viz.com/  
* [Coding Club](https://ourcodingclub.github.io/){target="_blank"}: A Positive Peer-Learning Community (ecology and environmental science students and researchers from the University of Edinburgh) -- https://ourcodingclub.github.io/  
* Beautiful plotting in R: A ggplot2 cheatsheet -- provides lots of recipes and easy fixes for common tasks when creating `ggplot2` plots [link](http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/){target="_blank"}   
* Plot anything with ggplot2 - [YouTube workshops](https://www.youtube.com/watch?v=h29g21z0a68){target="_blank"}  
<iframe width="560" height="315" src="https://www.youtube.com/embed/h29g21z0a68" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>  
* Useful cheatsheets can be found on the [RStudio Cheatsheets website](https://rstudio.com/resources/cheatsheets/){target="_blank"} (start with the most useful/basic ones relevant to this course - RStudio IDE, Data Import, Data Transformation and Data Visualization)  
* [paletteer](https://github.com/EmilHvitfeldt/paletteer){target="_blank"} -- An R package that let's you use colour palette from a huge range of available packages (see examples of available palettes [here](https://github.com/EmilHvitfeldt/r-color-palettes){target="_blank"} and you can even use [TV-themed palettes](https://github.com/Ryo-N7/tvthemes){target="_blank"}) 

```{r bluey-tweet}
include_tweet("https://twitter.com/DrIdoBar/status/1272919164252897286")
```


### Appendix - Running R and RStudio on EcoCloud
[EcoCloud](https://ecocloud.org.au/){target="_blank"} platform that provides this service for all students and staff of participating Australian and NZ universities and government agencies (including Griffith University). A guide on using RStudio within EcoCloud is available [here](https://support.ecocloud.org.au/support/solutions/articles/6000200390-using-rstudio){target="_blank"}.

>1. Please navigate to [EcoCloud](https://ecocloud.org.au/){target="_blank"} and follow the prompts to login using your Griffith credentials (AAF login)  
2. In your dashboard, click on the orange "Launch notebook server" button in the middle of the screen and select "Rstudio notebook" in the popup window and click on the green "Launch" button (see screenshot in `r figs(name="ecocloud_dash",display="cite")` below)  

```{r ecocloud_dash, echo=FALSE, fig.cap=figs(name="ecocloud_dash"), out.width = '80%'}
knitr::include_graphics("figs/ecocloud_Dashboard.png")
```

>3. Once the server is running, click on the green "Open" button, a new browser tab will open with the JupyterLab dashboard  
3b. _optional_ Whenever a new server is started on EcoCloud, all previous RStudio settings and installed packages get reset back to defaults.  To overcome this behaviour and make it more user-friendly for long-term and recurring uses, start a new terminal from the JupyterLab dashboard (see bottom of `r figs(name="jupyter_dash",display="cite")`) and type the following Bash (Linux) commands:

```{bash setup_rstudio, eval=FALSE}
mkdir -p ~/workspace/.rstudio/library
echo ".libPaths('~/workspace/.rstudio/library')" >> .Rprofile
ln -s ~/workspace/.rstudio ~/
```
_The last 2 commands need to be run every time the server restarts_

>4. Click on the RStudio logo in the JupyterLab dashboard (see screenshot in `r figs(name="jupyter_dash",display="cite")` below)

```{r jupyter_dash, echo=FALSE, fig.cap=figs(name="jupyter_dash"), out.width = '80%'}
knitr::include_graphics("figs/Screenshot_JupyterLab.png")
```

>5. You are now ready to start working in R in the cloud!!