Instructor: Prof Michael Guerzhoy
The course website is now open
Notes taken by: Skyler Han
- Important resources:
  - Course Discord: will be used for Q&A
  - SML201: a similar course taught by the same prof
- What is needed for the course:
  - R + RStudio
  - An online version of RStudio
- Why learn R:
  - It runs faster
  - Cleaner code -> fewer bugs
- What is data science?
First, let's evaluate some R expressions in the console. The simplest expressions are numerics and strings:
42
"Hello"
R simply repeats these values
A more complex expression is something like 42 + 43. Let's try this:
42 + 43
85
R evaluated the expression and gave us the value back. We can evaluate more complex expressions too:
(45 - 43) ** 3
8
5 == (4 + 1)
TRUE
2**10 > 10**3
TRUE
# logical values
TRUE
FALSE
cat("Hi engsci!")
Hi engsci!
a <- 5
Here we assign 5 to the variable a. Note that a = 5 also works.
Conditionals let us do different things depending on the value of some prior computation:
n <- -25
if(n>=0){
cat(n)
}else{
cat(-n)
}
25
This is the same as saying
n <- -25
if(-25>=0){
cat(-25)
}else{
cat(-(-25))
}
Output: 25
Functions let us simplify a program by writing a computation once and reusing it.
Don’t Repeat Yourself -- Andy Hunt
square <- function(x){
x**2
}
square(-2)
4
my_abs <- function(x){
if(x>=0){
x
}else{
-x
}
}
cat(my_abs(-4))
4
Today's lecture covered:
- Review of functions
- Algebraic approach to understanding functions in programming
- Function stepping
- Using ChatGPT to help learn about functions
- Vectors, data frames, etc.
f <- function(x){
x**2+1 # body of the function
}
R uses a different syntax from math.
Syntax: the rules according to which sentences are constructed in a language.
Example: in English, you should say "I ate lunch" instead of "lunch I ate".
Let's look at this function
f <- function(x){
x**2 +
}
cat(f(5))
Error: unexpected '}' in
" x**2 +
}"
To solve this issue, we read the error message in the console and debug: here, the + is missing its right-hand operand before the closing brace.
Here is a slightly more complicated function
h <- function(x){
y <- 2 * x
y ** 2 - x
}
Here, we defined a local variable y to help us with the computation. To compute h(2), the process we use is:
- We substitute x <- 2
- We evaluate and substitute y <- 2 * 2 (i.e., 4)
- The value the function computes is 4 ** 2 - 2
The value of h(2) was evaluated to 14.
Note that we cannot access y outside of the function h. That makes sense: y is defined in terms of the parameter x, which might change.
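The substitution steps for h(2) can be checked directly in R. This is a minimal sketch that reuses h from the notes; the names result and y_is_visible are introduced here only for illustration, and the last check assumes that no variable called y was defined earlier in the session.

```r
h <- function(x) {
  y <- 2 * x      # y is local: it exists only while h is running
  y ** 2 - x
}

result <- h(2)    # y becomes 4, so the value is 4 ** 2 - 2
cat(result, "\n")

# After the call, no variable named y has been created outside of h
# (assuming y was not already defined earlier in the session)
y_is_visible <- "y" %in% ls()
cat(y_is_visible, "\n")
```

Typing y at the console after this would give an "object 'y' not found" error, for the same reason.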
Let's look at a function that finds the roots of quadratic equations.
The solutions of A*x^2 + B*x + C = 0 are x = (-B ± sqrt(B^2 - 4*A*C)) / (2*A).
sq.roots <- function(A,B,C){
  disc <- B**2 - 4 * A * C
  if(disc > 0){
    # The numerator needs parentheses: division binds tighter than + and -
    r1 <- (-B + sqrt(disc))/(2*A)
    r2 <- (-B - sqrt(disc))/(2*A)
    c(r1,r2)
  }else if(disc == 0){
    # The value of the last expression is the value of the function
    -B/(2*A)
  }else{
    c()
  }
}
Let's see how this function solves the equation x^2 + 2*x + 1 = 0.
Now we show the function step by step
sq.roots(1,2,1)
# Substitution
disc <- 2**2 - 4 * 1 * 1
if(disc > 0){
  r1 <- (-2 + sqrt(disc))/(2*1)
  r2 <- (-2 - sqrt(disc))/(2*1)
  c(r1,r2)
}else if(disc == 0){
  -2/(2*1)
}else{
  c()
}
##############################################
disc <- 0
if(FALSE){
  r1 <- -1
  r2 <- -1
  c(-1,-1)
}else if(TRUE){
  -1
}else{
  c()
}
##############################################
-1
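To check a root-finding function like this one, it helps to run it on one equation for each branch. The sketch below is an assumption-laden illustration rather than the course's own code: sq_roots_checked is a hypothetical name, and it writes the numerator of the quadratic formula with explicit parentheses.

```r
# Hypothetical helper: same structure as sq.roots, with the numerator
# (-B +/- sqrt(disc)) parenthesized before dividing by 2 * A
sq_roots_checked <- function(A, B, C) {
  disc <- B^2 - 4 * A * C
  if (disc > 0) {
    c((-B + sqrt(disc)) / (2 * A), (-B - sqrt(disc)) / (2 * A))
  } else if (disc == 0) {
    -B / (2 * A)
  } else {
    c()   # no real roots: an empty vector (NULL)
  }
}

two_roots <- sq_roots_checked(1, -3, 2)  # x^2 - 3x + 2 = (x - 1)(x - 2)
one_root  <- sq_roots_checked(1, 2, 1)   # x^2 + 2x + 1 = (x + 1)^2
no_roots  <- sq_roots_checked(1, 0, 1)   # x^2 + 1 has no real roots

print(two_roots)
print(one_root)
print(length(no_roots))
```

Choosing equations whose factorization is known in advance makes it easy to tell whether the returned roots are right.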
Credit: How do you compare and contrast lazy and eager evaluation in different programming languages?
Lazy evaluation, also known as call-by-need, is a technique where an expression is only evaluated when its value is needed. This means that the computation is deferred until the last possible moment, and the result is cached for future use. Lazy evaluation can avoid unnecessary work and save memory. Some programming languages that support lazy evaluation are Haskell, Scala, Clojure, R (for function arguments) and Python (through generators; Python is the language that will be covered in ESC180).
Eager evaluation, also known as call-by-value, is a technique where an expression is evaluated as soon as it is bound to a variable or passed as an argument. This means that the computation is performed upfront, and the result is stored in memory. Some programming languages that use eager evaluation are C (the language that will be covered in ESC190), Java, Ruby, and JavaScript.
We now compare these two evaluation strategies with an example:
f <- function(x){
cat("f")
x+1
}
g <- function(x){
cat("g")
x*2
}
if (f(3)>4){
g(5)
}else{
g(6)
}
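In R, f(3) is not evaluated until the if condition needs its value. The call prints f and evaluates to 4; since 4 > 4 is FALSE, only g(6) is evaluated, which prints g and evaluates to 12. Only one of the two calls to g is executed, and the output is "fg". An if statement behaves this way in eager languages as well: the condition is evaluated first, and then only one branch runs. The two strategies differ when expressions are passed as arguments to a function. A lazy language evaluates an argument only when the function body actually uses it, while an eager language evaluates every argument before the call starts. If the if above were an ordinary function that received g(5) and g(6) as arguments, an eager language would execute both calls to g, and g would be printed twice.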
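R's lazy evaluation of function arguments can be observed with a small experiment. In the sketch below, my_if is a hypothetical helper written only for this illustration; f and g are the functions from the notes. Because R evaluates an argument only when the function body uses it, the argument that is not selected is never run.

```r
f <- function(x) {
  cat("f")
  x + 1
}
g <- function(x) {
  cat("g")
  x * 2
}

# Hypothetical helper: an ordinary function that mimics if/else.
# Its arguments yes and no are evaluated only if they are used.
my_if <- function(cond, yes, no) {
  if (cond) yes else no
}

# capture.output collects what cat prints, so that we can inspect it
printed <- capture.output(value <- my_if(f(3) > 4, g(5), g(6)))

print(printed)  # the letters printed by f and g
print(value)    # the value of the branch that was used
```

In a language with eager arguments, the same call would run g(5) and g(6) before my_if started, so g would be printed twice.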
The basic logic of using ChatGPT is called prompting, which is telling the AI what you want to achieve. Here we want ChatGPT to help us analyze whether our step-by-step analysis of the function exceptional day is right.
Consider the following two functions:
emo_state <- function(score){
if(score >= 98){
"Hooray"
}else if(score >= 95){
"OK"
}else{
"PANIC"
}
}
cat_emo_state <- function(score){
if(score >= 98){
cat("Hooray")
}else if(score >= 95){
cat("OK")
}else{
cat("PANIC")
}
}
The function emo_state computes a value, just like our functions f and g above. For example, emo_state(99) evaluates to "Hooray". You can see this in the following example:
a <- emo_state(99)
cat(a)
Hooray
We first evaluated emo_state(99) and put it in a. Then we gave an instruction to R to print the value of a to the screen.
On the other hand, consider this piece of code:
a <- cat_emo_state(99)
Hooray
This already had the effect of printing an output to the screen. That's because when R sees cat("Hooray"), it outputs Hooray to the screen. But the value that the function computes, which is the value of cat("Hooray"), is not "Hooray". In fact, it is NULL:
a
NULL
You will very rarely need to use cat inside functions. Outside of Precept 1, you just should not use it at all in this course.
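The difference between computing a value and printing one can be made concrete with a short experiment. This sketch redefines the two functions from the notes so that it is self-contained; the variable names a, b and printed are chosen here for illustration, and capture.output is used only to record what cat sends to the screen.

```r
emo_state <- function(score) {
  if (score >= 98) {
    "Hooray"
  } else if (score >= 95) {
    "OK"
  } else {
    "PANIC"
  }
}

cat_emo_state <- function(score) {
  if (score >= 98) {
    cat("Hooray")
  } else if (score >= 95) {
    cat("OK")
  } else {
    cat("PANIC")
  }
}

a <- emo_state(99)                                  # a holds the string itself
printed <- capture.output(b <- cat_emo_state(99))   # text is printed, b holds cat's value

print(a)
print(printed)
print(is.null(b))   # cat returns NULL, so b carries no information
```

Since b is NULL, the result of cat_emo_state cannot be reused in later computations, which is the practical reason to return values instead of printing them.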
A vector is an ordered collection of elements of the same type.
Here are examples of vectors in R.
c(5,6,7)
c("dog","cat")
v <- c(TRUE,FALSE,TRUE,TRUE)
We can access an element of a vector by giving its position in square brackets. Positions start at 1.
v[1]
v[2]
TRUE
FALSE
v <- c(2,-50,2,4)
sorted_v <- sort(v)
Note that the sort function in R sorts a vector in non-decreasing order. sorted_v is (-50, 2, 2, 4) here.
v <- c(2,-50,2,4)
length(v)
The length function gets the length of a vector. In this case, the output should be 4.
v <- c(2,-50,2,4)
max(v)
The max function gets the maximum element of a vector. In this case, the output should be 4.
Answer
Note that the last element of the sorted vector is the biggest element in the vector.
my_max <- function(v){
  sort(v)[length(v)]
}
my_max(c(5,2,5,10,2))
####################
sort(c(5,2,5,10,2))[length(c(5,2,5,10,2))]
####################
c(2,2,5,5,10)[5]
####################
10
v <- c(5,3)
v <- c(v,1)
v is now 5 3 1
Suppose we wanted something fancier: extracting all elements between 400 and 550. In other words, suppose we want the elements that are both larger than 400 and smaller than 550. To achieve this, we would want to combine the conditions "larger than 400" and "smaller than 550".
We can do this using logical operators
AND: a & b. TRUE only if a is TRUE and b is TRUE. FALSE otherwise
OR: a | b. TRUE if at least one of a or b is TRUE, FALSE otherwise
NOT: !a. TRUE if a is FALSE, FALSE if a is TRUE
In mathematics, these definitions are summarized in truth tables.
Some examples (credit: SML201 2020):
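A truth table for the three operators can be generated in R itself. This is a small sketch; the data frame truth and its column names are made up for this illustration, and expand.grid is used to list every combination of a and b.

```r
# All four combinations of two logical values
truth <- expand.grid(a = c(TRUE, FALSE), b = c(TRUE, FALSE))

# The operators work element by element on whole columns
truth$a_and_b <- truth$a & truth$b
truth$a_or_b  <- truth$a | truth$b
truth$not_a   <- !truth$a

print(truth)
```

Each row of the printed table is one line of the truth table: a_and_b is TRUE only in the row where both inputs are TRUE, and a_or_b is FALSE only in the row where both are FALSE.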
pie <- TRUE
icecream <- FALSE
pie | icecream
pie <- FALSE
icecream <- FALSE
pie | icecream
pie <- TRUE
icecream <- FALSE
pie & icecream
pie <- TRUE
icecream <- TRUE
pie | icecream
Note: this is not quite how it works in English. If I say I will have pie or ice cream, and then have both, that means what I said wasn't true. But for R, the expression pie | icecream is TRUE. Technically, | is called "inclusive OR" (as opposed to the "exclusive OR" we usually mean in English).
pie <- FALSE
icecream <- TRUE
pie | icecream
pie <- FALSE
icecream <- FALSE
pie | icecream
pie <- TRUE
!pie
So how do you make an expression in R that corresponds to "I will have ice cream or pie"? That is, we want to write an expression that will be TRUE whenever pie or icecream is TRUE, but not both.
Here are several ways of accomplishing this. They all do the same thing:
(pie | icecream) & !(pie & icecream)
(pie == TRUE & icecream == FALSE) | (pie == FALSE & icecream == TRUE)
(pie & !icecream) | (!pie & icecream)
pie != icecream
xor(pie, icecream)
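One way to convince yourself that the five exclusive-OR expressions agree is to evaluate them on every combination of inputs at once. In this sketch, combos and the names v1 to v5 are illustrative, and pie and icecream are vectors holding all four cases instead of single values.

```r
# Every combination of pie and icecream, stored as two logical vectors
combos <- expand.grid(pie = c(TRUE, FALSE), icecream = c(TRUE, FALSE))
pie <- combos$pie
icecream <- combos$icecream

v1 <- (pie | icecream) & !(pie & icecream)
v2 <- (pie == TRUE & icecream == FALSE) | (pie == FALSE & icecream == TRUE)
v3 <- (pie & !icecream) | (!pie & icecream)
v4 <- pie != icecream
v5 <- xor(pie, icecream)

# TRUE only if all five expressions give the same answer in all four cases
all_agree <- all(v1 == v2 & v2 == v3 & v3 == v4 & v4 == v5)
print(all_agree)
```

With only two logical inputs there are just four cases, so this check is exhaustive.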
Select elements in v that are between 0 and 3
v <- c(5, 2, -1, 1)
v[v >= 0 & v<= 3 ]
2 1
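The selection v[v >= 0 & v <= 3] can be traced by naming each intermediate logical vector. The names at_least_0, at_most_3, keep and selected below are chosen for this illustration only.

```r
v <- c(5, 2, -1, 1)

at_least_0 <- v >= 0                 # TRUE  TRUE  FALSE TRUE
at_most_3  <- v <= 3                 # FALSE TRUE  TRUE  TRUE
keep <- at_least_0 & at_most_3       # FALSE TRUE  FALSE TRUE

# Indexing with a logical vector keeps the positions marked TRUE
selected <- v[keep]
print(selected)
```

The comparison operators and & all work element by element, which is why one expression can test every element of v.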
Data frames are R's way of storing tables (note that the salary data we had was also actually a table).
You can define a data frame using the following syntax:
offers <- data.frame(amount = c(241, 590, 533),
                     spec = c("family doc", "cardio", "ortho"))
Note that amount and spec are the names of columns in the new data frame. We define data frames column-by-column.
The value of offers will be displayed as follows in the console:
offers
Let's load a data frame (you must previously have successfully run install.packages("babynames")).
library(babynames) # Load the package that provides the babynames data frame
Here are the first several rows of the data frame
head(babynames) # Display the first 6 rows of a data frame
We can access, for example, row 2 and column "year" of the table like so:
babynames[2, "year"]
If we want to access all of row 2, we can omit the second part:
babynames[2, ]
We can access rows 2 through 6 like so:
babynames[2:6, ]
We can take only the columns "n" and "year", like this:
babynames[2:6, c("n", "year")]
Finally, if we want a particular column as a vector (rather than as a data frame), we can do the following (make sure to keep track of the quotes)
babynames[5:20, ]$name
Note that sometimes df[, "colname"] will yield a vector (for a plain data.frame), and sometimes it will yield a data frame (for a tibble, such as babynames). df$colname will always yield a vector. So if you want a vector, use the $ operator.
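The same indexing patterns can be tried without installing anything by using the small offers data frame defined earlier in these notes. The sketch below rebuilds it so that it is self-contained; the variable names on the left-hand sides are illustrative, and stringsAsFactors = FALSE is passed explicitly so that spec is a character column on older versions of R as well.

```r
offers <- data.frame(amount = c(241, 590, 533),
                     spec = c("family doc", "cardio", "ortho"),
                     stringsAsFactors = FALSE)

one_cell  <- offers[2, "amount"]               # row 2, column "amount"
one_row   <- offers[2, ]                       # all of row 2 (a data frame)
some_rows <- offers[2:3, ]                     # rows 2 through 3
two_cols  <- offers[2:3, c("spec", "amount")]  # chosen columns, in the order given
spec_vec  <- offers$spec                       # one column as a vector

print(one_cell)
print(one_row)
print(two_cols)
print(spec_vec)
```

Note that two_cols lists spec before amount, because the columns come back in the order they are requested.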
Let's now write code that finds the most common female name in 1999:
babies_1999 <- babynames[babynames$year == 1999 & babynames$sex == "F", ]
max_name_count <- max(babies_1999$n)
(babies_1999$name)[max_name_count == babies_1999$n]
Let's make this into a more general function:
most_common_name <- function(babynames, year, sex){
baby <- babynames[babynames$year == year & babynames$sex == sex, ]
(baby$name)[baby$n == max(baby$n)]
}
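To see most_common_name in action without loading the babynames package, it can be run on a tiny hand-made table. Everything in toy_names is invented for this illustration (the names and counts are not real data); only the column names year, sex, name and n follow babynames.

```r
most_common_name <- function(babynames, year, sex) {
  baby <- babynames[babynames$year == year & babynames$sex == sex, ]
  (baby$name)[baby$n == max(baby$n)]
}

# Made-up data with the same column names as babynames
toy_names <- data.frame(
  year = c(1999, 1999, 1999, 1999, 2000),
  sex  = c("F", "F", "M", "M", "F"),
  name = c("Ann", "Bea", "Cal", "Dan", "Bea"),
  n    = c(30, 20, 40, 35, 25),
  stringsAsFactors = FALSE
)

top_f_1999 <- most_common_name(toy_names, 1999, "F")
top_m_1999 <- most_common_name(toy_names, 1999, "M")
top_f_2000 <- most_common_name(toy_names, 2000, "F")

print(c(top_f_1999, top_m_1999, top_f_2000))
```

Inside the function, the parameter called babynames refers to whatever data frame is passed in, which is why the function works on toy_names too.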
To use pipes, first you need to install and load the tidyverse library. Use the following:
install.packages("tidyverse")
library(tidyverse)
Let's define two functions:
f <- function(x){
x^2
}
g <- function(y){
y + 1
}
We can compute f(g(5)) and g(f(5)). Note that those are not the same: f(g(5)) is 36, while g(f(5)) is 26.
Now, we'll compute the same quantities using pipes:
5 %>% f %>% g
This is the same as computing g(f(5)). The way to think about it is this: we start with 5, then apply f to 5 and obtain f(5), and then apply g to 5 %>% f (i.e., f(5)) and obtain g(f(5)).
The shortcut for typing %>% in RStudio is command+shift+M (ctrl+shift+M on Windows and Linux).
To compute f(g(5)), we can use:
5 %>% g %>% f
This is the same as
5 %>% g() %>% f()
If the only thing that we're sending to f is g(5), the parentheses are optional in this notation.
This is useful mostly because it's easier to read something like x %>% f1 %>% f2 %>% f3 %>% f4 than to read f4(f3(f2(f1(x)))).
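The order of application in a pipeline can be checked against ordinary nested calls. The sketch below makes one substitution that should be stated plainly: it uses R's built-in pipe |> (available in R 4.1 and later) instead of %>%, so that it runs without loading tidyverse. For simple chains like these the two behave the same way, except that |> always requires the parentheses after the function name.

```r
f <- function(x) {
  x^2
}
g <- function(y) {
  y + 1
}

# Pipelines read left to right: the value flows into f first, then g
f_then_g <- 5 |> f() |> g()
g_then_f <- 5 |> g() |> f()

# The same computations written as nested calls
nested_gf <- g(f(5))
nested_fg <- f(g(5))

print(c(f_then_g, nested_gf))
print(c(g_then_f, nested_fg))
```

Reading the pipeline left to right gives the order in which the functions run, while the nested form has to be read from the inside out.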
The filter function is used to select rows from a data frame. The following are all equivalent. They are different ways of selecting rows for which the year is 1880 and sex is "F".
filter(babynames, year == 1880, sex == "F")
idx <- (babynames$year == 1880) & (babynames$sex == "F")
babynames[idx, ]
babynames[(babynames$year == 1880) & (babynames$sex == "F"), ]
babynames %>% filter(year == 1880 & sex == "F")
babynames %>% filter(year == 1880, sex == "F")
The last way is probably preferable, though different people might have different preferences.
We can use select as follows:
b <- babynames %>% select(year, name)
b[1:20, ]
We only kept the year and name columns here.
First, let's use summarise to compute the average number of babies per name:
babynames %>% summarise(pername = mean(n))
Inside of summarise, we can refer to columns in the data frame we are processing by name, without the $ operator. (And you shouldn't be using the $ operator.) This is basically the same as not using summarise and computing mean(babynames$n) -- not that interesting.
The real power of summarise is in using group_by -- we can group rows and compute a function such as mean for every group of rows separately. For example, we could compute the average number of babies per name for different sexes separately.
babynames %>% group_by(sex) %>%
summarise(b.pername = mean(n))
We see that the number of babies per name for "F" is much smaller than for "M" -- female names are more diverse with fewer babies per name, and male names are less diverse with more babies per name.
- Go through all the data wrangling tools that tidyverse provides
- Go through tracing of data wrangling
We introduced summarise and group_by in the last lecture.
arrange is used to produce a data frame sorted in whatever way you want to specify. See what happens when you run
View(babynames %>% arrange(name))
The resultant data frame is sorted alphabetically by name. If you want to sort things in descending order, use
View(babynames %>% arrange(desc(name)))
Note that, similarly to filter and other functions, this does not change babynames: it just produces a new data frame with the same contents, sorted in the way we specify. You can also sort by year, and then also sort by name, within the same year. Try running and viewing the following:
babynames %>% arrange(year, name)
group_by sets groups in the data frame without changing the order of the rows, while arrange reorders the rows.
select is used to produce a data frame with only the specified columns (the columns are in the order that you specify):
wanted <- babynames %>% select(year, sex, n)
Now wanted stores a data frame with 3 columns: year, sex and n (the number of babies).
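Let's compute the total GDP of each country in gapminder. The data frame already contains the GDP per capita (GDP divided by the population) in the column gdpPercap, so the total GDP is gdpPercap * pop.
We can use mutate to compute a new column (dividing by 1000 expresses the GDP in thousands):
g_gdp <- gapminder %>% mutate(gdp = gdpPercap * pop/1000)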
The share of world GDP for each country for each year (because of group_by(year), the sum is taken within each year separately):
g_gdp_share <- gapminder %>% group_by(year) %>% mutate(gdp_share = gdpPercap * pop/sum(gdpPercap * pop))
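The grouped computation of GDP shares can be traced by hand on a tiny table. This sketch uses base R only, so it runs without dplyr or gapminder: the data frame toy and all of its numbers are invented for the illustration, and ave(x, group, FUN = sum) plays the role that sum plays inside a grouped mutate, giving each row the total of its own group.

```r
# Made-up data: two countries observed in two years
toy <- data.frame(
  country   = c("A", "B", "A", "B"),
  year      = c(2000, 2000, 2005, 2005),
  pop       = c(10, 30, 20, 20),
  gdpPercap = c(3, 1, 2, 3)
)

toy$gdp <- toy$gdpPercap * toy$pop                 # 30 30 40 60

# For every row, the total GDP of all rows with the same year
year_total <- ave(toy$gdp, toy$year, FUN = sum)    # 60 60 100 100

toy$gdp_share <- toy$gdp / year_total              # 0.5 0.5 0.4 0.6
print(toy)

# Within each year, the shares add up to 1
share_sums <- tapply(toy$gdp_share, toy$year, sum)
print(share_sums)
```

Without the grouping, every GDP would be divided by the total over all years, and the shares within a single year would no longer add up to 1.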
dplyr's distinct is similar to the function unique, which we already saw, but it operates on data frames. It takes in a data frame, and returns all the distinct rows. That is, duplicate rows are not included in the returned data frame.
Let's try this:
b.total <- babynames %>% mutate(total_by_year = round(n/prop))
b.total %>% select(sex, year, total_by_year) %>% distinct
What can you conclude from this?
We can also give arguments to distinct. (For precise language fans: an argument is a value we pass to the function; a parameter is the variable we assign the argument to).
babynames %>% distinct(year)
babynames %>% distinct(name, year)
Note that there are fewer distinct combinations of (name, year) than there are rows in babynames, because the same name can appear in a given year once for each sex.
Suppose we want the largest life expectancy that each country achieved based on the data frame below.
country        lifeExp
Canada         73.0
Canada         76.4
United States  70.0
United States  75.0
gapminder %>% group_by(country) %>% summarize(max_lifeExp = max(lifeExp))
#trace:
#--------------------------------------
df <- data.frame(country = c("Canada", "United States"), max_lifeExp = c(max(c(73, 76.4)), max(c(70, 75))))
#-------------------------------
df <- data.frame(country = c("Canada", "United States"), max_lifeExp = c(76.4, 75))
The output is
country        max_lifeExp
Canada         76.4
United States  75.0
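The trace of group_by followed by summarize can also be carried out step by step in base R, which makes the grouping explicit. This is a sketch of the idea, not how dplyr is implemented: split forms one group of lifeExp values per country, and sapply applies max to each group. The data frame life holds the four rows from the example; the other names are illustrative.

```r
life <- data.frame(
  country = c("Canada", "Canada", "United States", "United States"),
  lifeExp = c(73.0, 76.4, 70.0, 75.0),
  stringsAsFactors = FALSE
)

# Step 1 (group_by): one vector of lifeExp values per country
groups <- split(life$lifeExp, life$country)
print(groups)

# Step 2 (summarize): apply max to every group separately
max_by_country <- sapply(groups, max)

# Step 3: assemble the one-row-per-group result
result <- data.frame(country = names(max_by_country),
                     max_lifeExp = unname(max_by_country),
                     stringsAsFactors = FALSE)
print(result)
```

The result has one row per group, which is the general shape of a summarize applied after group_by.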