Sample Quiz 1 Solutions #32

Open
chendaniely opened this issue Jun 1, 2022 · 5 comments

Comments

@chendaniely

Solutions for the blank + short answer questions in the practice quiz

@chendaniely

Given the plot below, how would you change it to make it more effective? Also tell us why your changes would make it more effective.

image

Solution:

I would get rid of the pyramid visualization and use a bar graph instead. The x axis would be the labels (women, paid enough, paid too much) and the y axis would be the percentage. This would make the percentages easier to compare with one another and better show the actual differences between them.
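A minimal sketch of the suggested bar graph, assuming ggplot2 is installed. The percentages below are made-up placeholders, not the values from the original figure, and the `survey` data frame and its column names are hypothetical:

```r
library(ggplot2)

# Placeholder percentages: NOT the values from the original figure
survey <- data.frame(
  label = c("women", "paid enough", "paid too much"),
  percentage = c(30, 50, 20)
)

# One bar per label; bar height is the percentage
bar_plot <- ggplot(survey, aes(x = label, y = percentage)) +
  geom_col() +
  xlab("Label") +
  ylab("Percentage")
bar_plot
```

Because every bar starts from the same baseline, the heights can be compared directly.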

@chendaniely

chendaniely commented Jun 1, 2022

Given the code below, how would you edit it to map the column vaccine to the colour and shape of the points (re-write the code with the changes you would make).

compare_vacc_plot <- ggplot(world_vaccination, aes(x = year, y = pct_vaccinated)) + 
    geom_point(aes()) + 
    xlab('Year') + 
    ylab('Percentage Vaccinated')
compare_vacc_plot

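I'm using the diamonds data set so you can actually run the code. In the quiz code, the equivalent change is geom_point(aes(color = vaccine, shape = vaccine)).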

ggplot(diamonds, aes(x = carat, y = price)) + 
  geom_point(aes(color = cut, shape = cut)) + 
  xlab('Carat') + 
  ylab('Price')

# can put all in 1 aes
ggplot(diamonds, aes(x = carat, y = price, color = cut, shape = cut)) + 
  geom_point() + 
  xlab('Carat') + 
  ylab('Price')

# you can also put in line breaks
ggplot(diamonds, aes(x = carat,
                   y = price,
                   color = cut,
                   shape = cut)) + 
  geom_point() + 
  xlab('Carat') + 
  ylab('Price')

@chendaniely

List two advantages of using a database versus a plaintext file in local storage.

Solution:

https://datasciencebook.ca/reading.html#reading-data-from-a-database

https://datasciencebook.ca/reading.html#why-should-we-bother-with-databases-at-all

Databases are beneficial in a large-scale setting:

They enable storing large data sets across multiple computers with backups.
They provide mechanisms for ensuring data integrity and validating input.
They provide security and data access control.
They allow multiple users to access data simultaneously and remotely without conflicts and errors. For example, there are billions of Google searches conducted daily in 2021 (Real Time Statistics Project 2021). Can you imagine if Google stored all of the data from those searches in a single .csv file!? Chaos would ensue!
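As a rough illustration of working with a database from R, here is a sketch that assumes the DBI, RSQLite, dplyr, and dbplyr packages are installed. The table name `vaccination` and its values are hypothetical, and an in-memory SQLite database stands in for a real remote server:

```r
library(DBI)
library(dplyr)

# In-memory SQLite database (stand-in for a real, shared database server)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table with made-up values
dbWriteTable(con, "vaccination",
             data.frame(year = c(2000L, 2001L, 2002L),
                        pct_vaccinated = c(80, 85, 90)))

# tbl() creates a reference to the table; the filter is translated to SQL
# and only the matching rows are brought into R by collect()
vacc_db <- tbl(con, "vaccination")
recent <- vacc_db %>%
  filter(year > 2000) %>%
  collect()

dbDisconnect(con)
recent
```

The point of the sketch is that the filtering happens inside the database, so R only has to hold the rows it asked for.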

@chendaniely

Write an example of an untidy data set (it can be small and use commas to separate values into different columns, and line breaks to separate into rows). Now write that data in a tidy format.

Solution:

untidy:

country, 2000, 2001, 2002
canada, 10, 20, 30

tidy:

country, year, value
canada, 2000, 10
canada, 2001, 20
canada, 2002, 30
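For reference, the same untidy-to-tidy conversion can be sketched in R with `pivot_longer`, assuming the tidyr and tibble packages are installed. The object names `untidy` and `tidy` are just for illustration:

```r
library(tidyr)
library(tibble)

# The untidy example: one column per year
untidy <- tribble(
  ~country, ~`2000`, ~`2001`, ~`2002`,
  "canada", 10, 20, 30
)

# Move the year columns into a year/value pair of columns
tidy <- pivot_longer(
  untidy,
  cols = -country,
  names_to = "year",
  values_to = "value"
)
tidy
```

Note that `year` comes out as a character column here because it was built from column names; it can be converted with `as.integer` if needed.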

@chendaniely

chendaniely commented Jun 1, 2022

If the first 8 lines of a data file that you want to read into R look like this:

Data collected on 2018-06-23
Vancouver, BC
1    8    5    0    0    0
1    0    0    2    0    0
1    0    0    2    0    9
1    7    0    2    1    0
1    0    4    2    0    0
1    0    0    1    0    0

Fill in the missing pieces to the code below so that you could successfully read it into R (assume the tidyverse library has already been loaded):

traffic_data <- read_...("count.csv", ..., ...)

Solution:

traffic_data <- read_table("count.csv", skip = 2, col_names = FALSE)

Note: I ended up using read_delim in the class review, but after running the code to load the data, that turned out to be wrong. I also forgot col_names = FALSE in the class example. I will make sure that any example in the actual quiz makes the delimiter more obvious.
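To try the solution without the original file, here is a self-contained sketch that assumes the readr package is installed. It writes the eight lines from the question to a temporary file (the temporary file stands in for the quiz's data file) and reads them back:

```r
library(readr)

# Recreate the eight lines shown in the question in a temporary file
lines <- c(
  "Data collected on 2018-06-23",
  "Vancouver, BC",
  "1    8    5    0    0    0",
  "1    0    0    2    0    0",
  "1    0    0    2    0    9",
  "1    7    0    2    1    0",
  "1    0    4    2    0    0",
  "1    0    0    1    0    0"
)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# skip = 2 drops the two metadata lines; col_names = FALSE stops the
# first data row from being used as column names
traffic_data <- read_table(tmp, skip = 2, col_names = FALSE)
traffic_data
```

`read_table` treats any run of whitespace as a single separator, which is why it suits this file, where the columns are separated by several spaces.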
