Quiz 3 Practice Solutions #49

mgyliu opened this issue Jun 28, 2023 · 0 comments

Note: the order of the questions might be different for each student. Regression questions are excluded (S23). The answers here should be used as guidelines and you should consult the textbook for comprehensive explanations of each concept.

Written - Clustering

Describe a case where clustering would be an appropriate tool. In your description, be sure to give examples of variables you would have in the data set (and what each row in the data set represents). Also describe what insight it would bring from the data. Your answer should be 2-3 sentences long and in your own words.

Answer
Suppose Netflix collects data about user watch time. Variables could include: minutes watched per day, number of sessions per week, and number of unique shows in a year. A data scientist at Netflix can then use clustering to separate users into groups based on this viewership information. They can gain insights about the characteristics of users who tend to watch more/less compared to others.

Textbook: See palmerpenguins example. https://datasciencebook.ca/clustering.html#clustering-1

Written - K-means Elbow Plot

K-means clustering was performed on a data set for values of K from 1 to 9. Given the elbow plot below, choose the best K for this data set. Explain your answer in 1-2 sentences.

Answer
The best K appears to be 2. The total WSSD drops sharply from K = 1 to K = 2 and levels off after that, so K = 2 is the elbow of the plot.
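
The quiz's elbow plot is not reproduced here, but a similar one can be sketched in base R. The block below is only an illustration under stated assumptions: it uses the built-in faithful data as a stand-in for the quiz's data set, standardizes the variables first, and picks an arbitrary seed.

```r
# Assumption: the built-in faithful data stands in for the quiz's data set.
set.seed(123)  # arbitrary seed so the random starts are reproducible

# Standardize so both variables contribute equally to the distances.
scaled <- scale(faithful)

# Fit K-means for K = 1, ..., 9 and record the total WSSD of each fit.
ks <- 1:9
total_wssd <- sapply(ks, function(k) {
  kmeans(scaled, centers = k, nstart = 10)$tot.withinss
})

# Elbow plot: look for the K after which the curve flattens out.
plot(ks, total_wssd, type = "b", xlab = "K", ylab = "Total WSSD")
```

The elbow is read off by eye, so the choice of K from such a plot is a judgment call rather than an exact computation.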

Textbook: https://datasciencebook.ca/clustering.html#choosing-k

Coding - Faithful dataset

Given the faithful data set in R (previewed below), fill in the blanks in the code below to perform K-means clustering with K = 2. Modify the code to ensure that the K-means analysis is reproducible.

head(faithful)
eruptions waiting
    3.600      79
    1.800      54
    3.333      74
    2.283      62
    4.533      85
    2.883      55
...
kmeans(faithful, ..., ...)

Answer

set.seed(123)
kmeans(faithful, centers = 2, nstart = 10)
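
As a rough check of why the seed matters, the sketch below runs the same call twice with the same seed and compares the results. The seed value 123 is arbitrary, and the object names fit1 and fit2 are chosen for illustration only.

```r
# Same seed before each call, so the random starting centroids are the same.
set.seed(123)
fit1 <- kmeans(faithful, centers = 2, nstart = 10)

set.seed(123)
fit2 <- kmeans(faithful, centers = 2, nstart = 10)

# Check that both runs produce the same cluster assignments.
identical(fit1$cluster, fit2$cluster)

fit1$centers       # one row per cluster centroid
fit1$tot.withinss  # total WSSD of the selected clustering
```

Without set.seed(), the starting centroids (and possibly the cluster labels) can change from run to run.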

Textbook: https://datasciencebook.ca/clustering.html#k-means-in-r

Written - Explain point estimate

In your own words, explain what a point estimate is and why it is useful. Your answer should be 2-3 sentences long.

Answer
A point estimate is a single value computed from a sample that estimates a population parameter. If we take multiple samples of the same size from our population and compute a point estimate for each sample, we obtain a sampling distribution. We can then use the sampling distribution built from these estimates to report how confident we are in our estimate.
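
The idea can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the population is simulated (not real data), the sample size of 40 and the 1000 repetitions are arbitrary, and the sample mean is used as the point estimate.

```r
set.seed(2023)  # arbitrary seed

# Hypothetical population: 10,000 simulated values (not real data).
population <- rnorm(10000, mean = 50, sd = 10)

# A point estimate: the mean of one sample of size 40.
one_sample <- sample(population, size = 40)
point_estimate <- mean(one_sample)

# Sampling distribution: the same estimate computed on many samples of size 40.
sample_means <- replicate(1000, mean(sample(population, size = 40)))

mean(population)  # the population parameter (normally unknown)
point_estimate    # one estimate of it
sd(sample_means)  # how much the estimate varies from sample to sample
hist(sample_means, main = "Sampling distribution of the sample mean")
```

In practice we rarely have access to the whole population, which is why this kind of repeated sampling is usually only possible in a simulation.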

Textbook: https://datasciencebook.ca/inference.html#why-do-we-need-sampling

Written - Explain bootstrapping

In your own words, explain the bootstrapping process.

Answer

  1. Draw a sample of size $n$ from the population.
  2. Create a bootstrap sample of size $n$ by drawing $n$ observations from our sample in step 1 with replacement.
  3. Compute a point estimate on the bootstrap sample from step 2.
  4. Repeat steps 2-3 many times (say $N$ times) and plot a bootstrap distribution using the $N$ point estimates.
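
A minimal base R sketch of this process is shown below. It assumes the built-in faithful waiting times play the role of our single sample, uses the mean as the point estimate, and sets $N$ = 1000. Those choices, the seed, and the variable names are illustrative only.

```r
set.seed(1)  # arbitrary seed

# Step 1 (assumption): treat the built-in faithful waiting times as our
# single sample of size n from some larger population.
our_sample <- faithful$waiting
n <- length(our_sample)

# Steps 2 to 4: resample with replacement, compute the mean, repeat N times.
N <- 1000
boot_means <- replicate(N, {
  boot_sample <- sample(our_sample, size = n, replace = TRUE)
  mean(boot_sample)
})

# The N point estimates form the bootstrap distribution.
hist(boot_means, main = "Bootstrap distribution of the sample mean")
```

Note that only the one original sample is ever used. The resampling happens within it, never from the population again.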

Textbook: https://datasciencebook.ca/inference.html#bootstrapping

Written - Purpose of bootstrapping

In your own words, briefly describe the purpose of bootstrapping in inference.

Answer
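Most of the time it is not feasible (too expensive, too resource-intensive) to obtain multiple samples from our population. Bootstrapping lets us approximate the sampling distribution using only the one sample we have, so we can still judge how uncertain our point estimate is.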
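
One way to see this in practice is to turn a bootstrap distribution into a rough measure of uncertainty. The sketch below is an illustration under assumptions: it uses the built-in faithful eruption times as the single sample, the mean as the estimate, 1000 bootstrap samples, and a 95% percentile interval, none of which are prescribed by the quiz.

```r
set.seed(1)  # arbitrary seed

# Assumption: the faithful eruption times are our one available sample.
our_sample <- faithful$eruptions
n <- length(our_sample)

# Bootstrap distribution of the mean; sample() defaults to size n here.
boot_means <- replicate(1000, mean(sample(our_sample, replace = TRUE)))

# Spread of the bootstrap estimates versus the textbook formula sd / sqrt(n).
boot_se <- sd(boot_means)
formula_se <- sd(our_sample) / sqrt(n)

# A 95% percentile interval read off the bootstrap distribution.
ci <- quantile(boot_means, probs = c(0.025, 0.975))

boot_se
formula_se
ci
```

The two standard errors should be close, which suggests the bootstrap distribution is a reasonable stand-in for the sampling distribution we cannot observe directly.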

Textbook: https://datasciencebook.ca/inference.html#bootstrapping

Written - Explain WSSD

Describe how to compute the total within-cluster sum of squared distances (total WSSD) in K-means clustering, and what it is used for. Answer in 2-3 sentences in your own words.

Answer
The total WSSD is the sum of the squared distances between each data point and the centroid of the cluster it is assigned to. In K-means clustering, the total WSSD is used either to select the best clustering for a particular value of $k$ (lower is better), or to draw an elbow plot and select the best value of $k$.
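
The computation can be checked by hand against the value that kmeans() reports. This is a sketch assuming the built-in faithful data and K = 2. The names x, assigned_centers, and total_wssd are illustrative.

```r
set.seed(123)  # arbitrary seed
fit <- kmeans(faithful, centers = 2, nstart = 10)

x <- as.matrix(faithful)

# Look up the centroid of the cluster each observation belongs to
# (one row per data point).
assigned_centers <- fit$centers[fit$cluster, ]

# Squared Euclidean distance from each point to its own centroid, summed
# over all points and therefore over all clusters.
total_wssd <- sum((x - assigned_centers)^2)

# Compare with the total WSSD stored in the kmeans() result.
all.equal(total_wssd, fit$tot.withinss)
```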

Textbook: https://datasciencebook.ca/clustering.html#measuring-cluster-quality

Written - Explain K-means

In your own words, describe the K-means clustering algorithm, including all of its major steps.

Answer
K-means is an iterative procedure. First, initialize $k$ random centroids. Next, assign each data point to its closest centroid (by Euclidean distance). Then re-compute each centroid as the mean of the data points newly assigned to it. Repeat the assignment and re-computation steps until the centroids no longer move.

Because the $k$ centroids are initialized randomly in the first step, we typically run the whole process several times and keep the clustering with the lowest total WSSD, which avoids ending up with a particularly bad clustering by chance.
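
The steps above can be sketched as a small base R function. This is a teaching sketch, not how R's kmeans() is implemented: simple_kmeans is a hypothetical helper, it starts from $k$ randomly chosen data points, and it stops when the centroids move by less than a tiny tolerance or after max_iter rounds.

```r
# Hypothetical helper for illustration; not part of any package.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Initialize: pick k distinct data points as the starting centroids.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance to every centroid,
    # stored as an n-by-k matrix, then pick the closest centroid per row.
    d <- vapply(seq_len(k),
                function(j) rowSums(sweep(x, 2, centers[j, ])^2),
                numeric(nrow(x)))
    cluster <- max.col(-d, ties.method = "first")
    # Update step: each centroid becomes the mean of its assigned points
    # (a centroid with no points is left where it is).
    new_centers <- centers
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        new_centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
    moved <- max(abs(new_centers - centers))
    centers <- new_centers
    if (moved < 1e-12) break  # centroids stopped moving
  }
  list(cluster = cluster, centers = centers,
       tot.withinss = sum((x - centers[cluster, , drop = FALSE])^2))
}

# Repeat from 10 random starts and keep the run with the lowest total WSSD.
set.seed(123)  # arbitrary seed
runs <- lapply(1:10, function(i) simple_kmeans(faithful, k = 2))
best <- runs[[which.min(sapply(runs, function(r) r$tot.withinss))]]
best$centers
```

Keeping the best of several runs plays the same role as the nstart argument of kmeans().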

Textbook: https://datasciencebook.ca/clustering.html#k-means
