Statistics

Read Allen Downey's Think Stats (second edition) and Think Bayes to get up to speed with core ideas in statistics and how to approach them programmatically. Both books are freely available online, or you can buy physical copies if you prefer.

<img src="img/think_bayes.png" title="Think Bayes" style="float: left;" />

Instructions

The Think Stats book is approximately 200 pages long. We recommend reading the entire book, particularly if you are less familiar with introductory statistical concepts.

The stats exercises have been chosen to introduce or solidify statistical concepts relevant to data science. The solutions for these exercises are available in the ThinkStats repository on GitHub. You should focus on understanding the statistical concepts, programming in Python, and interpreting the results. If you are stuck, review the solutions and rewrite the Python code in a way that is more understandable to you.

For example, in the first exercise, the author has already written a function to compute Cohen's d. You could import it, or you could write your own to practice Python and develop a deeper understanding of the concept.

Complete the following exercises along with the questions in this file. They come from Think Stats, and some can be solved using code provided with the book. The preface of Think Stats explains how to use the code.

Communicate the problem, how you solved it, and the solution, within each of the following markdown files. (You can include code blocks and images within markdown.)


Instructions for cloning the repo

Using the code referenced in the book, follow the step-by-step instructions below.

Step 1. Create a directory on your computer where you will do the prework. Below is an example:

(Mac):      /Users/yourname/ds/metis/prework  
(Windows):  C:/ds/metis/prework

Step 2. cd into the prework directory. Use git to clone the ThinkStats2 repo to your computer.

$ git clone https://github.com/AllenDowney/ThinkStats2.git

Step 3. Put your IPython notebook or Python code files in this directory (that way, they can import the modules provided with the book):

(Mac):     /Users/yourname/ds/metis/prework/ThinkStats2/code  
(Windows):  C:/ds/metis/prework/ThinkStats2/code

###Required Exercises

Include your Python code, results and explanation (where applicable).

###Q1. Think Stats Chapter 2 Exercise 4 (effect size of Cohen's d)
Cohen's d is an example of an effect size. Other examples of effect sizes are the correlation between two variables, a difference in means, regression coefficients, and standardized test statistics such as t, Z, and F. In this exercise, you will compute Cohen's d to quantify the difference between two groups of data.

You will see effect sizes again and again in the results of the algorithms used in data science. For instance, in the bootcamp, when you run a regression analysis, you will recognize the t-statistic as an example of effect size.
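If you decide to write your own function, a minimal sketch might look like the following. It divides the difference in means by a pooled standard deviation; the pooling used here (population variances weighted by group size) is one common convention and is an assumption, so it may differ slightly from the helper shipped with the book. The function name and the sample data are made up for illustration.

```python
import math
from statistics import mean, pvariance


def cohen_d(group1, group2):
    """Difference in means divided by a pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    # Assumption: pool the population variances, weighted by group size.
    pooled_var = (n1 * pvariance(group1) + n2 * pvariance(group2)) / (n1 + n2)
    return (mean(group1) - mean(group2)) / math.sqrt(pooled_var)


# Made-up data: two small groups whose means differ by 1.
d = cohen_d([2, 4, 6], [1, 3, 5])
print(round(d, 3))
```

A value of d near 0 means the groups overlap almost completely; larger magnitudes mean the means are further apart relative to the spread of the data.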

###Q2. Think Stats Chapter 3 Exercise 1 (actual vs. biased)
This problem presents a robust example of actual vs. biased data. As a data scientist, it will be important to examine not only the data that is available, but also the data that may be missing but highly relevant. You will see how the absence of this relevant data will bias a dataset, its distribution, and ultimately, its statistical interpretation.
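To make the idea of a biased distribution concrete, here is a small sketch that uses plain dictionaries as PMFs. It assumes the bias arises because each value x is observed in proportion to x (for example, asking children how many children are in their household, so childless households are never sampled). The numbers are invented, not taken from the book's dataset, and the helper names are hypothetical.

```python
def normalize(pmf):
    """Rescale the probabilities so they sum to 1."""
    total = sum(pmf.values())
    return {x: p / total for x, p in pmf.items()}


def bias_pmf(pmf):
    """Weight each value x by x, then renormalize."""
    return normalize({x: p * x for x, p in pmf.items()})


def pmf_mean(pmf):
    return sum(x * p for x, p in pmf.items())


# Invented distribution of children per household.
actual = {0: 0.5, 1: 0.2, 2: 0.2, 3: 0.1}
biased = bias_pmf(actual)

print(pmf_mean(actual), pmf_mean(biased))
```

The biased mean is noticeably larger than the actual mean because households with no children contribute nothing to the biased sample.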

###Q3. Think Stats Chapter 4 Exercise 2 (random distribution)
This question asks you to examine the function that produces random numbers. Is it really random? A good way to test that is to examine the PMF and CDF of the list of random numbers and visualize the distribution. If you're not sure what a PMF is, read more about it in Chapter 3.
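One way to check the generator without plotting is to compare the empirical CDF against the CDF of a uniform distribution, which is simply CDF(x) = x on [0, 1]. The sketch below uses the standard library's `random` module rather than the generator named in the book, and the sample size, seed, and helper name are arbitrary choices for illustration.

```python
import random


def empirical_cdf(sample, x):
    """Fraction of the sample that is less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)


rng = random.Random(0)  # fixed seed so the run is repeatable
sample = [rng.random() for _ in range(1000)]

# For uniform data, the empirical CDF at x should be close to x itself.
checkpoints = [i / 10 for i in range(11)]
max_gap = max(abs(empirical_cdf(sample, x) - x) for x in checkpoints)
print(max_gap)
```

A small maximum gap is consistent with a uniform distribution. A plot of the CDF should look like a straight diagonal line, whereas a PMF of the raw values is uninformative because nearly every value is unique.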

###Q4. Think Stats Chapter 5 Exercise 1 (normal distribution of blue men)
This is a classic example of hypothesis testing using the normal distribution. The effect size used here is the Z-statistic.
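The core computation is the area under a normal curve between two heights. The sketch below uses only the standard library. The mean, standard deviation, and height limits are values recalled from the exercise statement and should be treated as assumptions; check them against the book before relying on the result.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, written with the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


# Assumed parameters for adult male height in cm (verify against the book).
MU, SIGMA = 178.0, 7.7

# Assumed height limits: 5'10" and 6'1", converted from inches to cm.
low_cm = (5 * 12 + 10) * 2.54
high_cm = (6 * 12 + 1) * 2.54

fraction = normal_cdf(high_cm, MU, SIGMA) - normal_cdf(low_cm, MU, SIGMA)
print(f"{fraction:.1%} of the population falls in the range")
```

The Z-statistic mentioned above is the quantity `(x - mu) / sigma` inside `normal_cdf`: it expresses each height limit as a number of standard deviations from the mean.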

###Q5. Bayesian (Elvis Presley twin)

Bayes' Theorem is an important tool for quantifying what we really know, given the evidence we have. It helps us incorporate conditional probabilities into our conclusions.

Elvis Presley had a twin brother who died at birth. What is the probability that Elvis was an identical twin? Assume we observe the following probabilities in the population: fraternal twin is 1/125 and identical twin is 1/300.

Answer: 5/17.
Let A be the event of having an identical twin, so P(A) = 1/300.
Let B be the event of having any twin, so P(B) = 1/300 + 1/125 = 17/1500.
Since A is a subset of B, P(B|A) = 1.
Therefore, P(A|B) = P(A) P(B|A) / P(B) = (1/300) / (17/1500) = 5/17.
Alternatively, we can take into account the fact that his twin was a boy. Identical twins are always the same sex, while a fraternal twin is a brother only about half the time, so the fraternal term is halved: (1/300) / (1/300 + (1/2)(1/125)) = (1/300) / (11/1500) = 5/11.
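Both answers can be checked with exact arithmetic. The sketch below is one way to do it, using a small hypothetical helper that multiplies each prior by a likelihood and renormalizes. The likelihood of 1/2 for a fraternal twin being a brother is an assumption (equal chance of a boy or a girl).

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Bayes' Theorem over a set of hypotheses: prior times likelihood, normalized."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


priors = {"identical": Fraction(1, 300), "fraternal": Fraction(1, 125)}

# Evidence: Elvis had a twin (equally certain under both hypotheses).
any_twin = posterior(priors, {"identical": 1, "fraternal": 1})

# Evidence: Elvis had a twin brother (assumed probability 1/2 if fraternal).
twin_brother = posterior(priors, {"identical": 1, "fraternal": Fraction(1, 2)})

print(any_twin["identical"])      # 5/17
print(twin_brother["identical"])  # 5/11
```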


###Q6. Bayesian & Frequentist Comparison
How do frequentist and Bayesian statistics compare?

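Frequentist statistics treats model parameters as fixed but unknown quantities and interprets probability as a long-run frequency: given the data, it produces estimates, confidence intervals, and p-values under a pre-specified model. Bayesian statistics instead treats the parameters themselves as uncertain. It starts from a prior belief about them and uses Bayes' Theorem to update that belief into a posterior each time new data arrives.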
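A coin-flip example can make the contrast concrete. The sketch below estimates the probability of heads two ways: a frequentist maximum-likelihood estimate, and the mean of a Bayesian posterior under a Beta prior. The function names, the uniform Beta(1, 1) prior, and the data are illustrative assumptions.

```python
def mle_heads(heads, flips):
    """Frequentist point estimate: the observed proportion of heads."""
    return heads / flips


def posterior_mean_heads(heads, flips, alpha=1.0, beta=1.0):
    """Mean of the Beta(alpha + heads, beta + tails) posterior.

    alpha and beta describe the prior; Beta(1, 1) is uniform on [0, 1].
    """
    return (alpha + heads) / (alpha + beta + flips)


# Invented data: 7 heads in 10 flips.
print(mle_heads(7, 10))
print(posterior_mean_heads(7, 10))
```

With little data, the Bayesian estimate is pulled toward the prior mean of 0.5. As the number of flips grows, the two estimates converge.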


###Optional Exercises

The following exercises are optional, but we highly encourage you to complete them if you have the time.

###Q7. Think Stats Chapter 7 Exercise 1 (correlation of weight vs. age)
In this exercise, you will compute the effect size of correlation. Correlation measures the relationship of two variables, and data science is about exploring relationships in data.
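As a starting point, Pearson's correlation coefficient can be written in a few lines. This is a sketch with invented numbers, not the book's data or code; note that Pearson's r only captures linear relationships, so it is worth looking at a scatter plot as well.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented ages and weights, for illustration only.
ages = [18, 22, 25, 30, 35, 40]
weights = [6.9, 7.1, 7.4, 7.2, 7.6, 7.5]
print(round(pearson_r(ages, weights), 3))
```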

###Q8. Think Stats Chapter 8 Exercise 2 (sampling distribution)
In the theoretical world, all data related to an experiment or a scientific problem would be available. In the real world, some subset of that data is available. This exercise asks you to take samples from an exponential distribution and examine how the standard error and confidence intervals vary with the sample size.
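The experiment can be sketched as a simulation: draw many samples of size n, estimate the rate parameter from each, and summarize the spread of the estimates. The rate of 2, the 1000 trials, and the choice of root-mean-squared error as the standard error are assumptions for illustration, and the function name is hypothetical.

```python
import math
import random


def simulate_estimates(n, lam=2.0, trials=1000, seed=0):
    """Return (standard error, 90% interval) of the estimated rate for sample size n."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        sample = [rng.expovariate(lam) for _ in range(n)]
        estimates.append(1 / (sum(sample) / n))  # rate estimate = 1 / sample mean

    # Standard error, taken here as the RMSE of the estimates around the true rate.
    std_err = math.sqrt(sum((est - lam) ** 2 for est in estimates) / trials)

    # 90% interval: the 5th and 95th percentiles of the sampling distribution.
    estimates.sort()
    interval = (estimates[int(0.05 * trials)], estimates[int(0.95 * trials)])
    return std_err, interval


for n in (10, 100, 1000):
    print(n, simulate_estimates(n))
```

Running this for increasing n should show the standard error shrinking and the interval tightening around the true rate.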

###Q9. Think Stats Chapter 6 Exercise 1 (skewness of household income)

###Q10. Think Stats Chapter 8 Exercise 3 (scoring)

###Q11. Think Stats Chapter 9 Exercise 2 (resampling)


More Resources

Some people enjoy video content such as Khan Academy's Probability and Statistics or the much longer and more in-depth Harvard Statistics 110. You might also be interested in the book Statistics Done Wrong or a very short overview from School of Data.