# Exercises

These exercises are a great way both to prepare for the bootcamp and to assess whether you've absorbed the prework content and are fully ready. There are 6 required and 6 optional problems, chosen to go along with your prework statistics training. They require (and teach) both Python and statistics skills. Have fun!

Ideally, you will finish the required exercises and submit them before the first day of the bootcamp, so that you can give your full focus and energy to the first week. If you are not able to finish them by then, you can still work on them during the first days and submit them by the end of the first week. In any case, please keep your instructors informed of your progress. If you have questions about the exercises or anything else, ask them on the [pre-work support thread] on Discourse (so others can see the answers, too). If you'd rather ask in private, you can send a PM (private message) to either instructor on Discourse, or even email them.

The answers can be in any form that shows your work: Python files and graphs, code and results pasted into an email, or an attached IPython notebook. We will accept any medium as long as we can see your answers.

Once you are ready to submit, please email the results to your instructors, Irmak and Bo, at irmak@datasco.pe and bo@datasco.pe.

## Setup

As mentioned in the Preface of Think Stats (section "Using the Code"), the book comes with accompanying code and data. You can get these from the Think Stats repository; "Using the Code" explains different ways of getting the files if you are unfamiliar with GitHub. The repository also includes some IPython notebooks.

We will learn, use, and get very familiar with IPython notebooks in class, but if you want to learn more about them ahead of time and use them for these exercises, you can check out the documentation. You can also optionally try your hand at Think Stats Exercise 1.1, which gives you an IPython notebook and asks a few questions.

## Required Exercises

### 1) Think Stats Exercise 2.4

Using the variable totalwgt_lb, investigate whether first babies are lighter or heavier than others. Compute Cohen’s d to quantify the difference between the groups. How does it compare to the difference in pregnancy length?
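If you want a concrete starting point, here is a minimal sketch of Cohen's d in plain NumPy. The commented usage lines assume the live-births DataFrame and column names from the book's nsfg module, so treat them as illustrative rather than definitive:

```python
import numpy as np

def cohen_effect_size(group1, group2):
    """Cohen's d: difference in means over the pooled standard deviation."""
    diff = np.mean(group1) - np.mean(group2)
    n1, n2 = len(group1), len(group2)
    pooled_var = (n1 * np.var(group1) + n2 * np.var(group2)) / (n1 + n2)
    return diff / np.sqrt(pooled_var)

# Illustrative usage, assuming the book's nsfg module is on your path:
# import nsfg
# live = nsfg.ReadFemPreg().query('outcome == 1')
# firsts = live[live.birthord == 1].totalwgt_lb.dropna()
# others = live[live.birthord != 1].totalwgt_lb.dropna()
# print(cohen_effect_size(firsts, others))
```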

### 2) Think Stats Exercise 3.1

Something like the class size paradox appears if you survey children and ask how many children are in their family. Families with many children are more likely to appear in your sample, and families with no children have no chance to be in the sample.

Use the NSFG respondent variable NUMKDHH to construct the actual distribution for the number of children under 18 in the household.

Now compute the biased distribution we would see if we surveyed the children and asked them how many children under 18 (including themselves) are in their household.

Plot the actual and biased distributions, and compute their means. As a starting place, you can use chap03ex.ipynb, an IPython notebook from the ThinkStats2 repository.
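If you'd rather not use the book's thinkstats2.Pmf class, the biasing step is small enough to sketch with plain dicts. The distribution values below are placeholders for the NUMKDHH counts you compute:

```python
def bias_pmf(pmf):
    """Reweight a PMF (dict of value -> probability) by family size:
    a family with x children is x times as likely to be reported by a
    surveyed child, and families with x == 0 drop out entirely."""
    biased = {x: p * x for x, p in pmf.items()}
    total = sum(biased.values())
    return {x: p / total for x, p in biased.items()}

def pmf_mean(pmf):
    return sum(x * p for x, p in pmf.items())

# Placeholder PMF; replace with the actual NUMKDHH distribution.
actual = {0: 0.47, 1: 0.21, 2: 0.20, 3: 0.09, 4: 0.02, 5: 0.01}
print(pmf_mean(actual), pmf_mean(bias_pmf(actual)))
```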

### 3) Think Stats Exercise 4.2

The numbers generated by random.random are supposed to be uniform between 0 and 1; that is, every value in the range should have the same probability.

Generate 1000 numbers from random.random and plot their PMF and CDF. Is the distribution uniform?
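One thing worth knowing before you start: with 1000 distinct floats, the PMF is flat at 1/1000 everywhere, so the empirical CDF is usually the more readable plot. A minimal sketch with matplotlib (the book's thinkplot module works too):

```python
import random
import matplotlib.pyplot as plt

sample = [random.random() for _ in range(1000)]

# Empirical CDF: sort the values; the k-th smallest has cumulative
# probability k / n.
xs = sorted(sample)
ps = [(i + 1) / len(xs) for i in range(len(xs))]

plt.plot(xs, ps)
plt.xlabel('value from random.random()')
plt.ylabel('CDF')
plt.show()
```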

### 4) Think Stats Exercise 7.1

Using data from the NSFG, make a scatter plot of birth weight versus mother’s age. Plot percentiles of birth weight versus mother’s age. Compute Pearson’s and Spearman’s correlations. How would you characterize the relationship between these variables?
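scipy computes both correlations directly; here is a minimal sketch. The column names in the commented usage assume the book's NSFG DataFrame and are illustrative:

```python
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr

def scatter_and_correlations(ages, weights):
    """Scatter plot plus Pearson's and Spearman's correlations."""
    plt.scatter(ages, weights, alpha=0.1, s=10)
    plt.xlabel("mother's age (years)")
    plt.ylabel('birth weight (lbs)')

    r_pearson, _ = pearsonr(ages, weights)
    r_spearman, _ = spearmanr(ages, weights)
    return r_pearson, r_spearman

# Illustrative usage, assuming the book's live-births DataFrame:
# live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
# print(scatter_and_correlations(live.agepreg, live.totalwgt_lb))
```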

### 5) Think Stats Exercise 8.2

Suppose that you draw a sample with size n = 10 from an exponential distribution with λ = 2. Simulate this experiment 1000 times and plot the sampling distribution of the estimate L (the reciprocal of the sample mean). Compute the standard error of the estimate and the 90% confidence interval.

Repeat the experiment with a few different values of n and make a plot of standard error versus n.
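A minimal simulation sketch, assuming the estimator L = 1 / (sample mean) and measuring standard error as RMSE against the true λ, which is one common convention:

```python
import numpy as np

def simulate_L(lam=2.0, n=10, iters=1000, seed=None):
    """Sampling distribution of L = 1 / sample mean for Expo(lam)."""
    rng = np.random.default_rng(seed)
    # numpy parameterizes the exponential by scale = 1 / rate
    samples = rng.exponential(scale=1/lam, size=(iters, n))
    return 1.0 / samples.mean(axis=1)

estimates = simulate_L()
stderr = np.sqrt(np.mean((estimates - 2.0) ** 2))   # RMSE against true lam
ci_90 = np.percentile(estimates, [5, 95])           # 90% confidence interval
print(stderr, ci_90)
```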

### 6) Think Bayes Exercise 2.1

The cookie problem is discussed in Sections 1.3, 2.2, and 2.3 of Think Bayes. Solve the following problem. In Section 2.3, Downey says that the solution to the cookie problem generalizes to the case where we draw multiple cookies with replacement. But in the more likely scenario where we eat the cookies we draw, the likelihood of each draw depends on the previous draws.

Modify the solution in that chapter to handle selection without replacement. Hint: add instance variables to Cookie to represent the hypothetical state of the bowls, and modify Likelihood accordingly. You might want to define a Bowl object.
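A minimal sketch of the structure without the book's Pmf class, using the bowl mixtures from the original problem (Bowl 1: 30 vanilla and 10 chocolate; Bowl 2: 20 of each). The shape of the update, not the names, is the point:

```python
class Bowl:
    """Hypothetical state of one bowl, updated as cookies are eaten."""
    def __init__(self, vanilla, chocolate):
        self.counts = {'vanilla': vanilla, 'chocolate': chocolate}

    def likelihood(self, flavor):
        total = sum(self.counts.values())
        return self.counts[flavor] / total if total else 0.0

    def remove(self, flavor):
        if self.counts[flavor] > 0:
            self.counts[flavor] -= 1

bowls = {'Bowl 1': Bowl(30, 10), 'Bowl 2': Bowl(20, 20)}
probs = {'Bowl 1': 0.5, 'Bowl 2': 0.5}

def update(flavor):
    """One draw: Bayesian update, then eat the cookie (no replacement)."""
    for name, bowl in bowls.items():
        probs[name] *= bowl.likelihood(flavor)
        bowl.remove(flavor)
    total = sum(probs.values())
    for name in probs:
        probs[name] /= total

update('vanilla')
print(probs)
```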

## Optional Exercises

### 1) Think Stats Exercise 5.1

In the BRFSS (see Section 5.4), the distribution of heights is roughly normal with parameters µ = 178 cm and σ = 7.7 cm for men, and µ = 163 cm and σ = 7.3 cm for women.

In order to join Blue Man Group, you have to be male and between 5'10" and 6'1" tall (see their webpage). What percentage of the U.S. male population is in this range? Hint: use scipy.stats.norm.cdf.
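The hint in code form; the only extra step is converting feet and inches to centimeters to match the BRFSS parameters:

```python
from scipy.stats import norm

mu, sigma = 178, 7.7                 # male heights in cm (from above)
low = (5 * 12 + 10) * 2.54           # 5'10" in cm
high = (6 * 12 + 1) * 2.54           # 6'1" in cm

in_range = norm.cdf(high, mu, sigma) - norm.cdf(low, mu, sigma)
print(f'{in_range:.1%}')
```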

### 2) Think Stats Exercise 6.1

The distribution of income is famously skewed to the right. In this exercise, we’ll measure how strong that skew is.

The Current Population Survey (CPS) is a joint effort of the Bureau of Labor Statistics and the Census Bureau to study income and related variables. Data collected in 2013 is available from the Census Bureau's website. I downloaded hinc06.xls, which is an Excel spreadsheet with information about household income, and converted it to hinc06.csv, a CSV file you will find in the repository for this book. You will also find hinc2.py, which reads this file and transforms the data.

The dataset is in the form of a series of income ranges and the number of respondents who fell in each range. The lowest range includes respondents who reported annual household income “Under $5000.” The highest range includes respondents who made “$250,000 or more.”

To estimate the mean and other statistics from these data, we have to make some assumptions about the lower and upper bounds, and how the values are distributed in each range. hinc2.py provides InterpolateSample, which shows one way to model this data. It takes a DataFrame with a column, income, that contains the upper bound of each range, and freq, which contains the number of respondents in each range.

It also takes log_upper, which is an assumed upper bound on the highest range, expressed in log10 dollars. The default value, log_upper=6.0, represents the assumption that the largest income among the respondents is 10^6, or one million dollars.

InterpolateSample generates a pseudo-sample; that is, a sample of household incomes that yields the same number of respondents in each range as the actual data. It assumes that incomes in each range are equally spaced on a log10 scale.

Compute the median, mean, skewness and Pearson’s skewness of the resulting sample. What fraction of households reports a taxable income below the mean? How do the results depend on the assumed upper bound?
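The two skewness statistics are short enough to sketch directly. Remember that InterpolateSample returns incomes in log10 dollars, so convert back before computing; the commented usage lines assume the book's variable names and are illustrative:

```python
import numpy as np

def skewness(xs):
    """Moment-based sample skewness: g1 = m3 / m2**1.5."""
    xs = np.asarray(xs)
    d = xs - xs.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

def pearson_median_skewness(xs):
    """Pearson's median skewness: 3 * (mean - median) / std."""
    xs = np.asarray(xs)
    return 3 * (xs.mean() - np.median(xs)) / xs.std()

# Illustrative usage with the book's code:
# log_sample = hinc2.InterpolateSample(df, log_upper=6.0)
# sample = np.power(10, log_sample)
# print(np.median(sample), sample.mean(),
#       skewness(sample), pearson_median_skewness(sample))
```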

### 3) Think Stats Exercise 8.3

In games like hockey and soccer, the time between goals is roughly exponential. So you could estimate a team’s goal-scoring rate by observing the number of goals they score in a game. This estimation process is a little different from sampling the time between goals, so let’s see how it works.

Write a function that takes a goal-scoring rate, lam, in goals per game, and simulates a game by generating the time between goals until the total time exceeds 1 game, then returns the number of goals scored. Write another function that simulates many games, stores the estimates of lam, then computes their mean error and RMSE.

Is this way of making an estimate biased? Plot the sampling distribution of the estimates and the 90% confidence interval. What is the standard error? What happens to sampling error for increasing values of lam?
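A sketch of the two functions, assuming exponential inter-arrival times and a game length normalized to 1; lam = 4 below is just an illustrative rate:

```python
import numpy as np

def simulate_game(lam, rng):
    """Add Expo(lam) gaps until one full game has elapsed;
    the goal count is the estimate of lam."""
    goals, t = 0, 0.0
    while True:
        t += rng.exponential(scale=1/lam)
        if t > 1:
            return goals
        goals += 1

def estimate_rate(lam=4, iters=10000, seed=None):
    rng = np.random.default_rng(seed)
    estimates = np.array([simulate_game(lam, rng) for _ in range(iters)])
    mean_error = (estimates - lam).mean()
    rmse = np.sqrt(((estimates - lam) ** 2).mean())
    return estimates, mean_error, rmse

estimates, mean_error, rmse = estimate_rate()
print(mean_error, rmse)
```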

### 4) Think Stats Exercise 9.2

In “Testing a Difference in Means” on page 104, we simulated the null hypothesis by permutation; that is, we treated the observed values as if they represented the entire population, and randomly assigned the members of the population to the two groups.

An alternative is to use the sample to estimate the distribution for the population, then draw a random sample from that distribution. This process is called resampling. There are several ways to implement resampling, but one of the simplest is to draw a sample with replacement from the observed values, as in “Power” on page 112.

Write a class named DiffMeansResample that inherits from DiffMeansPermute and overrides RunModel to implement resampling, rather than permutation.

Use this model to test the differences in pregnancy length and birth weight. How much does the model affect the results?
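A sketch of the override, assuming DiffMeansPermute stores the pooled values as self.pool and the group sizes as self.n and self.m, as in the book's hypothesis.py; if your copy differs, adapt the attribute names:

```python
import numpy as np
from hypothesis import DiffMeansPermute   # the book's hypothesis.py

class DiffMeansResample(DiffMeansPermute):
    """Models the null hypothesis by resampling the pooled values
    with replacement instead of permuting them."""
    def RunModel(self):
        group1 = np.random.choice(self.pool, self.n, replace=True)
        group2 = np.random.choice(self.pool, self.m, replace=True)
        return group1, group2
```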

### 5) The Elvis Twin Problem

Elvis Presley had a twin brother who died at birth. What is the probability that Elvis was an identical twin?

To answer this one, you need some background information. According to the Wikipedia article on twins: "Twins are estimated to be approximately 1.9% of the world population, with monozygotic twins making up 0.2% of the total, and 8% of all twins."
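If it helps to see the shape of the update, here is a minimal sketch of one common reading of the problem. Choosing the likelihoods is the real exercise, so convince yourself they are right before trusting the output:

```python
# Prior: among twins, 8% are identical (from the quote above).
prior = {'identical': 0.08, 'fraternal': 0.92}

# Datum: Elvis's twin was a brother. An identical twin of a boy is
# always a brother; a fraternal twin is a brother about half the time.
likelihood = {'identical': 1.0, 'fraternal': 0.5}

unnorm = {h: prior[h] * likelihood[h] for h in prior}
total = sum(unnorm.values())
posterior = {h: p / total for h, p in unnorm.items()}
print(posterior)
```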

### 6) The Locomotive Problem

A railroad numbers its locomotives in order 1..N. One day you see a locomotive with the number 60. Estimate how many locomotives the railroad has.

Hint: Think Bayes Chapter 3 solves this ambiguous-looking problem. It's a pretty cool problem to tackle with a Bayesian approach, so try thinking about it and coming up with an answer before looking at the chapter.
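Once you have tried it yourself, a grid-style sketch like this one can check your reasoning. The uniform prior and the upper bound of 1000 are modeling choices worth experimenting with, not part of the problem statement:

```python
upper = 1000                        # assumed maximum fleet size
prior = {N: 1 / upper for N in range(1, upper + 1)}

def update(probs, observed):
    """Seeing locomotive number `observed` has likelihood 1/N for
    fleets with N >= observed, and likelihood 0 otherwise."""
    post = {N: p * (1 / N if N >= observed else 0.0)
            for N, p in probs.items()}
    total = sum(post.values())
    return {N: p / total for N, p in post.items()}

posterior = update(prior, 60)
posterior_mean = sum(N * p for N, p in posterior.items())
print(posterior_mean)
```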