## Assessing assumptions
- Our $t$-tests assume normality of the variable being tested
- but the Central Limit Theorem says that normality matters less if the sample is "large"
- in practice "approximate normality" is enough, but how do we assess whether what we have is normal enough?
- so far: draw a histogram or boxplot and make a judgment call, allowing for the sample size.
## What actually has to be normal
- the **sampling distribution of the sample mean**
- that is, the distribution of the sample mean over *all possible samples*
- but we only have *one* sample!
- Idea: assume our sample is representative of the population, and draw samples from our sample (!), with replacement.
- This gives an idea of what different samples from the population might look like.
- Called the *bootstrap*, after the expression "to pull yourself up by your own bootstraps".
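A minimal sketch of the idea, using a tiny made-up vector as the "sample" (not the real data, which come next):

```{r}
x <- c(3, 1, 4, 1, 5)           # a tiny made-up sample
set.seed(1)
b <- sample(x, replace = TRUE)  # one bootstrap sample: same size, drawn with replacement
b
```

Every value in `b` comes from `x`, but some values of `x` may appear several times and others not at all.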
## Blue Jays attendances
```{r, echo=FALSE, message=FALSE}
jays <- read_csv("jays15-home.csv")
set.seed(457299)
```
```{r}
jays$attendance
```
- A bootstrap sample:
```{r}
s <- sample(jays$attendance, replace = TRUE)
s
```
## Getting mean of bootstrap sample
- A bootstrap sample is the same size as the original, but typically repeats some values (e.g. 15062) and omits others (e.g. 42917).
- We need the mean of our bootstrap sample:
```{r}
mean(s)
```
- This is a little different from the mean of our actual sample:
```{r}
mean(jays$attendance)
```
- We want a sense of how the sample mean might vary, if we were able to take repeated samples from our population.
- Idea: take lots of *bootstrap* samples, and see how *their* sample means vary.
## Taking lots of bootstrap samples
- This is the same idea as simulating power, using `rowwise`:
- set up dataframe with column `sim` to label the simulations
- generate a bootstrap sample from the data for each `sim`
- work out the mean of each sample
- (then) plot them.
```{r, echo=FALSE, message=FALSE}
set.seed(457299)
```
```{r}
tibble(sim = 1:1000) %>%
  rowwise() %>%
  mutate(boot_sample = list(sample(jays$attendance, replace = TRUE))) %>%
  mutate(mean = mean(boot_sample)) -> boots
```
## The results
```{r}
boots
```
## Are these normal?
```{r}
ggplot(boots, aes(x=mean)) + geom_histogram(bins=10)
```
## Comments
- This is very close to normal
- The bootstrap says that the sampling distribution of the sample mean is close to normal, even though the distribution of the data is not
- A sample size of 25 is big enough to overcome the skewness that we saw
- This is the Central Limit Theorem in practice
- It is surprisingly powerful.
- Thus, the $t$-test is actually perfectly good here.
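One way to see this at work, as a side sketch (made-up skewed data of the same size, $n = 25$, not the Jays attendances): the SD of many bootstrap means should come out close to the usual standard error $s/\sqrt{n}$ that the $t$-test relies on.

```{r}
set.seed(1)
x <- rexp(25)  # made-up right-skewed data, n = 25
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
sd(boot_means)           # spread of the bootstrap sampling distribution
sd(x) / sqrt(length(x))  # classical standard error, for comparison
```

The two numbers agreeing is the bootstrap and the classical theory telling the same story.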
## Two samples
- Assumption: *both* samples are from a normal distribution.
- In practice, each sample needs to be "normal enough" given its sample size, since the Central Limit Theorem will help.
- Use bootstrap on each group independently, as above.
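In compact form (with hypothetical scores; the real reading data follow), bootstrapping two groups independently is just the one-sample recipe applied twice:

```{r}
set.seed(1)
g1 <- rnorm(20, mean = 50, sd = 10)  # hypothetical group 1 scores
g2 <- rnorm(25, mean = 55, sd = 12)  # hypothetical group 2 scores
means1 <- replicate(1000, mean(sample(g1, replace = TRUE)))
means2 <- replicate(1000, mean(sample(g2, replace = TRUE)))
# then histogram means1 and means2 separately to assess normality, as below
```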
## Kids learning to read
```{r, echo=FALSE, message=FALSE}
my_url <- "http://www.utsc.utoronto.ca/~butler/c32/drp.txt"
kids <- read_delim(my_url," ")
```
```{r}
ggplot(kids, aes(x=group, y=score)) + geom_boxplot()
```
## Getting just the control group
```{r}
kids %>% filter(group=="c") -> controls
controls
```
## Bootstrap these
```{r}
tibble(sim = 1:1000) %>%
  rowwise() %>%
  mutate(boot = list(sample(controls$score, replace = TRUE))) %>%
  mutate(mean = mean(boot)) -> boots
```
## Plot
```{r}
ggplot(boots, aes(x = mean)) + geom_histogram(bins=10)
```
## ... and the treatment group:
```{r}
kids %>% filter(group=="t") -> treats
tibble(sim = 1:1000) %>%
  rowwise() %>%
  mutate(boot = list(sample(treats$score, replace = TRUE))) %>%
  mutate(mean = mean(boot)) -> boots
```
## Histogram
```{r}
ggplot(boots, aes(x = mean)) + geom_histogram(bins = 10)
```
## Comments
- the sampling distributions of the sample means both look pretty normal
- as we thought, there are no problems with our two-sample $t$-test at all.