11-Simulation.qmd

---
author: 
  - Matthew J. C. Crump
aliases: [simulating-data.html]
---

```{r, include = FALSE}
source("global_stuff.R")
```

# Simulating Data

```{r}
library(ggplot2)
library(dplyr)
```

You may have noticed that throughout this book so far we have analyzed a lot of fake data. We used R to simulate pretend numbers, and then we analyzed those numbers. We also, from time to time, loaded in some "real" data, and analyzed that. In your labs each week, you have been analyzing a lot of real data. You might be thinking that the simulations we ran were just for educational purposes, to show you how things work. That's partly true, that's one reason we ran so many simulations. At the same time, conducting simulations to understand how data behaves is a legitimate branch of statistics. There are some problems out there where we don't have really good analytic math formulas to tell us the correct answer, so we create and run simulations to approximate the answer.

I'm going to say something mildy controversial right now: If you can't simulate your data, then you probably don't really understand your data or how to analyze it. Perhaps, this is too bold of a statement. There are many researchers out there who have never simulated their data, and it might be too much too claim that they don't really understand their data because they didn't simulate. Perhaps. There are also many students who have taken statistics classes, and learned how to press some buttons, or copy some code, to analyze some real data; but, who never learned how to run simulations. Perhaps my statement applies more to those students, who I believe would benefit greatly from learning some simulation tricks.

## Reasons to simulate

There are many good reasons to learn simulation techniques, here are some:

1.  You force yourself to consider the details of your design, how many subjects, how many conditions, how many observations per condition per subject, and how you will store and represent the data to describe all of these details when you run the experiment

2.  You force yourself to consider the kinds of numbers you will be collecting. Specifically, the distributional properties of those numbers. You will have to make decisions about the distributions that you sample from in your simulation, and thinking about this issue helps you better understand your own data when you get it.

3.  You learn a bit of computer programming, and this is a very useful general skill that you can build upon to do many things.

4.  You can make reasonable and informed assumptions about how your experiment might turn out, and then use the results of your simulation to choose parameters for your design (such as number of subjects, number of observations per condition and subject) that will improve the sensitivity of your design to detect the effects you are interested in measuring.

5.  You can even run simulations on the data that you collect to learn more about how it behaves, and to do other kinds of advanced statistics that we don't discuss in this book.

6.  You get to improve your intuitions about how data behaves when you measure it. You can test your intuitions by running simulations, and you can learn things you didn't know to begin with. Simulations can be highly informative.

7.  When you simulate data in advance of collecting real data, you can work out exactly what kinds of tests you are planning to perform, and you will have already written your analysis code, so it will be ready and waiting for you as soon as you collect the data

OK, so that's just a few reasons why simulations are useful.

## Simulation overview

The basic idea here is actually pretty simple. You make some assumptions about how many subjects will be in your design (set N), you make some assumptions about the distributions that you will be sampling your scores from, then you use R to fabricate fake data according to the parameters you set. Once you build some simulated data, you can conduct a statistical analysis that you would be planning to run on the real data. Then you can see what happens. More importantly, you can repeat the above process many times. This is similar to conducting a replication of your experiment to see if you find the same thing, only you make the computer replicate your simulation 1000s of times. This way you can see how your simulated experiment would turn out over the long run. For example, you might find that the experiment you are planning to run will only produce a "signficant" result 25% of the time, that's not very good. Your simulation might also tell you that if you increase your N by say 25, that could really help, and your new experiment with N=25 might succeed 90% of the time. That's information worth knowing.

Before we go into more simulation details, let's just run a quick one. We'll do an independent samples $t$-test. Imagine we have a study with N=10 in each group. There are two groups. We are measuring heart rate. Let's say we know that heart rate is on average 100 beats per minute with a standard deviation of 7. We are going to measure heart rate in condition A where nothing happens, and we are going to measure heart rate in condition B while they watch a scary movie. We think the scary movie might increase heart rate by 5 beats per minute. Let's run a simulation of this:

```{r, echo=TRUE}
group_A <- rnorm(10,100,7)
group_B <- rnorm(10,105, 7)
t.test(group_A,group_B,var.equal = TRUE)

```

We sampled 10 scores from a normal distribution for each group. We changed the mean for group_b to 105, because we were thinking their heart rate would be 5 more than group A. We ran one $t$-test, and we got a result. This result tells us what happens for this one simulation.

We could learn more by repeating the simulation 1000 times, saving the $p$-values from each replication, and then finding out how many of our 1000 simulated experiments give us a significant result:

```{r, echo=TRUE}
save_ps<-length(1000)
for(i in 1:1000){
  group_A <- rnorm(10,100,7)
  group_B <- rnorm(10,105, 7)
  t_results <- t.test(group_A,group_B,var.equal = TRUE)
  save_ps[i] <- t_results$p.value
}

prop_p<-length(save_ps[save_ps<0.05])/1000
print(prop_p)

```

Now this is more interesting. We found that `r prop_p*100`% of simulated experiments had a $p$-value less than 0.05. That's not very good. If you were going to collect data in this kind of experiment, and you made the correct assumptions about the mean and standard deviation of the distribution, and you made the correct assumption about the size of difference between the groups, you would be planning to run an experiment that would not work-out most of the time.

What happens if we increase the number of subject to 50 in each group?

```{r, echo=TRUE}
save_ps<-length(1000)
for(i in 1:1000){
  group_A <- rnorm(50,100,7)
  group_B <- rnorm(50,105, 7)
  t_results <- t.test(group_A,group_B,var.equal = TRUE)
  save_ps[i] <- t_results$p.value
}

prop_p<-length(save_ps[save_ps<0.05])/1000
print(prop_p)

```

Ooh, look, almost all of the experiments are significant now. So, it would be better to use 50 subjects per group than 10 per group according to this simulation.

Of course, you might already be wondering so many different kinds of things. How can we plausibly know the parameters for the distribution we are sampling from? Isn't this all just guess work? We'll discuss some of these issues as we move forward in this chapter.

````{=html}
<!--
## A general model

It's probably time we introduce you to another new concept. This concept has been underyling everything we have been doing so far. It's called the general linear model, or GLM for short. Don't worry it's not that scary. And, right now, we won't dive off into the deep end. Instead, we will do a little bit of light GLM, for beginners. The purpose is to give you some idea where the main effects and interaction come from, and when they are there, how they influence the pattern in the means.

The GLM makes you think about the various things that influence each mean you measure. It's the idea that any mean is the sum of it's parts: a little bit of this, a little bit of that. We could write a GLM formula for most any kind of design. However, we will start with a 2x2 design. After all, we jsut spent two chapters talking about 2x2 designs, let's look at the general formula idea for a 2x2.


$\text{A mean} = \text{Grand Mean} + \text{IV1 effect} + \text{IV2 effect} + \text{Interaction effect} + \text{error}$

Let's say you ran an experiment with two IVs, but they did absolutely nothing, and there was no interaction. What would you expect your means to be in each of the four conditions. You would still measure a bunch of scores, but they should all be the same right? If so, then the grand mean of all the scores should be your mean for all of the conditions. The formula would look like this:

$\text{Each mean} = \text{Grand Mean} + 0 + 0 + 0 + \text{error}$

Remember there will still be some sampling error, so each mean will be the grand mean, plus or minus some amount of random sampling error.

How about if IV1 has an effect of +10. That is, the means in level 1 of IV1 are +10 more than the means of level 2 of IV1. Let's say that is the only thing going on. Then we would expect the following:

$\text{IV1: Level 1 mean} = \text{Grand Mean} + 10 + 0 + 0 + \text{error}$

$\text{IV1: Level 2 mean} = \text{Grand Mean} + 0 + 0 + 0 + \text{error}$

Or, we could write it this way to make the effect of IV1 balanced about the Grand mean:

$\text{IV1: Level 1 mean} = \text{Grand Mean} + 5 + 0 + 0 + \text{error}$

$\text{IV1: Level 2 mean} = \text{Grand Mean} + -5 + 0 + 0 + \text{error}$

Alright, the general idea here is that each individual mean can be described in terms of sum of it's parts, where the parts include the grand mean, and deviations from the grand mean.

Let's look at some numbers. We'll imagine IV1 has two levels, A and B, and IV2 has two levels 1 and 2. The table below reports the means in each condition. It also shows you how each mean is the sum of the grand mean, and the means for main effects and interaction

```{r}

grand_m   <-c(5,5,5,5)
IV1_m     <-c(-2,-2,2,2)
IV2_m     <-c(3,-3,3,-3)
IV1xIV2_m <-c(0,0,0,0)
IV1       <-c("A","A","B","B")
IV2       <-as.factor(c(1,2,1,2))
means     <-grand_m+IV1_m+IV2_m+IV1xIV2_m

df  <- data.frame(IV1, IV2, means,
                  grand_m, IV1_m, IV2_m,IV1xIV2_m)

ggplot(df, aes(x=IV1, y=means, group=IV2, color=IV2))+
  geom_point()+
  geom_line()+
  theme_classic()

knitr::kable(df)

```


Great, we get to look at the data in two ways. First, look at the table. The four means are listed in the means column. Each number is the sum of the grand mean, the IV1 mean, the IV2 mean, and the in the interaction mean. So, for example, $6 = 5 + (-2) + 3 + 0$, and $0 = 5 + (-2) + (-3) + 0$. The other rows add up to 10 and 4.

We can also work in reverse. What is the grand mean of our scores?

$(6+0+10+4)/4 =20/4 = 5$, that's the same number as listed in the `grand_m` column. 


Next, you can see from the graph that there is a main effect for IV1, both of the points for group A are lower than the points for group B. How big is this main effect? We should be able to figure this out from the table. The table has -2 and +2 for the `IV1_m`. These numbers represent deviations from the grand mean, caused by the IV1 manipulation. The difference between -2 and 2 is 4, so the size of the main effect should be 4. If this is true, we should find that the difference between the means for Group A and B is also 4.

- Group A mean = (6+0)/2 = 3

- Group B mean = (10+4)/2 = 7

Well, 7-3 = 4, which is the difference between the means for IV1. We could do the same thing for IV2, which also shows a main effect. We can see that the main effect must be a difference of 6, that the difference between -3 and 3 in the table.

Finally, notice that the interaction column at the end is all 0s. As a result, there is no interaction. You can see this in the graph. The lines are parallel, no interaction. 

### What are we doing here?

When you make predictions for how your data might turn out, one way to do it is just to write down what you think the mean would be for each condition. For example, I could have predicted that the means for the above would be 6, 0, 10, and 4, for each of the four conditions.

What we are doing with the GLM approach is making these same predictions, but doing them in terms of the breakdown of the effects we are analzying for later in the ANOVA. Consider the steps like this:

1. What is your global prediction for all the data, ignoring all of the conditions in your experiment? This is a prediction about the grand mean that you will get. Start your GLM by predicting the grand mean.

2. What is your prediction for the effect of the first IV? This prediction is really that the mean for each group will be different from the grand mean. The grand mean will always be in the center of the data, so your predictions for the main effect of IV1 must deviate from the center. If I want to predict a total difference of 4, then half of th

-->
````

## Simulating t-tests

We've already seen some code for simulating a $t$-test 1000 times, saving the $p$-values, and then calculating the proportion of simulations that are significant (p\<0.05). It looked like this:

```{r, echo=TRUE}

save_ps<-length(1000)
for(i in 1:1000){
  group_A <- rnorm(50,100,7)
  group_B <- rnorm(50,105, 7)
  t_results <- t.test(group_A,group_B,var.equal = TRUE)
  save_ps[i] <- t_results$p.value
}

prop_p<-length(save_ps[save_ps<0.05])/1000
print(prop_p)

```

You could play around with that, and it would be very useful I think. Is there anything else that we can do that would be more useful? Sure there is. With the above simulation, you have to change N or the mean difference each time to see how proportion of significant experiments turns out. It would be nice to look at a graph where we could vary the number of subjects, and the size of the mean difference. That's what the next simulation does. This kind of simulation can make your computer do some hard work depening on how many simulations you run. To make my computer do less work, we will only run 100 simulations for each parameter. But, what we will do is vary the number of subjects from 10 to 50 (steps of 10), and vary the size of the effect from 0 to 20 in steps of 4.

```{r}
#| label: fig-11simt
#| fig-cap: "A simulation showing the proportion of significant $t$ tests as a function of the programmed mean difference and number of subjects."


num_sims        <-500
N               <-c(10,20,30,40,50)
mean_difference <-c(0,4,8,12,16,20)
save_ps<-length(num_sims)

all_df<-data.frame()
for(diff in mean_difference){
  for (j in N){
    for(i in 1:num_sims){
      group_A <- rnorm(j,100,7)
      group_B <- rnorm(j,100+diff, 7)
      t_results <- t.test(group_A,group_B,var.equal = TRUE)
      save_ps[i] <- t_results$p.value
    }
    sim_df <-data.frame(save_ps,
                        num_subjects=as.factor(rep(j,num_sims)),
                        mean_diff =rep(diff,num_sims))
    all_df <- rbind(all_df,sim_df)
  }
}  

plot_df <- all_df %>%
            dplyr::group_by(num_subjects,mean_diff) %>%
            dplyr::summarise(proportion_sig = length(save_ps[save_ps<0.05])/num_sims)


ggplot(plot_df, aes(x=mean_diff, 
                    y=proportion_sig, 
                    group=num_subjects, 
                    color=num_subjects))+
  geom_point()+
  geom_line()+
  theme_classic()

```

A graph like @fig-11simt is very helpful to look at. Generally, before we run an experiment we might not have a very good idea of the size of the effect that our manipulation might cause. Will it be a mean difference of 0 (no effect), or 5, or 10, or 20? If you are doing something new it's hard to know. You would know in general that bigger effects are easier to detect. You would be able to detect smaller and smaller effects if you ran more and more subjects. When you run this kind of simulation, you can vary the possible mean differences and the number of the subjects at the same time and then see what happens.

When the mean difference is 0, we should get an average of 5%, or (0.05 proportion) experiments being significant. This is what we expect by chance, and it doesn't matter how many subjects we run. When there is no difference, we will reject the null 5% of the time (these would all be type 1 errors).

How about when there is a difference of 4? This a pretty small effect. If we only run 10 subjects in each group, we can see that less than 25% of simulated experiments would show significant results. If we wanted a higher chance of success to measure an effect of this size, then we should go up to 40-50 subjects, that would get us around 75% success rates. If that's not good enough for you (25% failures remember, that's still alot), then re-run the simulation with even more subjects.

Another thing worth pointing out is that if the mean difference is bigger than about 12.5, you can see that all of the designs produce significant outcomes nearly 100% of the time. If you knew this, perhaps you would simply run 10-20 subjects in your experiment, rather than 50. After all, 10-20 is just fine for detecting the effect, and 50 subjects might be a waste of resources (both yours and your participants).

## Simulating one-factor ANOVAs

The following builds simulated data for a one-factor ANOVA, appropriate for a between subjects design. We build the data frame containg a column for the group factor levels, and a column for the DV. Then, we run the ANOVA an print it out.

```{r, echo=TRUE}
N <- 10
groups <- rep(c("A","B","C"), each=10)
DV <- c(rnorm(100,10,15),   # means for group A
        rnorm(100,10,15),   # means for group B
        rnorm(100,20,15)    # means for group C
        )
sim_df<-data.frame(groups,DV)

aov_results <- summary(aov(DV~groups, sim_df))

library(xtable)
knitr::kable(xtable(aov_results))

```

In this next example, we simulate the same design 100 times, save the $p$-values, and the determine the proportion of significant simulations.

```{r, echo=TRUE}

N <- 10

save_p<-length(100)
for(i in 1:100){
  groups <- rep(c("A","B","C"), each=10)
  DV <- c(rnorm(100,10,15),   # means for group A
          rnorm(100,10,15),   # means for group B
          rnorm(100,20,15)    # means for group C
          )
  sim_df<-data.frame(groups,DV)
  
  aov_results <- summary(aov(DV~groups, sim_df))
  save_p[i]<-aov_results[[1]]$`Pr(>F)`[1]
}

length(save_p[save_p<0.05])/100

```

## Other resources

OK, It's a tuesday, the summer is almost over. I've spent most of this summer (2018) writing this textbook, because we are using it this Fall 2018. Because I am running out of time, I need to finish this and make sure everything is in place for the course to work. As a result, I am not going to finish this chapter right now. The nice thing about this book, is that I (and other people) can fill things in over time. We have shown a few examples of data-simulation, so that's at least something.

If you want to see more examples, I suggest you check out this chapter:

<https://crumplab.github.io/programmingforpsych/simulating-and-analyzing-data-in-r.html#simulating-data-for-multi-factor-designs>

This section will get longer as I find more resources to add, and hopefully the entire chapter will get longer as I add in more examples over time.