/
06-ttests.qmd
1260 lines (868 loc) · 79.4 KB
/
06-ttests.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
author:
- Matthew J. C. Crump
aliases: [t-tests.html]
---
```{r, include = FALSE}
source("global_stuff.R")
```
# t-tests
```{r, echo=FALSE}
library(ggplot2)
```
Back in the day, William Sealy Gosset got a job working for Guinness Breweries. They make the famous Irish stout called Guinness. What happens next went something like this (total fabrication, but mostly on point).
Guinness wanted all of their beers to be the best beers. No mistakes, no bad beers. They wanted to improve their quality control so that when Guinness was poured anywhere in the world, it would always comes out fantastic: 5 stars out of 5 every time, the best.
Guinness had some beer tasters, who were super-experts. Every time they tasted a Guinness from the factory that wasn't 5 out of 5, they knew right away.
But, Guinness had a big problem. They would make a keg of beer, and they would want to know if every single pint that would come out would be a 5 out of 5. So, the beer tasters drank pint after pint out of the keg, until it was gone. Some kegs were all 5 out of 5s. Some weren't, Guinness needed to fix that. But, the biggest problem was that, after the testing, there **was no beer left to sell**, the testers drank it all (remember I'm making this part up to illustrate a point, they probably still had beer left to sell).
Guinness had a sampling and population problem. They wanted to know that the entire population of the beers they made were all 5 out of 5 stars. But, if they sampled the entire population, they would drink all of their beer, and wouldn't have any left to sell.
Enter William Sealy Gosset. Gosset figured out the solution to the problem. He asked questions like this:
1. How many samples do I need to take to know the whole population is 5 out of 5?
2. What's the fewest amount of samples I need to take to know the above, that would mean Guinness could test fewer beers for quality, sell more beers for profit, and make the product testing time shorter.
Gosset solved those questions, and he invented something called the *Student's t-test*. Gosset was working for Guinness, and could be fired for releasing trade-secrets that he invented (the t-test). But, Gosset published the work anyways, under a pseudonym [@Student1908]. He called himself Student, hence Student's t-test. Now you know the rest of the story.
It turns out this was a very nice thing for Gosset to have done. t-tests are used all the time, and they are useful, that's why they are used. In this chapter we learn how they work.
You'll be surprised to learn that what we've already talked about, (the Crump Test, and the Randomization Test), are both very very similar to the t-test. So, in general, you have already been thinking about the things you need to think about to understand t-tests. You're probably wondering what is this $t$, what does $t$ mean? We will tell you. Before we tell what it means, we first tell you about one more idea.
## Check your confidence in your mean
We've talked about getting a sample of data. We know we can find the mean, we know we can find the standard deviation. We know we can look at the data in a histogram. These are all useful things to do for us to learn something about the properties of our data.
You might be thinking of the mean and standard deviation as very different things that we would not put together. The mean is about central tendency (where most of the data is), and the standard deviation is about variance (where most of the data isn't). Yes, they are different things, but we can use them together to create useful new things.
What if I told you my sample mean was 50, and I told you nothing else about my sample. Would you be confident that most of the numbers were near 50? Would you wonder if there was a lot of variability in the sample, and many of the numbers were very different from 50. You should wonder all of those things. The mean alone, just by itself, doesn't tell you anything about well the mean represents all of the numbers in the sample.
It could be a representative number, when the standard deviation is very small, and all the numbers are close to 50. It could be a non-representative number, when the standard deviation is large, and many of the numbers are not near 50. You need to know the standard deviation in order to be confident in how well the mean represents the data.
How can we put the mean and the standard deviation together, to give us a new number that tells us about confidence in the mean?
We can do this using a ratio:
$\frac{mean}{\text{standard deviation}}$
Think about what happens here. We are dividing a number by a number. Look at what happens:
$\frac{number}{\text{same number}} = 1$
$\frac{number}{\text{smaller number}} = \text{big number}$
compared to:
$\frac{number}{\text{bigger number}} = \text{smaller number}$
Imagine we have a mean of 50, and a truly small standard deviation of 1. What do we get with our formula?
$\frac{50}{1} = 50$
Imagine we have a mean of 50, and a big standard deviation of 100. What do we get with our formula?
$\frac{50}{100} = 0.5$
Notice, when we have a mean paired with a small standard deviation, our formula gives us a big number, like 50. When we have a mean paired with a large standard deviation, our formula gives us a small number, like 0.5. These numbers can tell us something about confidence in our mean, in a general way. We can be 50 confident in our mean in the first case, and only 0.5 (not at a lot) confident in the second case.
What did we do here? We created a descriptive statistic by dividing the mean by the standard deviation. And, we have a sense of how to interpret this number, when it's big we're more confident that the mean represents all of the numbers, when it's small we are less confident. This is a useful kind of number, a ratio between what we think about our sample (the mean), and the variability in our sample (the standard deviation). Get used to this idea. Almost everything that follows in this textbook is based on this kind of ratio. We will see that our ratio turns into different kinds of "statistics", and the ratios will look like this in general:
$\text{name of statistic} = \frac{\text{measure of what we know}}{\text{measure of what we don't know}}$
or, to say it using different words:
$\text{name of statistic} = \frac{\text{measure of effect}}{\text{measure of error}}$
In fact, this is the general formula for the t-test. Big surprise!
## One-sample t-test: A new t-test
Now we are ready to talk about t-test. We will talk about three of them. We start with the one-sample t-test.
Commonly, the one-sample t-test is used to estimate the chances that your sample came from a particular population. Specifically, you might want to know whether the mean that you found from your sample, could have come from a particular population having a particular mean.
Straight away, the one-sample t-test becomes a little confusing (and I haven't even described it yet). Officially, it uses known parameters from the population, like the mean of the population and the standard deviation of the population. However, most times you don't know those parameters of the population! So, you have to estimate them from your sample. Remember from the chapters on descriptive statistics and sampling, our sample mean is an unbiased estimate of the population mean. And, our sample standard deviation (the one where we divide by n-1) is an unbiased estimate of the population standard deviation. When Gosset developed the t-test, he recognized that he could use these estimates from his samples, to make the t-test. Here is the formula for the one sample t-test, we first use words, and then become more specific:
### Formulas for one-sample t-test
$\text{name of statistic} = \frac{\text{measure of effect}}{\text{measure of error}}$
$\text{t} = \frac{\text{measure of effect}}{\text{measure of error}}$
$\text{t} = \frac{\text{Mean difference}}{\text{standard error}}$
$\text{t} = \frac{\bar{X}-u}{S_{\bar{X}}}$
$\text{t} = \frac{\text{Sample Mean - Population Mean}}{\text{Sample Standard Error}}$
$\text{Estimated Standard Error} = \text{Standard Error of Sample} = \frac{s}{\sqrt{N}}$
Where, s is the sample standard deviation.
Some of you may have gone cross-eyed looking at all of this. Remember, we've seen it before when we divided our mean by the standard deviation in the first bit. The t-test is just a measure of a sample mean, divided by the standard error of the sample mean. That is it.
### What does t represent?
$t$ gives us a measure of confidence, just like our previous ratio for dividing the mean by a standard deviations. The only difference with $t$, is that we divide by the standard error of mean (remember, this is also a standard deviation, it is the standard deviation of the sampling distribution of the mean)
::: callout-note
What does the t in t-test stand for? Apparently nothing. Gosset originally labelled it z. And, Fisher later called it t, perhaps because t comes after s, which is often used for the sample standard deviation.
:::
$t$ is a property of the data that you collect. You compute it with a sample mean, and a sample standard error (there's one more thing in the one-sample formula, the population mean, which we get to in a moment). This is why we call $t$, a sample-statistic. It's a statistic we compute from the sample.
What kinds of numbers should we expect to find for these $ts$? How could we figure that out?
Let's start small and work through some examples. Imagine your sample mean is 5. You want to know if it came from a population that also has a mean of 5. In this case, what would $t$ be? It would be zero: we first subtract the sample mean from the population mean, $5-5=0$. Because the numerator is 0, $t$ will be zero. So, $t$ = 0, occurs, when there is no difference.
Let's say you take another sample, do you think the mean will be 5 every time, probably not. Let's say the mean is 6. So, what can $t$ be here? It will be a positive number, because 6-5= +1. But, will $t$ be +1? That depends on the standard error of the sample. If the standard error of the sample is 1, then $t$ could be 1, because 1/1 = 1.
If the sample standard error is smaller than 1, what happens to $t$? It get's bigger right? For example, 1 divided by 0.5 = 2. If the sample standard error was 0.5, $t$ would be 2. And, what could we do with this information? Well, it be like a measure of confidence. As $t$ get's bigger we could be more confident in the mean difference we are measuring.
Can $t$ be smaller than 1? Sure, it can. If the sample standard error is big, say like 2, then $t$ will be smaller than one (in our case), e.g., 1/2 = .5. The direction of the difference between the sample mean and population mean, can also make the $t$ become negative. What if our sample mean was 4. Well, then $t$ will be negative, because the mean difference in the numerator will be negative, and the number in the bottom (denominator) will always be positive (remember why, it's the standard error, computed from the sample standard deviation, which is always positive because of the squaring that we did.).
So, that is some intuitions about what the kinds of values t can take. $t$ can be positive or negative, and big or small.
Let's do one more thing to build our intuitions about what $t$ can look like. How about we sample some numbers and then measure the sample mean **and** the standard error of the mean, and then plot those two things against each each. This will show us how a sample mean typically varies with respect to the standard error of the mean.
In @fig-6sampleMSEM, I pulled 1,000 samples of $N = 10$ from a normal distribution (mean = 0, sd = 1). Each time I measured the mean and standard error of the sample. That gave two descriptive statistics for each sample, letting us plot each sample as dot in a scatter plot.
```{r}
#| label: fig-6sampleMSEM
#| fig-cap: "A scatter plot with sample mean on the x-axis, and standard error of the mean on the y-axis"
sample_mean <- length(1000)
sample_se <- length(1000)
for (i in 1:1000) {
s <- rnorm(10, 0, 1)
sample_mean[i] <- mean(s)
sample_se[i] <- sd(s) / sqrt(length(s))
}
plot(sample_mean, sample_se)
```
What we get is a cloud of dots. You might notice the cloud has a circular quality. There's more dots in the middle, and fewer dots as they radiate out from the middle. The dot cloud shows us the general range of the sample mean, for example most of the dots are in between -1 and 1. Similarly, the range for the sample standard error is roughly between .2 and .5. Remember, each dot represents one sample.
We can look at the same data a different way. For example, rather than using a scatter plot, we can divide the mean for each dot by the standard error for each dot. @fig-6thisexample shows the result in a histogram.
```{r}
#| label: fig-6thisexample
#| fig-cap: "A histogram of the sample means divided by the sample standard errors, this is a t-distribution."
hist(sample_mean/sample_se, breaks=30)
```
Interesting, we can see the histogram is shaped like a normal curve. It is centered on 0, which is the most common value. As values become more extreme, they become less common. If you remember, our formula for $t$, was the mean divided by the standard error of the mean. That's what we did here. This histogram is showing you a $t$-distribution.
### Calculating t from data
Let's briefly calculate a t-value from a small sample. Let's say we had 10 students do a true/false quiz with 5 questions on it. There's a 50% chance of getting each answer correct.
Every student completes the 5 questions, we grade them, and then we find their performance (mean percent correct). What we want to know is whether the students were guessing. If they were all guessing, then the sample mean should be about 50%, it shouldn't be different from chance, which is 50%. Let's look at @tbl-tsmall.
```{r, echo=FALSE}
#| label: tbl-tsmall
#| tbl-cap: "Calculating the t-value for a one-sample test."
library(dplyr)
students <- 1:10
scores <- c(50, 70, 60, 40, 80, 30, 90, 60, 70, 60)
mean_scores <- mean(scores)
Difference_from_Mean <- scores - mean_scores
Squared_Deviations <- Difference_from_Mean ^ 2
the_df <- data.frame(students,
scores,
mean = rep(mean_scores, 10),
Difference_from_Mean,
Squared_Deviations)
the_df <- the_df %>%
rbind(c("Sums", colSums(the_df[1:10, 2:5]))) %>%
rbind(c("Means", colMeans(the_df[1:10, 2:5]))) %>%
rbind(c(" ", " ", " ", "sd ", round(sd(the_df[1:10, 2]), digits = 2))) %>%
rbind(c(" ", " ", " ", "SEM ", round(sd(the_df[1:10, 2]) / sqrt(10), digits =
2))) %>%
rbind(c(" ", " ", " ", "t", (61 - 50) / round(sd(the_df[1:10, 2]) / sqrt(10), digits =
2)))
knitr::kable(the_df)
```
You can see the `scores` column has all of the test scores for each of the 10 students. We did the things we need to do to compute the standard deviation.
Remember the sample standard deviation is the square root of the sample variance, or:
$\text{sample standard deviation} = \sqrt{\frac{\sum_{i}^{n}({x_{i}-\bar{x})^2}}{N-1}}$
$\text{sd} = \sqrt{\frac{2890}{10-1}} = 17.92$
The standard error of the mean, is the standard deviation divided by the square root of N
$\text{SEM} = \frac{s}{\sqrt{N}} = \frac{17.92}{10} = 5.67$
$t$ is the difference between our sample mean (61), and our population mean (50, assuming chance), divided by the standard error of the mean.
$\text{t} = \frac{\bar{X}-u}{S_{\bar{X}}} = \frac{\bar{X}-u}{SEM} = \frac{61-50}{5.67} = 1.94$
And, that is you how calculate $t$, by hand. It's a pain. I was annoyed doing it this way. In the lab, you learn how to calculate $t$ using software, so it will just spit out $t$. For example in R, all you have to do is this:
```{r, echo=TRUE}
t.test(scores, mu=50)
```
### How does t behave?
If $t$ is just a number that we can compute from our sample (it is), what can we do with it? How can we use $t$ for statistical inference?
Remember back to the chapter on sampling and distributions, that's where we discussed the sampling distribution of the sample mean. Remember, we made a lot of samples, then computed the mean for each sample, then we plotted a histogram of the sample means. Later, in that same section, we mentioned that we could generate sampling distributions for any statistic. For each sample, we could compute the mean, the standard deviation, the standard error, and now even $t$, if we wanted to. We could generate 10,000 samples, and draw four histograms, one for each sampling distribution for each statistic.
This is exactly what I did, and the results are shown in the four panels of @fig-6sampdists below. I used a sample size of 20, and drew random observations for each sample from a normal distribution, with mean = 0, and standard deviation = 1. Let's look at the sampling distributions for each of the statistics. $t$ was computed assuming with the population mean assumed to be 0.
```{r}
#| label: fig-6sampdists
#| fig-cap: "Sampling distributions for the mean, standard deviation, standard error of the mean, and $t$."
#| out-width: "100%"
all_df <- data.frame()
for (i in 1:10000) {
sample <- rnorm(20, 0, 1)
sample_mean <- mean(sample)
sample_sd <- sd(sample)
sample_se <- sd(sample) / sqrt(length(sample))
sample_t <- as.numeric(t.test(sample, mu = 0)$statistic)
t_df <- data.frame(i, sample_mean, sample_sd, sample_se, sample_t)
all_df <- rbind(all_df, t_df)
}
library(ggpubr)
a <- ggplot(all_df, aes(x = sample_mean)) +
geom_histogram(color = "white") +
theme_classic(base_size=20)
b <- ggplot(all_df, aes(x = sample_sd)) +
geom_histogram(color = "white") +
theme_classic(base_size=20)
c <- ggplot(all_df, aes(x = sample_se)) +
geom_histogram(color = "white") +
theme_classic(base_size=20)
d <- ggplot(all_df, aes(x = sample_t)) +
geom_histogram(color = "white") +
theme_classic(base_size=20)
ggarrange(a, b, c, d,
ncol = 2, nrow = 2)
```
We see four sampling distributions. This is how statistical summaries of these summaries behave. We have used the word chance windows before. These are four chance windows, measuring different aspects of the sample. In this case, all of the samples came from the same normal distribution. Because of sampling error, each sample is not identical. The means are not identical, the standard deviations are not identical, sample standard error of the means are not identical, and the $t$s of the samples are not identical. They all have some variation, as shown by the histograms. This is how samples of size 20 behave.
We can see straight away, that in this case, we are unlikely to get a sample mean of 2. That's way outside the window. The range for the sampling distribution of the mean is around -.5 to +.5, and is centered on 0 (the population mean, would you believe!).
We are unlikely to get sample standard deviations of between .6 and 1.5, that is a different range, specific to the sample standard deviation.
Same thing with the sample standard error of the mean, the range here is even smaller, mostly between .1, and .3. You would rarely find a sample with a standard error of the mean greater than .3. Virtually never would you find one of say 1 (for this situation).
Now, look at $t$. It's range is basically between -3 and +3 here. 3s barely happen at all. You pretty much never see a 5 or -5 in this situation.
All of these sampling windows are chance windows, and they can all be used in the same way as we have used similar sampling distributions before (e.g., Crump Test, and Randomization Test) for statistical inference. For all of them we would follow the same process:
1. Generate these distributions
2. Look at your sample statistics for the data you have (mean, SD, SEM, and $t$)
3. Find the likelihood of obtaining that value or greater
4. Obtain that probability
5. See if you think your sample statistics were probable or improbable.
We'll formalize this in a second. I just want you to know that what you will be doing is something that you have already done before. For example, in the Crump test and the Randomization test we focused on the distribution of mean differences. We could do that again here, but instead, we will focus on the distribution of $t$ values. We then apply the same kinds of decision rules to the $t$ distribution, as we did for the other distributions. Below you will see a graph you have already seen, except this time it is a distribution of $t$s, not mean differences:
Remember, if we obtained a single $t$ from one sample we collected, we could consult the chance window in @fig-6tdecision below to find out whether the $t$ we obtained from the sample was likely or unlikely to occur by chance.
```{r}
#| label: fig-6tdecision
#| fig-cap: "Applying decision criteria to the $t$-distribution. Histogram of $t$s from samples (n=20) drawn from the same normal distribution (u=0, sd=1)"
sample_t <- all_df$sample_t
ggplot(all_df, aes(x = sample_t)) +
annotate(
"rect",
xmin = min(sample_t),
xmax = max(sample_t),
ymin = 0,
ymax = Inf,
alpha = 0.5,
fill = "red"
) +
annotate(
"rect",
xmin = min(sample_t),
xmax = -1.94,
ymin = 0,
ymax = Inf,
alpha = 0.7,
fill = "light grey"
) +
annotate(
"rect",
xmin = 1.94,
xmax = max(sample_t),
ymin = 0,
ymax = Inf,
alpha = 0.7,
fill = "light grey"
) +
geom_rect(aes(
xmin = -Inf,
xmax = min(sample_t),
ymin = 0,
ymax = Inf
),
alpha = .5,
fill = "lightgreen") +
geom_rect(aes(
xmin = max(sample_t),
xmax = Inf,
ymin = 0,
ymax = Inf
),
alpha = .5,
fill = "lightgreen") +
geom_histogram(bins = 50, color = "white") +
theme_classic() +
geom_vline(xintercept = min(sample_t)) +
geom_vline(xintercept = max(sample_t)) +
geom_vline(xintercept = -1.94) +
geom_vline(xintercept = 1.94) +
xlim(-8, 8) +
geom_label(data = data.frame(x = 0, y = 250, label = "CHANCE"), aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(x = -7, y = 250, label = "NOT \n CHANCE"),
aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(x = 7, y = 250, label = "NOT \n CHANCE"),
aes(x = x, y = y, label = label)) +
# geom_label(data = data.frame(x = min(sample_t), y = 600,
# label = paste0("min \n",round(min(sample_t)))),
# aes(x = x, y = y, label = label))+
#geom_label(data = data.frame(x = max(sample_t), y = 600,
# label = paste0("max \n",round(max(sample_t)))),
# aes(x = x, y = y, label = label))+
geom_label(data = data.frame(x = -4, y = 250,
label = "?"),
aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(x = 4, y = 250,
label = "?"),
aes(x = x, y = y, label = label)) +
xlab("mean sample_t")
```
### Making a decision
From our early example involving the TRUE/FALSE quizzes, we are now ready to make some kind of decision about what happened there. We found a mean difference of `r mean(scores)-50`. We found a $t$ = `r t.test(scores, mu=50)$statistic`. The probability of this $t$ or larger occurring is $p$ = `r t.test(scores, mu=50)$p.value`. We were testing the idea that our sample mean of `r mean(scores)` could have come from a normal distribution with mean = 50. The $t$ test tells us that the $t$ for our sample, or a larger one, would happen with p = `r t.test(scores, mu=50)$p.value`. In other words, chance can do it a kind of small amount of time, but not often. In English, this means that all of the students could have been guessing, but it wasn't that likely that were just guessing.
The next $t$-test is called a **paired samples t-test**. And, spoiler alert, we will find out that a paired samples t-test is actually a one-sample t-test in disguise (WHAT!), yes it is. If the one-sample $t$-test didn't make sense to you, read the next section.
## Paired-samples t-test
For me (Crump), many analyses often boil down to a paired samples t-test. It just happens that many things I do reduce down to a test like this.
I am a cognitive psychologist, I conduct research about how people do things like remember, pay attention, and learn skills. There are lots of Psychologists like me, who do very similar things.
We all often conduct the same kinds of experiments. They go like this, and they are called **repeated measures** designs. They are called **repeated measures** designs, because we measure how one person does something more than once, we **repeat** the measure.
So, I might measure somebody doing something in condition A, and measure the same person doing something in Condition B, and then I see that same person does different things in the two conditions. I **repeatedly measure** the same person in both conditions. I am interested in whether the experimental manipulation changes something about how people perform the task in question.
### Mehr, Song, and Spelke (2016)
We will introduce the paired-samples t-test with an example using real data, from a real study. @mehr20165 were interested in whether singing songs to infants helps infants become more sensitive to social cues. For example, infants might need to learn to direct their attention toward people as a part of learning how to interact socially with people. Perhaps singing songs to infants aids this process of directing attention. When an infant hears a familiar song, they might start to pay more attention to the person singing that song, even after they are done singing the song. The person who sang the song might become more socially important to the infant. You will learn more about this study in the lab for this week. This example, prepares you for the lab activities. Here is a brief summary of what they did.
First, parents were trained to sing a song to their infants. After many days of singing this song to the infants, a parent came into the lab with their infant. In the first session, parents sat with their infants on their knees, so the infant could watch two video presentations. There were two videos. Each video involved two unfamiliar new people the infant had never seen before. Each new person in the video (the singers) sang one song to the infant. One singer sang the "familiar" song the infant had learned from their parents. The other singer sang an "unfamiliar" song the infant had not hear before.
There were two really important measurement phases: the baseline phase, and the test phase.
The baseline phase occurred before the infants saw and heard each singer sing a song. During the baseline phase, the infants watched a video of both singers at the same time. The researchers recorded the proportion of time that the infant looked at each singer. The baseline phase was conducted to determine whether infants had a preference to look at either person (who would later sing them a song).
The test phase occurred **after** infants saw and heard each song, sung by each singer. During the test phase, each infant had an opportunity to watch silent videos of both singers. The researchers measured the proportion of time the infants spent looking at each person. The question of interest, was whether the infants would spend a greater proportion of time looking at the singer who sang the familiar song, compared to the singer who sang the unfamiliar song.
There is more than one way to describe the design of this study. We will describe it like this. It was a repeated measures design, with one independent (manipulation) variable called Viewing phase: Baseline versus Test. There was one dependent variable (the measurement), which was proportion looking time (to singer who sung familiar song). This was a repeated measures design because the researchers measured proportion looking time twice (they repeated the measure), once during baseline (before infants heard each singer sing a song), and again during test (after infants head each singer sing a song).
The important question was whether infants would change their looking time, and look more at the singer who sang the familiar song during the test phase, than they did during the baseline phase. This is a question about a change within individual infants. In general, the possible outcomes for the study are:
1. No change: The difference between looking time toward the singer of the familiar song during baseline and test is zero, no difference.
2. Positive change: Infants will look longer toward the singer of the familiar song during the test phase (after they saw and heard the singers), compared to the baseline phase (before they saw and heard the singers). This is a positive difference if we use the formula: Test Phase Looking time - Baseline phase looking time (to familiar song singer).
3. Negative change: Infants will look longer toward the singer of the unfamiliar song during the test phase (after they saw and heard the singers), compared to the baseline phase (before they saw and heard the singers). This is a negative difference if we use the same formula: Test Phase Looking time - Baseline phase looking time (to familiar song singer).
### The data
Let's take a look at the data for the first 5 infants in the study. This will help us better understand some properties of the data before we analyze it. We will see that the data is structured in a particular way that we can take advantage of with a paired samples t-test. Note, we look at the first 5 infants to show how the computations work. The results of the paired-samples t-test change when we use all of the data from the study.
Here is a table of the data:
```{r, echo=FALSE}
library(data.table)
all_data <- fread("data/MehrSongSpelke2016.csv")
library(dplyr)
experiment_one <- all_data %>% filter(exp1==1)
paired_sample_df <- data.frame(infant=1:5,
Baseline = round(experiment_one$Baseline_Proportion_Gaze_to_Singer[1:5],digits=2),
Test = round(experiment_one$Test_Proportion_Gaze_to_Singer[1:5], digits=2))
# mean_scores <- mean(scores)
# Difference_from_Mean <- scores-mean_scores
# Squared_Deviations <- Difference_from_Mean^2
# the_df<-data.frame(students,
# scores,
# mean=rep(mean_scores,10),
# Difference_from_Mean,
# Squared_Deviations)
#
# the_df <- the_df %>%
# rbind(c("Sums",colSums(the_df[1:10,2:5]))) %>%
# rbind(c("Means",colMeans(the_df[1:10,2:5]))) %>%
# rbind(c(" "," "," ","sd ",round(sd(the_df[1:10,2]),digits=2))) %>%
# rbind(c(" "," "," ","SEM ",round(sd(the_df[1:10,2])/sqrt(10), digits=2))) %>%
# rbind(c(" "," "," ","t",(61-50)/round(sd(the_df[1:10,2])/sqrt(10), digits=2)))
knitr::kable(paired_sample_df)
```
The table shows proportion looking times toward the singer of the familiar song during the Baseline and Test phases. Notice there are five different infants, (1 to 5). Each infant is measured twice, once during the Baseline phase, and once during the Test phase. To repeat from before, this is a repeated-measures design, because the infants are measured repeatedly (twice in this case). Or, this kind of design is also called a **paired-samples** design. Why? because each participant comes with a pair of samples (two samples), one for each level of the design.
Great, so what are we really interested in here? We want to know if the mean looking time toward the singer of the familiar song for the Test phase is higher than the Baseline phase. We are comparing the two sample means against each other and looking for a difference. We already know that differences could be obtained by chance alone, simply because we took two sets of samples, and we know that samples can be different. So, we are interested in knowing whether chance was likely or unlikely to have produced any difference we might observe.
In other words, we are interested in looking at the difference scores between the baseline and test phase for each infant. The question here is, for each infant, did their proportion looking time to the singer of the familiar song, increase during the test phase as compared to the baseline phase.
### The difference scores
Let's add the difference scores to the table of data so it is easier to see what we are talking about. The first step in creating difference scores is to decide how you will take the difference, there are two options:
1. Test phase score - Baseline Phase Score
2. Baseline phase score - Test Phase score
Let's use the first formula. Why? Because it will give us positive differences when the test phase score is higher than the baseline phase score. This makes a positive score meaningful with respect to the study design, we know (because we defined it to be this way), that positive scores will refer to longer proportion looking times (to singer of familiar song) during the test phase compared to the baseline phase.
```{r}
paired_sample_df <- cbind(paired_sample_df,
differences = (paired_sample_df$Test-
paired_sample_df$Baseline))
knitr::kable(paired_sample_df)
```
There we have it, the difference scores. The first thing we can do here is look at the difference scores, and ask how many infants showed the effect of interest. Specifically, how many infants showed a positive difference score. We can see that three of five infants showed a positive difference (they looked more at the singer of the familiar song during the test than baseline phase), and two the infants showed the opposite effect (negative difference, they looked more at the singer of the familiar song during baseline than test).
### The mean difference
As we have been discussing, the effect of interest in this study is the mean difference between the baseline and test phase proportion looking times. We can calculate the **mean difference**, by finding the **mean of the difference scores**. Let's do that, in fact, for fun let's calculate the mean of the baseline scores, the test scores, and the difference scores.
```{r}
paired_sample_df <- paired_sample_df %>%
rbind(c("Sums",colSums(paired_sample_df[1:5,2:4]))) %>%
rbind(c("Means",colMeans(paired_sample_df[1:5,2:4])))
knitr::kable(paired_sample_df)
```
We can see there was a positive mean difference of 0.054, between the test and baseline phases.
Can we rush to judgment and conclude that infants are more socially attracted to individuals who have sung them a familiar song? I would hope not based on this very small sample. First, the difference in proportion looking isn't very large, and of course we recognize that this difference could have been produced by chance.
We will more formally evaluate whether this difference could have been caused by chance with the paired-samples t-test. But, before we do that, let's again calculate $t$ and discuss what $t$ tells us over and above what our measure of the mean of the difference scores tells us.
### Calculate t
OK, so how do we calculate $t$ for a paired-samples $t$-test? Surprise, we use the one-sample t-test formula that you already learned about! Specifically, we use the one-sample $t$-test formula on the difference scores. We have one sample of difference scores (you can see they are in one column), so we can use the one-sample $t$-test on the difference scores. Specifically, we are interested in comparing whether the mean of our difference scores came from a distribution with mean difference = 0. This is a special distribution we refer to as the **null distribution**. It is the distribution no differences. Of course, this **null distribution** can produce differences due to to sampling error, but those differences are not caused by any experimental manipulation, they caused by the random sampling process.
We calculate $t$ in a moment. Let's now consider again why we want to calculate $t$? Why don't we just stick with the mean difference we already have?
Remember, the whole concept behind $t$, is that it gives an indication of how confident we should be in our mean. Remember, $t$ involves a measure of the mean in the numerator, divided by a measure of variation (standard error of the sample mean) in the denominator. The resulting $t$ value is small when the mean difference is small, or when the variation is large. So small $t$-values tell us that we shouldn't be that confident in the estimate of our mean difference. Large $t$-values occur when the mean difference is large and/or when the measure of variation is small. So, large $t$-values tell us that we can be more confident in the estimate of our mean difference. Let's find $t$ for the mean difference scores. We use the same formulas as we did last time:
```{r, echo=FALSE}
paired_sample_df <- data.frame(infant=1:5,
Baseline = round(experiment_one$Baseline_Proportion_Gaze_to_Singer[1:5],digits=2),
Test = round(experiment_one$Test_Proportion_Gaze_to_Singer[1:5], digits=2))
differences <- paired_sample_df$Test-paired_sample_df$Baseline
diff_from_mean <- differences-mean(differences)
Squared_differences <- diff_from_mean^2
paired_sample_df <- cbind(paired_sample_df,
differences,
diff_from_mean,
Squared_differences)
paired_sample_df <- paired_sample_df %>%
rbind(c("Sums",colSums(paired_sample_df[1:5,2:6]))) %>%
rbind(c("Means",colMeans(paired_sample_df[1:5,2:6]))) %>%
rbind(c(" "," "," "," ","sd ",round(sd(paired_sample_df[1:5,4]),digits=3))) %>%
rbind(c(" "," "," "," ","SEM ",round(sd(paired_sample_df[1:5,4])/sqrt(5), digits=3))) %>%
rbind(c(" "," "," "," ","t",mean(differences)/round(sd(paired_sample_df[1:5,4])/sqrt(5), digits=3)))
paired_sample_df[6,5]<-0
paired_sample_df[7,5]<-0
knitr::kable(paired_sample_df)
#t.test(round(differences, digits=2), mu=0)$statistic
```
If we did this test using R, we would obtain almost the same numbers (there is a little bit of rounding in the table).
```{r}
t.test(differences,mu=0)
```
Here is a quick write up of our t-test results, t(4) = .72, p = .509.
What does all of that tell us? There's a few things we haven't gotten into much yet. For example, the 4 represents degrees of freedom, which we discuss later. The important part, the $t$ value should start to be a little bit more meaningful. We got a kind of small t-value didn't we. It's .72. What can we tell from this value? First, it is positive, so we know the mean difference is positive. The sign of the $t$-value is always the same as the sign of the mean difference (ours was +0.054). We can also see that the p-value was .509. We've seen p-values before. This tells us that our $t$ value or larger, occurs about 50.9% of the time... Actually it means more than this. And, to understand it, we need to talk about the concept of two-tailed and one-tailed tests.
### Interpreting $t$s
Remember what it is we are doing here. We are evaluating whether our sample data could have come from a particular kind of distribution. The null distribution of no differences. This is the distribution of $t$-values that would occur for samples of size 5, with a mean difference of 0, and a standard error of the sample mean of .075 (this is the SEM that we calculated from our sample). We can see what this particular null-distribution looks like in @fig-6tnull.
```{r}
#| label: fig-6tnull
#| fig-cap: "A distribution of $t$-values that can occur by chance alone, when there is no difference between the sample and a population"
range <- seq(-3, 3, .1)
null_distribution <- dt(range, 4, log = FALSE)
plot_df <- data.frame(range, null_distribution)
ggplot(plot_df, aes(x = range, y = null_distribution)) +
geom_line() +
xlab("t-values") +
ylab("Probability") +
theme(axis.text.y = element_blank(), axis.ticks = element_blank()) +
scale_x_continuous(breaks = (seq(-3, 3, .5))) +
geom_label(data = data.frame(x = -.7, y = .1, label = "50% \n (-)"),
aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(x = .7, y = .1, label = "50% \n (+)"), aes(x = x, y = y, label = label)) +
geom_vline(xintercept = 0)+
theme_classic(base_size = 20)
```
The $t$-distribution above shows us the kinds of values $t$ will will take by chance alone, when we measure the mean differences for pairs of 5 samples (like our current). $t$ is most likely to be zero, which is good, because we are looking at the distribution of no-differences, which should most often be 0! But, sometimes, due to sampling error, we can get $t$s that are bigger than 0, either in the positive or negative direction. Notice the distribution is symmetrical, a $t$ from the null-distribution will be positive half of the time, and negative half of the time, that is what we would expect by chance.
So, what kind of information do we want know when we find a particular $t$ value from our sample? We want to know how likely the $t$ value like the one we found occurs just by chance. This is actually a subtly nuanced kind of question. For example, any particular $t$ value doesn't have a specific probability of occurring. When we talk about probabilities, we are talking about ranges of probabilities. Let's consider some probabilities. We will use the letter $p$, to talk about the probabilities of particular $t$ values.
1. What is the probability that $t$ is zero or positive or negative? The answer is p=1, or 100%. We will always have a $t$ value that is zero or non-zero...Actually, if we can't compute the t-value, for example when the standard deviation is undefined, I guess then we would have a non-number. But, assuming we can calculate $t$, then it will always be 0 or positive or negative.
2. What is the probability of $t$ = 0 or greater than 0? The answer is p=.5, or 50%. 50% of $t$-values are 0 or greater.
3. What is the of $t$ = 0 or smaller than 0? The answer is p=.5, or 50%. 50% of $t$-values are 0 or smaller.
We can answer all of those questions just by looking at our t-distribution, and dividing it into two equal regions, the left side (containing 50% of the $t$ values), and the right side containing 50% of the $t$-values).
What if we wanted to take a more fine-grained approach, let's say we were interested in regions of 10%. What kinds of $t$s occur 10% of the time. We would apply lines like the following. Notice, the likelihood of bigger numbers (positive or negative) gets smaller, so we have to increase the width of the bars for each of the intervals between the bars to contain 10% of the $t$-values, it looks like @fig-6percentregions.
```{r}
#| label: fig-6percentregions
#| fig-cap: "Splitting the t distribution up into regions each containing 10% of the $t$-values. The width between the bars narrows as they approach the center of the distribution, where there are more $t$-values."
#| out-width: "100%"
range <- seq(-3, 3, .1)
null_distribution <- dt(range, 4, log = FALSE)
plot_df <- data.frame(range, null_distribution)
t_ps <- qt(seq(.1, .9, .1), 4)
ggplot(plot_df, aes(x = range, y = null_distribution)) +
geom_line() +
xlab("t-values") +
ylab("Probability") +
geom_vline(xintercept = t_ps) +
theme_classic(base_size = 15) +
theme(axis.text.y = element_blank(), axis.ticks = element_blank()) +
scale_x_continuous(breaks = round(t_ps, digits = 1)) +
geom_label(data = data.frame(x = -2.5, y = .2, label = "10%"), aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(x = -1.3, y = .2, label = "10%"), aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(x = 2.5, y = .2, label = "10%"), aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(x = 1.3, y = .2, label = "10%"), aes(x = x, y = y, label = label))
```
Consider the probabilities ($p$) of $t$ for the different ranges.
1. $t$ \<= -1.5 ($t$ is less than or equal to -1.5), $p$ = 10%
2. -1.5 \>= $t$ \<= -0.9 ($t$ is equal to or between -1.5 and -.9), $p$ = 10%
3. -.9 \>= $t$ \<= -0.6 ($t$ is equal to or between -.9 and -.6), $p$ = 10%
4. $t$ \>= 1.5 ($t$ is greater than or equal to 1.5), $p$ = 10%
Notice, that the $p$s are always 10%. $t$s occur in these ranges with 10% probability.
### Getting the p-values for $t$-values
You might be wondering where I am getting some of these values from. For example, how do I know that 10% of $t$ values (for this null distribution) have a value of approximately 1.5 or greater than 1.5? The answer is I used R to tell me.
In most statistics textbooks the answer would be: there is a table at the back of the book where you can look these things up...This textbook has no such table. We could make one for you. And, we might do that. But, we didn't do that yet...
So, where do these values come from, how can you figure out what they are? The complicated answer is that we are not going to explain the math behind finding these values because, 1) the authors (some of us) admittedly don't know the math well enough to explain it, and 2) it would sidetrack us to much, 3) you will learn how to get these numbers in the lab with software, 4) you will learn how to get these numbers in lab without the math, just by doing a simulation, and 5) you can do it in R, or excel, or you can use an [online calculator](http://www.socscistatistics.com/pvalues/tdistribution.aspx).
This is all to say that you can find the $t$s and their associated $p$s using software. But, the software won't tell you what these values mean. That's we are doing here. You will also see that software wants to know a few more things from you, such as the degrees of freedom for the test, and whether the test is one-tailed or two tailed. We haven't explained any of these things yet. That's what we are going to do now. Note, we explain degrees of freedom last. First, we start with a one-tailed test.
### One-tailed tests
A **one-tailed test** is sometimes also called a directional test. It is called a directional test, because a researcher might have a hypothesis in mind suggesting that the difference they observe in their means is going to have a particular direction, either a positive difference, or a negative difference.
Typically, a researcher would set an **alpha criterion**. The alpha criterion describes a line in the sand for the researcher. Often, the alpha criterion is set at $p = .05$. What does this mean? @fig-6critT shows the $t$-distribution and the alpha criterion.
```{r}
#| label: fig-6critT
#| fig.cap: "The critical value of t for an alpha criterion of 0.05. 5% of all ts are at this value or larger"
range <- seq(-3, 3, .1)
null_distribution <- dt(range, 4, log = FALSE)
plot_df <- data.frame(range, null_distribution)
t_ps <- qt(seq(.1, .9, .1), 4)
ggplot(plot_df, aes(x = range, y = null_distribution)) +
geom_line() +
xlab("t-values") +
ylab("Probability") +
geom_vline(xintercept = qt(.95, 4, lower.tail = TRUE)) +
ggtitle("Critical t for one-tailed test") +
theme_classic(base_size = 10) +
theme(axis.text.y = element_blank(), axis.ticks = element_blank()) +
annotate(
"rect",
xmin = qt(.95, 4, lower.tail = TRUE),
xmax = 3,
ymin = 0,
ymax = Inf,
alpha = 0.5,
fill = "green"
) +
geom_label(data = data.frame(x = 2.5, y = .2, label = "5%"), aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(
x = qt(.95, 4, lower.tail = TRUE),
y = .3,
label = "Critical t"
),
aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(
x = qt(.95, 4, lower.tail = TRUE),
y = .25,
label = round(qt(.95, 4, lower.tail = TRUE), digits =
2)
),
aes(x = x, y = y, label = label))
```
The figure shows that $t$ values of +2.13 or greater occur 5% of the time. Because the t-distribution is symmetrical, we also know that $t$ values of -2.13 or smaller also occur 5% of the time. Both of these properties are true under the null distribution of no differences. This means, that when there really are no differences, a researcher can expect to find $t$ values of 2.13 or larger 5% of the time.
Let's review and connect some of the terms:
1. **alpha criterion**: the criterion set by the researcher to make decisions about whether they believe chance did or did not cause the difference. The alpha criterion here is set to $p = .05$.
2. **Critical** $t$. The critical $t$ is the $t$-value associated with the alpha-criterion. In this case for a one-tailed test, it is the $t$ value where 5% of all $t$s are this number or greater. In our example, the critical $t$ is 2.13. 5% of all $t$ values (with degrees of freedom = 4) are +2.13, or greater than +2.13.
3. **Observed** $t$. The observed $t$ is the one that you calculated from your sample. In our example about the infants, the observed $t$ was $t$ (4) = 0.72.
4. **p-value**. The $p$-value is the probability of obtaining the observed $t$ value or larger. Now, you could look back at our previous example, and find that the $p$-value for $t$ (4) = .72, was $p = .509$ . HOWEVER, this p-value was not calculated for a one-directional test...(we talk about what .509 means in the next section).
@fig-6tonedirection shows what the $p$-value for $t$ (4) = .72 using a one-directional test would would look like:
```{r}
#| label: fig-6tonedirection
#| fig-cap: "A case where the observed value of t is much less than the critical value for a one-directional t-test."
range <- seq(-3, 3, .1)
null_distribution <- dt(range, 4, log = FALSE)
plot_df <- data.frame(range, null_distribution)
t_ps <- qt(seq(.1, .9, .1), 4)
ggplot(plot_df, aes(x = range, y = null_distribution)) +
geom_line() +
xlab("t-values") +
ylab("Probability") +
geom_vline(xintercept = .72) +
geom_vline(xintercept = qt(.95, 4, lower.tail = TRUE)) +
ggtitle("t value and p-range for one-directional test") +
theme_classic(base_size = 10) +
theme(axis.text.y = element_blank(), axis.ticks = element_blank()) +
annotate(
"rect",
xmin = .72,
xmax = 3,
ymin = 0,
ymax = Inf,
alpha = 0.5,
fill = "grey"
) +
annotate(
"rect",
xmin = qt(.95, 4, lower.tail = TRUE),
xmax = 3,
ymin = 0,
ymax = .25,
alpha = 0.5,
fill = "green"
) +
geom_label(data = data.frame(x = 2.5, y = .2, label = "5%"), aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(
x = qt(.95, 4, lower.tail = TRUE),
y = .3,
label = "Critical t"
),
aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(
x = qt(.95, 4, lower.tail = TRUE),
y = .25,
label = round(qt(.95, 4, lower.tail = TRUE), digits =
2)
),
aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(x = .72, y = .1,
label = "Observed t"),
aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(x = .72, y = .05,
label = ".72, p="),
aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(
x = 1.5,
y = .05,
label = round(pt(.72, 4, lower.tail = FALSE), digits =
3)
),
aes(x = x, y = y, label = label))
```
Let's take this one step at a time. We have located the observed $t$ of .72 on the graph. We shaded the right region all grey. What we see is that the grey region represents .256 or 25.6% of all $t$ values. In other words, 25.6% of $t$ values are .72 or larger than .72. You could expect, by chance alone, to a find a $t$ value of .72 or larger, 25.6% of the time. That's fairly often. We did find a $t$ value of .72. Now that you know this kind of $t$ value or larger occurs 25.6% of the time, would you be confident that the mean difference was not due to chance? Probably not, given that chance can produce this difference fairly often.
Following the "standard" decision making procedure, we would claim that our $t$ value was **not statistically significant**, because it was not large enough. If our observed value was larger than the critical $t$ (larger than 2.13), defined by our alpha criterion, then we would claim that our $t$ value was **statistically significant**. This would be equivalent to saying that we believe it is unlikely that the difference we observed was due to chance. In general, for any observed $t$ value, the associated $p$-value tells you how likely a $t$ of the observed size or larger would be observed. The $p$-value **always** refers to a **range** of $t$-values, never to a single $t$-value. Researchers use the alpha criterion of .05, as a matter of convenience and convention. There are other ways to interpret these values that do not rely on a strict (significant versus not) dichotomy.
### Two-tailed tests
OK, so that was one-tailed tests... What are two tailed tests? The $p$-value that we originally calculated from our paired-samples $t$-test was for a 2-tailed test. Often, the default is that the $p$-value is for a two-tailed test.
The two-tailed test, is asking a more general question about whether a difference is likely to have been produced by chance. The question is: what is probability of any difference. It is also called a **non-directional** test, because here we don't care about the direction or sign of the difference (positive or negative), we just care if there is any kind of difference.
The same basic things as before are involved. We define an alpha criterion ($\alpha = 0.05$). And, we say that any observed $t$ value that has a probability of $p$ \<.05 ($p$ is less than .05) will be called **statistically significant**, and ones that are more likely ($p$ \>.05, $p$ is greater than .05) will be called null-results, or not statistically significant. The only difference is how we draw the alpha range. Before it was on the right side of the $t$ distribution (we were conducting a one-sided test remember, so we were only interested in one side).
@fig-6twotailedt shows what the most extreme 5% of the $t$-values are when we ignore their sign (whether they are positive or negative).
```{r}
#| label: fig-6twotailedt
#| fig.cap: "Critical values for a two-tailed test. Each line represents the location where 2.5% of all $t$s are larger or smaller than critical value. The total for both tails is 5%"
range <- seq(-4, 4, .1)
null_distribution <- dt(range, 4, log = FALSE)
plot_df <- data.frame(range, null_distribution)
t_ps <- qt(seq(.1, .9, .1), 4)
ggplot(plot_df, aes(x = range, y = null_distribution)) +
geom_line() +
xlab("t-values") +
ylab("Probability") +
geom_vline(xintercept = qt(.975, 4, lower.tail = TRUE)) +
geom_vline(xintercept = qt(.025, 4, lower.tail = TRUE)) +
ggtitle("Critical ts for two-tailed test") +
theme_classic(base_size = 10) +
theme(axis.text.y = element_blank(), axis.ticks = element_blank()) +
annotate(
"rect",
xmin = qt(.975, 4, lower.tail = TRUE),
xmax = 4,
ymin = 0,
ymax = Inf,
alpha = 0.5,
fill = "green"
) +
geom_label(data = data.frame(x = 3.5, y = .2, label = "2.5%"), aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(
x = qt(.975, 4, lower.tail = TRUE),
y = .3,
label = "Critical t"
),
aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(
x = qt(.975, 4, lower.tail = TRUE),
y = .25,
label = round(qt(.975, 4, lower.tail = TRUE), digits =
2)
),
aes(x = x, y = y, label = label)) +
annotate(
"rect",
xmin = -4,
xmax = qt(.025, 4, lower.tail = TRUE),
ymin = 0,
ymax = Inf,
alpha = 0.5,
fill = "green"
) +
geom_label(data = data.frame(x = -3.5, y = .2, label = "2.5%"), aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(
x = qt(.025, 4, lower.tail = TRUE),
y = .3,
label = "Critical t"
),
aes(x = x, y = y, label = label)) +
geom_label(data = data.frame(
x = qt(.025, 4, lower.tail = TRUE),
y = .25,
label = round(qt(.025, 4, lower.tail = TRUE), digits =
2)
),
aes(x = x, y = y, label = label))
```
Here is what we are seeing. A distribution of no differences (the null, which is what we are looking at), will produce $t$s that are 2.78 or greater 2.5% of the time, and $t$s that are -2.78 or smaller 2.5% of the time. 2.5% + 2.5% is a total of 5% of the time. We could also say that $t$s larger than +/- 2.78 occur 5% of the time.
As a result, the critical $t$ value is (+/-) 2.78 for a two-tailed test. As you can see, the two-tailed test is blind to the direction or sign of the difference. Because of this, the critical $t$ value is also higher for a two-tailed test, than for the one-tailed test that we did earlier. Hopefully, now you can see why it is called a two-tailed test. There are two tails of the distribution, one on the left and right, both shaded in green.
### One or two tailed, which one?
Now that you know there are two kinds of tests, one-tailed, and two-tailed, which one should you use? There is some conventional wisdom on this, but also some debate. In the end, it is up to you to be able to justify your choice and why it is appropriate for you data. That is the real answer.
The conventional answer is that you use a one-tailed test when you have a theory or hypothesis that is making a directional prediction (the theory predicts that the difference will be positive, or negative). Similarly, use a two-tailed test when you are looking for any difference, and you don't have a theory that makes a directional prediction (it just makes the prediction that there will be a difference, either positive or negative).
Also, people appear to choose one or two-tailed tests based on how risky they are as researchers. If you always ran one-tailed tests, your critical $t$ values for your set alpha criterion would always be smaller than the critical $t$s for a two-tailed test. Over the long run, you would make more type I errors, because the criterion to detect an effect is a lower bar for one than two tailed tests.
> Remember type 1 errors occur when you reject the idea that chance could have caused your difference. You often never know when you make this error. It happens anytime that sampling error was the actual cause of the difference, but a researcher dismisses that possibility and concludes that their manipulation caused the difference.
Similarly, if you always ran two-tailed tests, even when you had a directional prediction, you would make fewer type I errors over the long run, because the $t$ for a two-tailed test is higher than the $t$ for a one-tailed test. It seems quite common for researchers to use a more conservative two-tailed test, even when they are making a directional prediction based on theory. In practice, researchers tend to adopt a standard for reporting that is common in their field. Whether or not the practice is justifiable can sometimes be an open question. The important task for any researcher, or student learning statistics, is to be able to justify their choice of test.
### Degrees of freedom
Before we finish up with paired-samples $t$-tests, we should talk about degrees of freedom. Our sense is that students don't really understand degrees of freedom very well. If you are reading this textbook, you are probably still wondering what is degrees of freedom, seeing as we haven't really talked about it all.
For the $t$-test, there is a formula for degrees of freedom. For the one-sample and paired sample $t$-tests, the formula is:
$\text{Degrees of Freedom} = \text{df} = n-1$. Where n is the number of samples in the test.
In our paired $t$-test example, there were 5 infants. Therefore, degrees of freedom = 5-1 = 4.
OK, that's a formula. Who cares about degrees of freedom, what does the number mean? And why do we report it when we report a $t$-test... you've probably noticed the number in parentheses e.g., $t$(4)=.72, the 4 is the $df$, or degrees of freedom.
Degrees of freedom is both a concept, and a correction. The concept is that if you estimate a property of the numbers, and you use this estimate, you will be forcing some constraints on your numbers.
Consider the numbers: 1, 2, 3. The mean of these numbers is 2. Now, let's say I told you that the mean of three numbers is 2. Then, how many of these three numbers have freedom? Funny question right. What we mean is, how many of the three numbers could be any number, or have the freedom to be any number.
The first two numbers could be any number. But, once those two numbers are set, the final number (the third number), MUST be a particular number that makes the mean 2. The first two numbers have freedom. The third number has no freedom.
To illustrate. Let's freely pick two numbers: 51 and -3. I used my personal freedom to pick those two numbers. Now, if our three numbers are 51, -3, and x, and the mean of these three numbers is 2. There is only one solution, x has to be -42, otherwise the mean won't be 2. This is one way to think about degrees of freedom. The degrees of freedom for these three numbers is n-1 = 3-1= 2, because 2 of the numbers can be free, but the last number has no freedom, it becomes fixed after the first two are decided.
Now, statisticians often apply degrees of freedom to their calculations, especially when a second calculation relies on an estimated value. For example, when we calculate the standard deviation of a sample, we first calculate the mean of the sample right! By estimating the mean, we are fixing an aspect of our sample, and so, our sample now has n-1 degrees of freedom when we calculate the standard deviation (remember for the sample standard deviation, we divide by n-1...there's that n-1 again.)
#### Simulating how degrees of freedom affects the $t$ distribution
There are at least two ways to think the degrees of freedom for a $t$-test. For example, if you want to use math to compute aspects of the $t$ distribution, then you need the degrees of freedom to plug in to the formula... If you want to see the formulas I'm talking about, scroll down on the [$t$-test wikipedia page](https://en.wikipedia.org/wiki/Student%27s_t-distribution) and look for the probability density or cumulative distribution functions...We think that is quite scary for most people, and one reason why degrees of freedom are not well-understood.
If we wanted to simulate the $t$ distribution we could more easily see what influence degrees of freedom has on the shape of the distribution. Remember, $t$ is a sample statistic, it is something we measure from the sample. So, we could simulate the process of measuring $t$ from many different samples, then plot the histogram of $t$ to show us the simulated $t$ distribution.
```{r}
#| label: fig-6dft
#| fig-cap: "The width of the t distribution shrinks as sample size and degrees of freedom (from 4 to 100) increases."
#| fig-asp: 0.5
#| out-width: "100%"
ts <- c(rt(10000, 4), rt(10000, 100))
dfs <- as.factor(rep(c(4, 100), each = 10000))
t_df <- data.frame(dfs, ts)
t_df <- t_df[abs(t_df$ts) < 5, ]
ggplot(t_df, aes(x = ts, group = dfs, color = dfs)) +
geom_histogram() +
theme_classic() +
facet_wrap( ~ dfs) +
theme_classic(base_size=15)
```
In @fig-6dft notice that the red distribution for $df = 4$, is a little bit shorter, and a little bit wider than the bluey-green distribution for $df = 100$. As degrees of freedom increase the $t$ distribution gets taller (in the middle), and narrower in the range. It get's more peaky. Can you guess the reason for this? Remember, we are estimating a sample statistic, and degrees of freedom is really just a number that refers to the number of subjects (well minus one). And, we already know that as we increase $n$, our sample statistics become better estimates (less variance) of the distributional parameters they are estimating. So, $t$ becomes a better estimate of it's "true" value as sample size increase, resulting in a more narrow distribution of $t$s.
There is a slightly different $t$ distribution for every degrees of freedom, and the critical regions associated with 5% of the extreme values are thus slightly different every time. This is why we report the degrees of freedom for each t-test, they define the distribution of $t$ values for the sample-size in question. Why do we use n-1 and not n? Well, we calculate $t$ using the sample standard deviation to estimate the standard error or the mean, that estimate uses n-1 in the denominator, so our $t$ distribution is built assuming n-1. That's enough for degrees of freedom...
## The paired samples t-test strikes back
You must be wondering if we will ever be finished talking about paired samples t-tests... why are we doing round 2, oh no! Don't worry, we're just going to 1) remind you about what we were doing with the infant study, and 2) do a paired samples t-test on the entire data set and discuss.
Remember, we were wondering if the infants would look longer toward the singer who sang the familiar song during the test phase compared to the baseline phase. We showed you data from 5 infants, and walked through the computations for the $t$-test. As a reminder, it looked like this:
```{r, echo=FALSE}
paired_sample_df <- data.frame(infant=1:5,
Baseline = round(experiment_one$Baseline_Proportion_Gaze_to_Singer[1:5],digits=2),
Test = round(experiment_one$Test_Proportion_Gaze_to_Singer[1:5], digits=2))
differences <- paired_sample_df$Test-paired_sample_df$Baseline
diff_from_mean <- differences-mean(differences)
Squared_differences <- diff_from_mean^2
paired_sample_df <- cbind(paired_sample_df,
differences,
diff_from_mean,
Squared_differences)
paired_sample_df <- paired_sample_df %>%
rbind(c("Sums",colSums(paired_sample_df[1:5,2:6]))) %>%
rbind(c("Means",colMeans(paired_sample_df[1:5,2:6]))) %>%
rbind(c(" "," "," "," ","sd ",round(sd(paired_sample_df[1:5,4]),digits=3))) %>%
rbind(c(" "," "," "," ","SEM ",round(sd(paired_sample_df[1:5,4])/sqrt(5), digits=3))) %>%
rbind(c(" "," "," "," ","t",mean(differences)/round(sd(paired_sample_df[1:5,4])/sqrt(5), digits=3)))
paired_sample_df[6,5]<-0
paired_sample_df[7,5]<-0
knitr::kable(paired_sample_df)
t.test(round(differences, digits=2), mu=0)
```
Let's write down the finding one more time: The mean difference was 0.054, $t$(4) = .72, $p$ =.509. We can also now confirm, that the $p$-value was from a two-tailed test. So, what does this all really mean.
We can say that a $t$ value with an absolute of .72 or larger occurs 50.9% of the time. More precisely, the distribution of no differences (the null), will produce a $t$ value this large or larger 50.9% of the time. In other words, chance alone good have easily produced the $t$ value from our sample, and the mean difference we observed or .054, could easily have been a result of chance.
Let's quickly put all of the data in the $t$-test, and re-run the test using all of the infant subjects.
```{r}
paired_sample_df <- data.frame(infant=1:32,
Baseline = round(experiment_one$Baseline_Proportion_Gaze_to_Singer[1:32],digits=2),
Test = round(experiment_one$Test_Proportion_Gaze_to_Singer[1:32], digits=2))
differences <- paired_sample_df$Test-paired_sample_df$Baseline
t.test(differences,mu=0)
```
Now we get a very different answer. We would summarize the results saying the mean difference was .073, t(31) = 2.44, p = 0.020. How many total infants were their? Well the degrees of freedom was 31, so there must have been 32 infants in the study. Now we see a much smaller $p$-value. This was also a two-tailed test, so we that observing a $t$ value of 2.4 or greater (absolute value) only occurs 2% of the time. In other words, the distribution of no differences will produce the observed t-value very rarely. So, it is unlikely that the observed mean difference of .073 was due to chance (it could have been due to chance, but that is very unlikely). As a result, we can be somewhat confident in concluding that something about seeing and hearing a unfamiliar person sing a familiar song, causes an infant to draw their attention toward the singer, and this potentially benefits social learning on the part of the infant.
## Independent samples t-test: The return of the t-test?
If you've been following the Star Wars references, we are on last movie (of the original trilogy)... the independent t-test. This is were basically the same story plays out as before, only slightly different.
Remember there are different $t$-tests for different kinds of research designs. When your design is a **between-subjects** design, you use an **independent samples t-test**. Between-subjects design involve different people or subjects in each experimental condition. If there are two conditions, and 10 people in each, then there are 20 total people. And, there are no paired scores, because every single person is measured once, not twice, no repeated measures. Because there are no repeated measures we can't look at the difference scores between conditions one and two. The scores are not paired in any meaningful way, to it doesn't make sense to subtract them. So what do we do?
The logic of the independent samples t-test is the very same as the other $t$-tests. We calculated the means for each group, then we find the difference. That goes into the numerator of the t formula. Then we get an estimate of the variation for the denominator. We divide the mean difference by the estimate of the variation, and we get $t$. It's the same as before.
The only wrinkle here is what goes into the denominator? How should we calculate the estimate of the variance? It would be nice if we could do something very straightforward like this, say for an experiment with two groups A and B:
$t = \frac{\bar{A}-\bar{B}}{(\frac{SEM_A+SEM_B}{2})}$
In plain language, this is just:
1. Find the mean difference for the top part
2. Compute the SEM (standard error of the mean) for each group, and average them together to make a single estimate, pooling over both samples.
This would be nice, but unfortunately, it turns out that finding the average of two standard errors of the mean is not the best way to do it. This would create a biased estimator of the variation for the hypothesized distribution of no differences. We won't go into the math here, but instead of the above formula, we an use a different one that gives as an **unbiased estimate of the pooled standard error of the sample mean**. Our new and improved $t$ formula would look like this:
$t = \frac{\bar{X_A}-\bar{X_B}}{s_p * \sqrt{\frac{1}{n_A} + \frac{1}{n_B}}}$
and, $s_p$, which is the pooled sample standard deviation is defined as, note the \$s\$es in the formula are variances:
$s_p = \sqrt{\frac{(n_A-1)s_A^2 + (n_B-1)s^2_B}{n_A +n_B -2}}$
Believe you me, that is so much more formula than I wanted to type out. Shall we do one independent $t$-test example by hand, just to see the computations? Let's do it...but in a slightly different way than you expect. I show the steps using R. I made some fake scores for groups A and B. Then, I followed all of the steps from the formula, but made R do each of the calculations. This shows you the needed steps by following the code. At the end, I print the $t$-test values I computed "by hand", and then the $t$-test value that the R software outputs using the $t$-test function. You should be able to get the same values for $t$, if you were brave enough to compute $t$ by hand.
```{r, echo=TRUE }