---
title: "The Order of the Statistical Jedi: <div style='font-size: .8em;'>Responsibilities, Routines, and Rituals</div>"
author: "Dustin Fife"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
output:
html_document:
toc: true
number_sections: false
documentclass: book
bibliography: [book.bib, packages.bib, book_citations.bib, references.bib]
biblio-style: apalike
link-citations: yes
description: "A statistics textbook that teaches data analysis through the general linear model, with examples in R (flexplot) and JASP."
---
# Preface
<hr>
<img src="drawings/coverjedi.jpg" alt="drawing" class="cover"/>
<!--chapter:end:index.Rmd-->
# Introduction {#intro}
Placeholder
## The power of repetition (and my...umm...*complicated* history with statistics)
## But there's a better way
## The Curriculum Hasn't Changed in 50 Years!
## The General Linear Model Approach
<!--chapter:end:01-intro.Rmd-->
# Ethics
Placeholder
## History of the Replication Crisis
### Dederick Stapel
### Darryl Bem
### The "P-Hacking" Article
## P-hacking
## The Scientific Method Movement
## Values versus Ethics
## The Open Science Values
### 1. Protecting humanity
### 2. Seek truth
### 3. Openness and transparency.
### 4. Humility and skepticism.
### 5. Dissemination.
## Making Change
## Further data analysis ethics.
<!--chapter:end:02-Ethics.Rmd-->
# Measurement
Placeholder
## Why am I talking about measurement?
## Constructs
## Operational Definitions
## Validity
### Evaluating Validity
## Reliability
### Evaluating reliability
### Increasing Reliability
## Variable types
### Predictor versus Outcome Variables
### Measurement scales
## Take-home message
<!--chapter:end:03-Measurement.Rmd-->
# Univariate Distributions
Placeholder
## Categorical Variables
### Column Sorting
### Visualizing
#### Line Chart
#### Pie Charts
#### Bar Charts
### Interpreting Bar Charts
#### Redundant Labels
#### Unknown/Unexpected Labels
#### Swapped Labels
#### Small Sample Sizes
#### Practice
#### Visualizing Bar Charts in R
#### Visualizing Bar Charts in JASP
## Numeric Variables
### What to Look Out For
#### Normal (Bell-Curved) Distributions
#### Outliers
#### Skewness
#### Zero-Inflated Data
#### Bimodal data
#### Visualizing Histograms in R
#### Visualizing Histograms in JASP
<!--chapter:end:04-UnivariateDistributions.Rmd-->
# Univariate Estimates
Placeholder
## Central Tendency: What's the most likely score?
### Mean
#### Advantages of the Mean
#### Disadvantages
### Mode
#### Disadvantages of the Mode
#### Advantages of the Mode
### Median
#### Disadvantages of the Median
#### Advantages of the Median
### Central Tendency in JASP
### Central Tendency in R
## Variability: How precise are the scores?
### Range
#### Advantages/Disadvantages of the Range
### Deviations, Standard Deviation, and Variance
#### Advantages/Disadvantages of the Standard Deviation
### Median Absolute Deviation
#### Advantages/Disadvantages of the MAD
### Variability in JASP
### Variability in R
## Z-Scores and Probability
<!--chapter:end:05-UnivariateEstimates.Rmd-->
# Bivariate Visualizations {#bivariate_visuals}
Placeholder
## Avengers Dataset
### Visualizing bivariate relationships in R using Flexplot
### Visualizing bivariate relationships in JASP using Visual Modeling
## Scatterplots: Numeric on numeric
### What to look for
### Problems to look out for
#### Nonlinearity
#### Bivariate Outliers
### Practice
## Beeswarm plots: Categorical on Numeric
### What to look for
### Problems to look out for
#### Uneven sample sizes
#### Skewness
## Other Bivariate Plots
### Logistic Plots: Numeric on Binary
### Association Plots: Categorical on Categorical
<!--chapter:end:06-BivariateDistributions.Rmd-->
# Bivariate Estimates
Placeholder
## Statistics Help Us Predict Things
## Conditional Estimates
## Estimates for Numeric Predictors
### Slopes and Intercepts
### Making Predictions
### When Slopes/Intercepts Don't Make Sense
### Correlation Coefficients
#### What's a "Good" Correlation?
#### Computing Estimates in R
## Estimates for Categorical Predictors
### Slopes and Intercepts for Categorical Predictors?
### Cohen's $d$
<!--chapter:end:07-BivariateEstimates.Rmd-->
# Diagnostics {-}
Placeholder
## Models are tools. And they don't have feelings.
## Residuals
## Diagnostic tool \# 1: Histogram of the residuals
### Sensitivity Analyses
## Diagnostic tool \# 2: Residual Dependence (RD) Plot for Linearity
### Statistical Models are Lazy
### Residual Dependence Plots
### How to Fix Nonlinearity
#### What is a Quadratic Term?
### How to tell if nonlinearity is a problem?
### How much nonlinearity is too much?
## Diagnostic tool \# 3: Scale-Location (SL) Plots for Homoscedasticity
### Spread-Location (SL) Plots
## Outliers
## Independence
### Why do models assume independence?
### What happens if you violate the independence assumption?
### How to detect and handle dependent data?
## Summary
<!--chapter:end:08-Diagnostics.Rmd-->
# The Linear Model
Placeholder
## Wax on, wax off
## What is a model?
## What is the Linear Model?
## What makes a good statistical model?
## Prediction Versus Group Differences
### Out with the old, in with the shiny
## One-Sample T-Test
### Traditional Analysis
### One-Sample T-Test as a LM
## Independent Sample T-Test
### Preparing Data for a t-test
### Traditional t-test Analysis
### LM Approach
## Related t-test
### Traditional Related t-test Analysis
### LM Analysis of a Related t-test
## ANOVA
### Traditional Analysis of ANOVA
#### Multiple comparisons
### ANOVA as a LM
## Regression
### Traditional Regression Analysis
### LM Approach
## Categorical Outcome Variables
## It's All the Same!
## Summary
<!--chapter:end:09-General-Linear-Model.Rmd-->
# Visualizing Multivariate General Linear Models {#vizmvglms}
```{r, include=FALSE}
knitr::opts_chunk$set(
comment = "#>", echo = TRUE, message = FALSE, warning = FALSE, cache = TRUE
)
```
Years ago I was chit-chatting with my dad. I don't remember how it came up, but we were talking about the death penalty. I am vehemently opposed to the death penalty. My dad, on the other hand, is in favor. As we debated the pros and cons, I mentioned certain races were disproportionately affected.
"Hogwash," he said. "I've seen the statistics. They say that black men are no more likely than white men to receive the death penalty."
"True," I said. I was familiar with the statistic he was quoting.
But that's not the whole story. If a black man kills another black person, he is just as likely to get the death penalty as a white man who kills a white person. However, if a black man kills a *white* person, he is *much* more likely to receive the death penalty than if a white man kills a black person.
That there is unfair, you see. (Not to mention the fact that it's government-sponsored killing, and all that).
My dad's response?
"Well son," he said, "That just goes to show you can use statistics to prove anything."
That pissed me off, I must admit. It was a dismissal of my argument--a protestation that said my statistics were no better than his. But, as we'll learn, simple bivariate effects (in this case, race and death penalty) are always less complete than multivariate effects (in this case, race of perpetrator *and* race of victim).
## What is a Multivariate Relationship?
That there is a tricky question. And, you'll probably get different answers, depending on who you ask. Technically, multivariate means multiple variables. So, I suppose, any analysis that has more than one variable is a multivariate analysis.
But, I've actually seen people refer to multivariate relationships in one of two ways:
1. *When one performs an analysis of multiple **dependent** variables.* When statisticians use the word multivariate, they tend to be referring to this sort of analysis, where multiple dependent variables are investigated. These situations call for using Multivariate Analysis of Variance (MANOVA), factor analysis, principal components analysis, structural equation modeling, and canonical correlations. I personally find many (but not all) of these sorts of analyses archaic and quite useless. They are largely atheoretical, they confuse exploratory and confirmatory research, and they're hard to make sense of. That is *not* what this chapter is about.
2. *When one performs an analysis of multiple **independent** variables.* Most everybody, with the exception of heritage-breed statisticians, uses the term multivariate to refer to situations where multiple independent variables are used. This is what *I* mean when I say multivariate in this textbook. Does that make me a mutt-breed statistician? Maybe. But sometimes you have to sacrifice authenticity for practicality.
So, yes, this chapter (and those that follow) refers to situations where we try to predict a single outcome variable from multiple predictor variables. Yay.
## Reasons to use multivariate GLMs
The world is rarely simple. Rarely does one variable *alone* cause another. Rather, variables work together to produce effects, or one variable's effect is contingent upon another's. We use multivariate GLMs to gain a more complete picture of what is influencing our outcome variable.
But, that's a bit vague and stuff. Let me be a little more specific. **There are three reasons you would want to do a multivariate analysis**:
1. *Study interaction effects.* Sometimes variables "interact," which means their influence depends on other variable(s). For example, the annoyance I feel (outcome variable) about having department meetings (predictor variable #1) *depends* on whether there's food there (predictor variable #2); I'm happy to attend department meetings if they have baklava and pizza. Otherwise, boooo.
2. You want to *control* for things you're (likely) not interested in. For example, you might know depressed people tend to have crappy social lives, but you don't care about studying social functioning. You really care about depression's unique influence on, say, health. If that's the case, we want to "control" for social functioning.
3. To improve predictions. Long story short, the more variables you have, the better your predictions. So, naturally, if you want to predict who will be the next world wood-chopping champion, you can add more predictors (e.g., bicep circumference, years of experience, beard length).
Before I confabulate about each of these reasons, let's take a mini-break to look at some pictures. Why not, eh? Pictures are pretty.
Though, so as not to bother my mythical editor, I'm going to embed these pretty pictures within the process of data analysis.
## Visualizing Multivariate Relationships in Flexplot
In the previous section, I said there are three primary reasons for doing a multivariate analysis: to study interaction effects, to control for things, and to improve predictions. I'm going to temporarily put the third reason (prediction) aside and focus on interactions and conditioning. These two approaches have very different visualization strategies. But, before I can share those strategies, I need to show you some tools.
Tool time!
### Encoding Additional Dimensions Using Colors/Lines/Symbols or Panels
As a reminder, flexplot is essentially God-like in its ability to visualize data. For univariate data, remember all we have to do is write the variable name, followed by `~1`, then the dataset. For example...
```{r mvglmuniv}
require(flexplot)
data(avengers)
flexplot(speed~1, data=avengers)
flexplot(superpower~1, data=avengers)
```
For bivariate data, the pattern is similar, except we now put `outcome~predictor, data=dataset`. In the background, flexplot is going to figure out whether to plot a scatterplot or a beeswarm plot:
```{r mvglmbivar}
flexplot(speed~superpower, data=avengers)
flexplot(speed~agility, data=avengers)
```
Yes, flexplot can read your mind. (Just make sure your computer is password protected. Flexplot hasn't yet figured out how to steal your secret recipes from your documents folder, but it's also pretty smart...so, yeah.)
For multivariate relationships, flexplot is even more powerful. In fact, when I originally conceptualized it, my whole purpose was to develop an easy interface for multivariate visualizations. So, that's where it truly shines.
When trying to visualize interactions, it is important to see all the variables simultaneously. But how exactly do you visualize multivariate data? Alas, we humans can really only see in 3-4 dimensions (width, height, depth, and time). But that's not even true. We don't actually see depth. Instead, we rely on sharpness, size, and shading to mimic a third dimension. Even 3D movies aren't truly 3D--they just trick our vision into believing we're actually seeing in 3D.
We *could* use the same tricks for visualization to see three-dimensional data. For example,
```{r mvglm3d}
require(scatterplot3d)
data(avengers)
scatterplot3d(avengers$speed, avengers$agility, avengers$superpower)
```
(By the way, I wouldn't bother installing that package. We won't use it again.)
I just find these sorts of plots hard to understand. If I were able to view them as a hologram projection, I might find them useful. (Help us Obi Wan, you're our only hope!). But as a static picture, it's a bit tough to tell what's going on. And, 3D plots are limited to three variables; good luck visualizing four dimensions!
So what do we do instead? We can embed the third dimension in the same two-dimensional plot using a different mode of representation.
Lemme show you what I mean. Let's visualize the same data using flexplot:
```{r mvglmflex1}
flexplot(agility~speed + superpower, data=avengers, method="lm")
```
Notice how I've embedded a third variable (superpower) in the same plot as the two-dimensional data. From this, it's much easier to see the multivariate relationship: superheroes tend to decrease slightly in agility when speed increases, while non-superheroes tend to increase.
Nice.
Let's look at another example, but this time with two numeric variables:
```{r mvglmnumerictwo}
flexplot(agility~strength + speed, data=avengers, method="lm")
```
What did Flexplot do? Well, you done handed it a numeric variable and asked it to plot three variables at the same time. Since Flexplot refuses to resort to 3D plots, you gave it no other option. It decided to break speed into three equal segments (think of it as slow, average, and fast). Then it plotted the datapoints for the slow folks in red, the average folks in green, and the fast folks in blue. It also represented these levels as separate lines (dotted, dashed, and dot-dashed) and symbols (circles, squares, triangles).
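If you're curious what that binning looks like under the hood, you can roughly mimic it yourself with base R's `cut()`. This is just a sketch; flexplot chooses its own breakpoints (and labels panels with the actual value ranges), so the groups below may not match its bins exactly:

```{r mvglmbinsketch, eval=FALSE}
# Roughly mimic flexplot's binning of a numeric variable into three segments.
# (A sketch only -- flexplot's actual breakpoints and labels may differ.)
avengers$speed_bin <- cut(avengers$speed, breaks = 3)
table(avengers$speed_bin)  # how many cases fall in each segment
```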
But now it's hard to see what's going on. Sure, we can see the lines alright, but the raw data are such a jumbled mess it's impossible to tell whether the regression lines actually fit the data.
Quite a conundrum, that.
So, what to do? Well, let's investigate another technique called paneling. Paneling means that, rather than plotting the three levels of the speed variable as separate lines/colors/symbols, flexplot plots each of the three levels in its own scatterplot (called a panel). To specify a panel, we use the code `y~x | z`. In other words, we replace the `+` with a `|`. (That button's right above the return key.)
Let's take a look, see:
```{r mvglmpanel}
flexplot(agility~strength | speed, data=avengers)
```
Now we're talking! There's no overlap now, and we can see that some of these relationships are actually quite curvilinear (particularly the strength/agility relationship for the slow folks).
Maybe we'll model that curvilinearity with some quadratic lines:
```{r mvglmquadratic}
flexplot(agility~strength | speed, data=avengers, method="quadratic")
```
That's better.
But now we have a new problem; when the third variable was in the same panel, it was easy to make comparisons across the third variable. But now, it's much harder. For example, in the image above, which curved line is higher on agility? The average speed folks or the slow folks?
Hard to say.
But, we can use a visual aid called a "ghost line." A ghost line repeats the line from one panel to another panel. This will make it much easier to compare lines across panels. Let's look at an example:
```{r mvglmghost}
flexplot(agility~strength | speed, data=avengers,
method="quadratic",
ghost.line="red")
```
Now we're getting somewhere! As you can see, the ghost line duplicates the relationship from the middle panel across to the other panels. Now it's very clear that the curve from the slow group (right-most plot) is higher in agility than the other two.
```{block, type='rmdnote'}
*Why call it 'Ghost Line'?*
Ghost line is kind of a funny name, isn't it? How did I come up with it? Well, back when I was a teenager, the Nintendo 64 came out. That, my friends, was a brilliant system. One of its hallmark games was Mario Kart. In Mario Kart, you could choose which character to race with (e.g., Bowser, Mario, Luigi), then you would circle a track, trying to avoid shell-missiles.
Anyhow, as I recall, whenever you were racing and set a speed record, the game would actually record your every move during the entire race. You then had the option to turn on the "ghost racer." This ghost racer was a semi-transparent image of the exact race that holds the record. The basic idea was that you could try to race against the best race ever recorded.
Cool idea and cool concept. So much so, that I decided I would borrow the idea. Like in Mariokart, the ghost line doesn't actually reflect anything in that particular panel, but it is a representation of something outside the panel.
```
Ghost lines, it turns out, are essential for doing multivariate analyses, particularly when we start adding even *more* variables to the analysis. They make it easy to compare fits across panels.
Let's look at another example, but this time we'll panel on two variables:
```{r mvglmghost2}
flexplot(shots.taken~agility | speed + superpower, data=avengers,
method="quadratic",
ghost.line="red")
```
Notice how *nearly every single datapoint* in the superhero plots (the bottom panels) is below the fit of the ghost line. In other words, superheroes are taking far fewer shots than non-superheroes. We would not be able to see this without the ghost line.
We'll see several examples in later chapters where the ghost line is essential to helping us figure out what's going on.
By the way, you may have noticed that the ghost line generally picks the middle-most panel as the reference. But you can tell Flexplot which panel to reference. Suppose, for example, we wanted to reference the fast panel for superheroes. To do so, we need to specify a list that tells flexplot which speed and superpower values to reference:
```{r mvglmghostref}
flexplot(shots.taken~agility | speed + superpower, data=avengers,
method="quadratic",
ghost.line="red",
ghost.reference = list(speed=4.5, superpower="yes"))
```
Here, we've chosen a value for speed that is in the range of the left-most panel (4.5 is within the 4.3-4.9 range) and told flexplot to choose "yes" as the superpower value. We could also specify slow non-superheroes:
```{r mvglmghostref2}
flexplot(shots.taken~agility | speed + superpower, data=avengers,
method="quadratic",
ghost.line="orange",
ghost.reference = list(speed=5.2, superpower="no"))
```
Notice I also changed the color of the ghost line to orange.
By the way, you can (realistically) visualize up to four predictor variables. For example:
```{r mvglmfourvars}
flexplot(shots.taken~agility + superpower | speed + strength, data=avengers,
method="lm",
ghost.line="black")
```
Now, Flexplot is representing the second variable as different lines/colors/symbols while paneling the remaining two variables (speed and strength). Also, the ghost line is referencing off of the middle-most panel for the superheroes.
Don't worry about interpreting this plot yet; I'll soon give you some cool tricks you can use to figure out what's going on here. The important take-home message for you is understanding how you can control the display of flexplot.
In summary, flexplot visualizes multivariate data by encoding additional variables as either a separate color/symbol/line, or in separate panels. To encode a variable as a color/line/symbol, use `y~x+z`. To encode as separate panels, use `y~x|z`. When we panel, it makes it hard to compare fits, so we can use "ghost lines" to repeat the pattern from one panel to another.
## What are we looking for when studying a flexplot visual?
So, these graphics are all well and good, but what's the point? What information do they tell us? Or, maybe the better question is what are we looking for when studying a flexplot visual?
That's a fantastic question! Pat yourself on the back and buy yourself some ice cream :)
There are three things we're looking for when looking at a Flexplot graphic:
(1) Trends on the X-axis. This indicates a main effect of the variable on the X-axis.
(2) Nonparallel lines. This indicates there's an interaction between the variable on the X-axis and one of the other variables.
(3) Nonlinear effects. This is why Flexplot defaults to showing loess lines: if there are nonlinear effects, the loess lines will reveal them.
I know, that was a lot of information. Let's go ahead and talk about each of these buggers, one at a time, shall we?
### Identifying trends in Flexplot
You already know this. Trust me. Remember, back in our [bivariate visualization chapter](#bivariate_visuals), we identified an association between two variables by looking for a slope between X and Y. It's no different for multivariate plots.
The image below shows two plots: the plot on the left has a trend; the plot on the right has little to none.
```{r mvglmtrend}
require(tidyverse)
set.seed(12332224)
x = rnorm(900)
y = .5*x + rnorm(length(x), 0, sqrt(1-.5^2))
z = rnorm(length(x))
d = data.frame(x=x, y=y, z=z)
a = flexplot(y~x | z, data=d, method="lm", se=F) + labs(title="Flexplot with a Trend (Slope)") +
theme(plot.title = element_text(size=18))
b = flexplot(y~z | x, data=d, method="lm", se=F) + labs(title="Flexplot without a Trend (Slope)") +
theme(plot.title = element_text(size=18))
require(patchwork)
a + b
```
### Identifying nonparallel lines in Flexplot
I'm not going to get into too much detail on this one in this chapter. That's the subject of my [chapter on interaction effects](#mvinteractions). But it's helpful for you to know this now, so you know why I'm telling you all this information.
Let's look at a graphic first. The plot on the left shows mostly parallel lines (with a red ghost line). What does that mean? It says that the relationship between X and Y is consistent: no matter the level of Z, the relationship between X and Y is consistently strong and positive. In the right plot, on the other hand, the relationship between X and Y changes drastically depending on the level of Z; when Z is low, the relationship is essentially zero. As Z increases, the relationship between X and Y strengthens.
```{r mvglminteraction, echo=FALSE}
set.seed(123324)
n = 800
x = rnorm(n)
z1 = rnorm(n)
z2 = rnorm(n)
y1 = .5*x + .3*z1 + rnorm(length(x), 0, sqrt(1-.5^2))
y2 = .5*x + .3*z2 + .4*z2*x + rnorm(length(x), 0, sqrt(1-.5^2))
d = data.frame(x=x, y1=y1, y2=y2, z1=z1, z2=z2)
require(flexifiers)
a = (flexplot(y1~x | z1, data=d, method="lm", se=F, ghost.line="red") +
labs(title="Flexplot with Parallel Lines", x="X", y="Y") +
theme(plot.title = element_text(size=14))) %>%
modify_labels(row_panels="Z")
b = (flexplot(y2~x | z2, data=d, method="lm", se=F, ghost.line="red") +
labs(title="Flexplot with Nonparallel Lines", x="X", y="Y") +
theme(plot.title = element_text(size=14))) %>%
modify_labels(col_panels="Z")
require(patchwork)
a + b
```
We call this an "interaction effect." An interaction effect says that the effect of one variable on another (e.g., X on Y) depends on the level of a third variable (Z in this case). Interaction effects are pretty freaking important in stats--if they exist and we fail to identify them, it really screws up our model! So we want to be able to identify them. With ghost lines (and Flexplot), we have a way: if the lines are not approximately parallel, there's an interaction in the data that needs to be modeled.
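Once nonparallel lines tip you off, the interaction itself has to go into the model. As a sketch (using the simulated data from the figure above), R's formula shorthand `x * z2` expands to the two main effects plus their product, which is the interaction term:

```{r mvglmintsketch, eval=FALSE}
# Fit the interaction model: x * z2 expands to x + z2 + x:z2.
# (A sketch using the simulated data above.)
int_model <- lm(y2 ~ x * z2, data = d)
coef(int_model)  # the x:z2 coefficient estimates the interaction
```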
### Identifying nonlinear effects
Like interaction effects, nonlinear effects are important to detect, if they're there. If we miss them (and don't model them), it really screws up our model and we might not even know it!
Fortunately, it's easy to identify nonlinear effects. We just ask Flexplot to show loess lines. Remember that loess lines bend with the data. So, if there's a bend, it will show that bend. Here's an example below. The plot on the left shows no nonlinear effects. The one on the right shows a nonlinear effect. When looking at a flexplot graphic with multiple panels, it's important to look for evidence of nonlinearity in any panel.
```{r mvglmnonlinear, echo=FALSE}
set.seed(123324)
n = 800
x = rnorm(n)
y1 = .5*x + rnorm(length(x), 0, sqrt(1-.5^2))
y2 = .5*x -.2*x^2 + rnorm(length(x), 0, sqrt(1-.5^2))
d = data.frame(x=x, y1=y1, y2=y2)
a = flexplot(y1~x , data=d, method="loess", se=F) +
labs(title="Flexplot without nonlinear effects", x="X", y="Y") +
theme(plot.title = element_text(size=14))
b = flexplot(y2~x , data=d, method="loess", se=F) +
labs(title="Flexplot with nonlinear effects", x="X", y="Y") +
theme(plot.title = element_text(size=14))
require(patchwork)
a + b
```
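And, once a loess line reveals a bend, you can model it, for example with a quadratic term. A minimal sketch using the simulated data above (`I()` protects `x^2` inside the formula):

```{r mvglmquadsketch, eval=FALSE}
# Model the bend with a quadratic term; I() protects x^2 in the formula.
curved_model <- lm(y2 ~ x + I(x^2), data = d)
coef(curved_model)
flexplot(y2 ~ x, data = d, method = "quadratic")  # matching visualization
```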
These multipaneled flexplots are kind of tricky to interpret, but they're pretty necessary. They allow us to detect interaction/nonlinear effects. Assuming there are none of these sorts of effects, we can use a *much simpler visualization tool* called Added Variable Plots.
But, that condition is pretty important, so I'm going to say it again, but with bold this time: **added variable plots are an easy way to interpret multivariate data, but you can only use them if there are no interactions/nonlinear effects in your model.**
(This statement is actually a bit simplified, and you can actually use them for visualizing models with these sorts of effects, but I'm not going to get into that here. If you're interested, you can read about it [in a paper I wrote](https://psyarxiv.com/avu2n/). )
### Encoding Additional Dimensions Using Added Variable Plots
Earlier I said we use multivariate analyses for two primary purposes: (1) to identify interactions, and (2) to "control" for other variables. The previous section displayed the additional variables as colors/symbols/lines or as separate panels. These plots are great for detecting interactions. (Again, we'll review how this is done later.) If there are no interactions, we can simplify the visualization immensely by using added variable plots (AVPs).
Before we talk about AVPs, though, let me offer some definitions of important terms. Let's say you know that one's social support structure affects health. You also know that depression affects health, but you want to see the unique effect of depression on health, above and beyond social structure. When doing this sort of analysis, we have three types of variables: the outcome variable (you already know what this is), the "control" variable, and the "interest" variable.
**The control variable is the variable you're trying to control for.** Usually, we don't care about this variable. In our example, social support is the control variable.
**The interest variable is the variable whose unique effect you're interested in studying.** In this example, our interest variable is depression, because that's the variable whose influence we're actually interested in studying.
You can have multiple control variables (e.g., what is the effect of depression on health, after controlling for (a) social support and (b) exercise frequency?). You could also have multiple interest variables, but this is less common.
Let me quickly tell you what AVPs do, before I go into more detail:
**AVPs fit a model between the control variable(s) and the outcome, calculate the residuals, then plot those residuals against the interest variable.**
Again, I'll go over this in detail and explain *why* we do this. But, before I lose your interest or before I lose you in the details, here's an example of an AVP in R:
```{r mvglmavp1}
added.plot(agility~speed + strength, data=avengers)
```
By default, AVPs control for the first variable(s) entered (speed), then plot the last variable entered (strength) on the x-axis.
#### What Do AVPs Do and Why?
This will be easiest to explain with a simple example. (For a fuller explanation and example, see [my YouTube video on AVPs](https://youtu.be/AK1iZyY6lMo).) Let's say we have the following dataset:
```{r mvglmtab, echo=FALSE}
set.seed(121211)
cor.mat = matrix(c(
1, -.6, .9,
-.6, 1, -.2,
.9, -.2, 1
), nrow=3)
require(tidyverse)
d = data.frame(MASS::mvrnorm(n=10, mu=c(0,0,0), cor.mat))
names(d) = c("health", "depression", "support")
d = d %>% mutate(health = rescale(health, 20, 4) %>% floor_ceiling(10,30) %>% round,
depression = rescale(depression, 12,5) %>% floor_ceiling(1, 30) %>% round,
support = rescale(support, 50, 15) %>% floor_ceiling(0) %>% round)
mod = lm(health~support, data=d)
knitr::kable(d)
```
Here, `support` is our control variable. Let's go ahead and look at a scatterplot of that model:
```{r mvglmavpfitmod, echo=FALSE}
flexplot(health~support, data=d, method="lm", se=F)
```
The regression line has the equation: $\text{health} = `r round(coef(mod)[1], digits=1)` + `r round(coef(mod)[2], digits=1)`\times \text{support}$. Let's go ahead and put each person's predicted score back in the table, as well as their residual:
```{r, echo=FALSE}
mod = lm(health~support, data=d)
d$prediction = predict(mod) %>% round(1)
d$residual = residuals(mod) %>% round(1)
knitr::kable(d)
```
Just for fun, go ahead and verify that the column labeled `prediction` is equal to `r round(coef(mod)[1], digits=1)` + `r round(coef(mod)[2], digits=1)`$\times$ support. Also, go ahead and verify that the column labeled `residual` is equal to health - prediction.
Now, we have all we need to create an AVP. All we have to do is plot depression on the $X$ axis and the column labeled "residual" on the $Y$ axis:
```{r mvglmresiduals, echo=FALSE}
flexplot(residual~depression, data=d, method="lm", se=F)
```
What is this telling us? Remember what a residual means: *a residual tells us what remains in the outcome variable after removing the effect of whatever we modeled*. Here, we modeled the effect of social support on health. So, this residual represents health, after removing the effect of social support.
And that, my friends, is exactly what we were looking for! Whenever we are trying to "control" or "condition" on some variable, the appropriate visual to use is an AVP.
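If you want to see that recipe end-to-end, here's a minimal base-R sketch of the same three steps. (The data and variable names are simulated stand-ins for the health/depression/support example, not the real dataset — flexplot's `added.plot` does all of this for you.)

```r
# Simulated data: support is the control variable, depression the interest
# variable, and health the outcome (all made-up numbers for illustration)
set.seed(42)
n <- 100
support    <- rnorm(n, mean = 50, sd = 15)
depression <- 20 - 0.1 * support + rnorm(n, sd = 4)
health     <- 25 + 0.08 * support - 0.5 * depression + rnorm(n, sd = 2)

# Step 1: model the outcome using only the control variable
control_model <- lm(health ~ support)

# Step 2: the residuals are "health, after removing the effect of support"
res <- residuals(control_model)

# Step 3: plot the residuals against the interest variable
plot(depression, res, xlab = "depression", ylab = "health | support")
abline(lm(res ~ depression))
```

By construction, the residuals are uncorrelated with the control variable — which is exactly what "controlling for" support means here.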
By the way, I didn't invent AVPs. (I wish I had!). But, I have made a few cool modifications to them.
### Dustin's Cool Modifications to Added Variable Plots
I think most people would be confused looking at added variable plots. Remember, we're plotting *residuals*, which means your outcome variable will now be centered on zero. I think if reviewers see this, they might have a heart attack.
"Wait a minute!" they might say, "Your DV ranges from 10-60 and yet your Y-axis ranges from -10 to 10. Clearly the authors are bozos and have done something wrong! Reject!"
Of course, you, being the wise statistical Jedi that you are, would recognize that it's plotting a *residual*, not a raw score. To avoid confusing under-educated reviewers, Flexplot actually adds the mean of the DV back into the residuals. This ensures the Y axis is on the same scale as the original DV. Flexplot also relabels the $Y$-axis to highlight that it's showing *residuals*, not actual outcome scores. (In this case, it would say `health | support`.)
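That rescaling trick is easy to sketch in base R (simulated data again; flexplot does this internally): shift the residuals by the mean of the DV so they land back on the original scale.

```r
# Simulated outcome and control variable (made-up numbers for illustration)
set.seed(7)
n <- 50
support <- rnorm(n, mean = 50, sd = 15)
health  <- 25 + 0.1 * support + rnorm(n, sd = 3)

res <- residuals(lm(health ~ support))  # centered on zero -- alarming to reviewers
res_rescaled <- res + mean(health)      # same spread, back on the original health scale

range(res)           # straddles zero
range(res_rescaled)  # looks like the raw health scores
```

The shift changes nothing about the shape of the plot; only the position of the $Y$-axis labels moves.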
So, in short, AVPs literally fit a linear model predicting the outcome from the control variable(s), compute the residuals, then plot the interest variable against these residuals. These plots visually show us the effect of the interest variable, after controlling for the control variable(s).
These were small modifications, I admit. Rescaling the residuals and adding labels to the Y-axis is more intuitive, but the next modification is more extreme. We're going to combine the awesomeness of Flexplot with the awesomeness of AVPs.
Remember how Flexplot is able to visualize multiple variables at a time? After a bit of math, I figured out how to extend AVPs to multivariate plots. Let's look at an example:
```{r mvglmmvavp, echo=FALSE}
added.plot(weight.loss~motivation + therapy.type,
lm_formula = weight.loss~health*muscle.gain,
data=exercise_data)
```
The above plot shows the relationship between two predictor variables (therapy.type and motivation) and weight loss, after controlling for health/muscle gain (and their interaction). That's pretty sweet.
#### Using Flexplot To Do AVPs
There are really two ways to use AVPs. One way is easy, but less flexible. The other is harder, but much more flexible. If you only have one interest variable and your control variables have simple relationships with the outcome, you can create AVPs like this:
```{r mvglmavpeasy}
added.plot(ptsd~agility + north_south, data=avengers)
```
The last variable entered (in this case, `north_south`) will be treated as the interest variable. All others will be control variables. Here's an example with multiple control variables:
```{r mvglmavpeasy2}
added.plot(ptsd~agility + speed + superpower + kills + north_south, data=avengers)
```
(These plots seem to say that, even after controlling for agility, speed, superpowers, and # of kills, which battlefield one fought on [north versus south] *still* predicts ptsd).
Sometimes, however, you might want to have multiple interest variables. Or maybe you have quadratic and/or interaction effects you need to model for the control variables. In that case, you'll need to add the `lm_formula` argument to `added.plot`:
```{r mvglmcomplex}
added.plot(ptsd~willpower + north_south, data=avengers,
lm_formula = ptsd~minutes.fighting + I(minutes.fighting^2))
```
The above example fits a quadratic (nonlinear) regression between `minutes.fighting` and `ptsd`, then plots willpower and north_south against these residuals. In other words, `minutes.fighting` (and the nonlinear effect of `minutes.fighting`) are the "control variables" and willpower and north_south are the interest variables.
In short, the `lm_formula` argument simply tells R which predictors to use to create the residuals, while the formula at the beginning is interpreted exactly as it would be in flexplot.
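To make the role of the two formulas concrete, here's roughly what `added.plot` is doing under the hood, written out in base R. (The data are simulated; `minutes_fighting` and `willpower` are hypothetical stand-ins for the dataset's variables.)

```r
# Simulated data mimicking the quadratic example above
set.seed(99)
n <- 120
minutes_fighting <- runif(n, 0, 100)
willpower        <- rnorm(n, mean = 50, sd = 10)
ptsd <- 5 + 0.2 * minutes_fighting - 0.002 * minutes_fighting^2 -
  0.05 * willpower + rnorm(n, sd = 2)

# The lm_formula: residualize the outcome on the control variables,
# including the quadratic term
control_model <- lm(ptsd ~ minutes_fighting + I(minutes_fighting^2))
res <- residuals(control_model)

# The flexplot formula: model (and plot) the residuals with the
# interest variable
interest_model <- lm(res ~ willpower)
coef(interest_model)  # the slope reflects willpower's unique effect
```

The residuals are, by construction, uncorrelated with both the linear and quadratic control terms, so whatever relationship remains belongs to the interest variable.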
## Summary
For this book, "multivariate" means we have multiple independent variables. We use multivariate analyses for three reasons: (1) to model interactions, (2) to control for things, and/or (3) to predict things.
When visualizing, it's important to consider the role of each variable. In this chapter, I mentioned three roles: the dependent variable, the control variable (which we usually don't care about), and the interest variable.
To visualize multivariate data, we can either encode our third variable as a different color/line/symbol (using the syntax `y~a+b`), or we can panel that variable (using the syntax `y~a|b`). Using separate lines/colors/symbols tends to clutter the display, while paneling makes it harder to compare across panels. To make it easier to compare across panels, we can use ghost lines.
When studying paneled plots, we're looking for trends, curvilinearity, and/or interactions. Ideally, there are no interactions and we won't have to visualize multiple variables together, but can instead use added variable plots. Using added variable plots only makes sense if our variables don't interact with one another.
With AVPs, we can either use the syntax `outcome ~ control_variable + interest_variable`, where the last variable entered is the one displayed on the x-axis (while the other variables are residualized), or we can use this syntax: `added.plot(y~a + b, lm_formula = y~c + d, data=d)`. In the latter case, the formula specified in `lm_formula` tells the AVP which variables to create residuals from, while the other variable(s) (in this case `a` and `b`) are displayed using regular flexplot syntax.
## Practice
1. Describe how each of the following relationships will look. Try drawing a plot that matches your expectations. Then plot the actual relationships. How close were you?
```{r, eval=FALSE}
flexplot(superpower~1, data=avengers)
flexplot(speed~agility + superpower, data=avengers)
flexplot(agility~superpower | speed, data=avengers)
flexplot(agility~speed + strength | superpower, data=avengers)
flexplot(agility~speed | strength + superpower, data=avengers)
```
2. Suppose we wanted to plot the relationship between agility and speed, after controlling for superpower. How would you use an AVP to do that?
3. Suppose our interest variables are speed and agility, our outcome variable is ptsd, and our control variables are kills and injuries (and the interaction between the two). How would you use an AVP to visualize that?
<!--chapter:end:10-MultivariateGLMs.Rmd-->
# Multivariate GLMs: Conditioning Effects {#multivariate-glms-conditioning-effects}
Placeholder
## Multicollinearity
## Controlling by conditioning
## Conditioning is just residualizing
## All the ways of thinking about "conditioning"
## Be careful about conditioning! (And using multiple regression)
### 1. Conditioning will not prove causation.
### 2. Be Careful what you condition on
### 3. Only study and interpret the effects of the interest variable
### 4. Conditioning with interaction effects.
## When should you use a conditioning analysis?
## Additional Estimates of Interest
### Slopes
### R squared.
### Semi-Partial $R^2$
## Applied Analyses
### ANCOVA
#### Reporting Results for an ANCOVA
##### Example Reporting of ANOVA (Radical)
##### Example Reporting of ANOVA (Traditionalist)
##### Example Reporting of ANOVA (Conventionalist)
### Multiple Regression
#### Reporting of a Multiple Regression
##### Example Reporting of multiple regression (Radical)
##### Example Reporting of multiple regression (Traditionalist)
##### Example Reporting of multiple regression (Conventionalist)
<!--chapter:end:11-GLM-conditioning.Rmd-->
# Multivariate GLMs: Interaction Effects {#mvinteractions}
Placeholder
## Visualizing interaction effects
### A simple visual trick to tell if there's an interaction
## Interactions between numeric variables
### Simple Slopes Analysis
### The Flexplot Approach to Interpreting Interactions
## The LM for interaction effects
## Common things people screw up in the literature
### Gripe #1. Interpreting main effects when interactions exist
### Gripe #2: Failing to check whether interactions exist when doing an ANCOVA
## Estimates for interactions
## Applied Analyses
### Factorial ANOVA
#### Reporting Results for a Factorial ANOVA
### Multiple Regression
#### Reporting Results for a Multiple Regression
### Mediation Analysis
#### Reporting Results for a Multiple Regression
<!--chapter:end:12-GLM-Interactions.Rmd-->
# Probability
Placeholder
## Why and when we need probability?
## Finite Samples
## Infinite sets
## Infinite Sets and Sampling
## How to ensure a representative sample
## Probability Density Functions
### Computing Probabilities From PDFs
## Chapter Summary
<!--chapter:end:14-Probability.Rmd-->
# Probability Two: Bayesian Probabilities (Versus Frequentist Approaches) {#bayesprobability}
Placeholder
## A Tale of Two Roomates
### Tom's Approach
### Egon's Approach
### What do they conclude?
## The Bayesian Approach
### Strengths of the Bayesian approach
### Weaknesses/Objections to the Bayesian Approach
## Frequentist/Likelihood Description
### Strengths
### Weaknesses
## Doing Bayesian Analyses in R
<!--chapter:end:15-Probability-2.Rmd-->
# Probability 3: The Central Limit Theorem
Placeholder
## Groundhog Day
## The Central Limit Theorem
## Implications of the CLT
### Implication #1: Normality doesn't really matter
### Implication #2. If your sample size is large enough, it's quite likely your estimate is close to the true value
### Implication #3: We can (kinda sorta) make inferences about the population
## Using the CLT to Making Inferences
### Confidence intervals
#### Interpreting Confidence Intervals
### p-values
#### The logic of hypothesis testing
## Learning Objectives
<!--chapter:end:16-Probability-3.Rmd-->
# Model Comparisons
Placeholder
## Nested versus non-nested Models
## The Fit/Complexity Tradeoff
## Model comparisons are tools, not procedures
## Visual model comparisons
## The `model.comparison` function
### AIC/BIC
### Bayes Factors
### p values
### $R^2$
### The predicted values
### What if the statistics disagree?
## Hierarchical Regressions: Nested Model Comparisons in SPSS
## Non-nested models
## Summary
<!--chapter:end:17-ModelComparisons.Rmd-->
<!--chapter:end:scratch.Rmd-->