-
Notifications
You must be signed in to change notification settings - Fork 1
/
10-MLR-Inference.Rmd
587 lines (400 loc) · 27.5 KB
/
10-MLR-Inference.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
# Inference in MLR {#mlrinference}
<!--- For HTML Only --->
`r if (!knitr:::is_latex_output()) '
$\\newcommand{\\E}{\\mathrm{E}}$
$\\newcommand{\\Var}{\\mathrm{Var}}$
$\\newcommand{\\bmx}{\\mathbf{x}}$
$\\newcommand{\\bmH}{\\mathbf{H}}$
$\\newcommand{\\bmI}{\\mathbf{I}}$
$\\newcommand{\\bmX}{\\mathbf{X}}$
$\\newcommand{\\bmy}{\\mathbf{y}}$
$\\newcommand{\\bmY}{\\mathbf{Y}}$
$\\newcommand{\\bmbeta}{\\bm{\\beta}}$
$\\newcommand{\\XtX}{\\bmX^\\mT\\bmX}$
$\\newcommand{\\mT}{\\mathsf{T}}$
$\\newcommand{\\XtXinv}{(\\bmX^\\mT\\bmX)^{-1}}$
'`
```{r include=FALSE}
library(tidyverse)
library(broom)
```
## What kind of hypothesis test?
In multiple linear regression, there are three types of questions we could ask about the importance of the predictor variables. They differ by whether they involve one predictor, all predictors, or a subset of predictors.
**Single Variable**
* What is the importance of a specific predictor?
* This question can be addressed with a T-test (Section \@ref(mlrttest)).
* Examples
* What is the relationship, if any, between average air pollution levels and cardiovascular mortality rates, after controlling for temperature?
* What is the relationship, if any, between gender and salary, controlling for years of experience and position title?
* What is the relationship, if any, between marijuana use and opioid overdoses, controlling for socioeconomic status?
* What is the relationship, if any, between hours spent on homework and final exam score, controlling for class level and department?
**All Variables**
* What is the overall importance of the model?
* This question can be addressed with a global F-test (Section \@ref(mlrftest)).
* Examples:
* How well do temperature and air pollution explain variation in cardiovascular mortality rates?
* Do gender, years of experience, and position title explain differences in salary?
* Can rates of opioid overdoses be explained by rates of marijuana use and socioeconomic status?
* Can we predict final exam score using time spent on homework, class level, and department?
**Subset of Variables**
* Which subsets of predictors are important?
* This question can be addressed with a partial F-test (Section \@ref(mlrpartialftest)).
* Examples:
* How well do temperature and air pollution explain variation in cardiovascular mortality rates, when accounting for underlying health status?
* Do gender and years of experience explain differences in salary after adjusting for position title?
* Can rates of opioid overdoses be explained by rates of marijuana use and socioeconomic status, when accounting for healthcare provider networks?
* Can we predict final exam score using time spent on homework and year in school, when accounting for subject area?
<!-- Questions we can ask about our model: -->
<!-- 1. What is the importance of a specific predictor? (T-test) -->
<!-- 2. What is the overall importance of the model? (F-test) -->
<!-- 3. Which subsets of predictors are important? (Partial F-tests) -->
<!-- Question 1 Examples: -->
<!-- Question 2 Examples: -->
<!-- We will first consider Question 1 (specific predictors), and then turn to Question 2 (overall significance) and Question 3 (groups of predictors). -->
## Photosynthesis Data
For examples in this chapter, we will use data on photosynthesis output in trees from Reich *et al.*, *Nature*, 2018.^[Reich, P.B., A. Stefanski, K.M. Sendall, and R.L. Rich. 2018. Photosynthetic data on experimentally warmed tree species in northern Minnesota, 2009-2011, used in the paper Reich et al Nature 2018. ver 2. Environmental Data Initiative. https://doi.org/10.6073/pasta/258239f68244c959de0f97c922ac313f] They measured photosynthesis output under different conditions, including variations in the amount of water in the soil and temperature in the surrounding air.
We can fit a multiple linear regression model for photosynthesis output ($Y$) that adjusts for soil moisture content ($x_1$), an indicator of whether the tree was artificially warmed ($x_2$), and leaf temperature ($x_3$).^[For this example, we are ignoring some features of the design such as correlation by plot.]
```{r eval=TRUE, echo=FALSE, include=FALSE, message="hide", size="footnotesize"}
photo <- read_csv("data/photo.csv", col_types = cols())
photo <- subset(photo,
!is.na(soil_water) & !is.na(warming_treatment) &
!is.na(tleaf))
head(photo)
```
```{r eval=FALSE, echo=F, message="hide", warning=FALSE, fig.height=6, fig.width=8}
g1 <- ggplot(photo) +
geom_point(aes(x=soil_water, y=photosyn, col=warming_treatment)) +
scale_color_discrete(name="Warming Treatment: ") + xlab("Soil Water Content Ratio") + ylab(expression("Photosynthesis Output ("~mu~"mol"/m^2/s~")"))
g2 <- ggplot(photo) +
geom_point(aes(x=tleaf, y=photosyn, col=warming_treatment)) +
xlab("Leaf Temperature (deg C)") + ylab(expression("Photosynthesis Output ("~mu~"mol"/m^2/s~")"))
g12 <- plot_grid(g1 + theme(legend.position="none"),
g2 + theme(legend.position="none"),
labels = c("", ""),
ncol=2)
legend_b <- get_legend(g1 + theme(legend.position="bottom",legend.justification="center"))
plot_grid( g12, legend_b, nrow = 2, rel_heights = c(1, .2), align="hv")
```
```{r eval=TRUE, echo=TRUE}
ph_lm <- lm(photosyn~soil_water + warming_treatment + tleaf,
data=photo)
```
We obtain the fitted model:
\begin{equation}
\hat y = 3.89 + 40.5x_{1} + 1.4x_{2} -0.022x_3
(\#eq:photofitted)
\end{equation}
```{r eval=FALSE, echo=FALSE}
ph_lm <- lm(photosyn~soil_water + warming_treatment + tleaf,
data=photo)
summary(ph_lm)
```
<!-- ### Hypothesis Tests for $\beta_1$ in Photosynthesis data -->
## Hypothesis Tests for $\beta_j$ {#mlrttest}
### Scientific vs. Statistical Question
Using this model \@ref(eq:photofitted), we could ask the **scientific question:**
Is there a relationship between soil water content ratio and photosynthesis output, *after adjusting for leaf temperature and warming treatment*?
To translate this into a statistical question, we need to isolate what represents the relationship between soil water content ratio and photosynthesis output. This is precisely $\beta_1$, since it represents the slope of the relationship between soil water content ($x_1)$ and average photosynthesis outtput ($E[Y]$), for trees with the same value of leaf temperature and warming treatment.
Thus, our corresponding **statistical question** is:
Is $\beta_{1}\ne 0$ in this model?
### General Form of $H_0$
The standard setup for a hypothesis test of a single coefficient parameter in MLR is similar to SLR. We consider null and alternative hypotheses of the form
\begin{equation}
H_0: \beta_j = \beta_{j0} \quad \text{vs.} \quad H_A: \beta_j \ne \beta_{j0}
(\#eq:mlrH0)
\end{equation}
One-sided hypotheses are also possible, although less common than the two-sided versions.
To test the null hypothesis in \@ref(eq:mlrH0), we use the $t$-statistic that has the same form as the one from SLR:
$$t = \frac{\hat\beta_j - \beta_{j0}}{\widehat{se}(\hat\beta_j)}$$
However unlike in SLR, the standard error in the denominator, $se(\hat\beta_j) = \sqrt{\hat\sigma^2(\XtX)^{-1}_{jj}}$, depends on $x_j$ and all of the other $x$'s. This means that the correlation between predictor variables can impact the results of the hypothesis test (and the width of confidence intervals).
When $H_0$ is true, $t$ follows a T-distribution with $n-p$ degrees of freedom. The corresponding $p$-value is computed in the usual way: $p = P(T> |t|)$. The $t$ statistic and $p$-value provided by R in standard output correspond to $\beta_{j0}=0$.
:::: {.examplebox}
```{example}
In the photosynthesis data, is there evidence of a relationship between soil water content ratio and average photosynthesis output, *after adjusting for leaf temperature and warming treatment*?
```
To answer this, we compute the test statistic and obtain:
$$t = \frac{\hat\beta_1 - \beta_{10}}{se(\hat\beta_1)} = \frac{40.5 - 0}{2.84} = 14.24$$
The corresponding $p$-value is:
$$P(|T_{1311}| > |14.24|) < 0.0001$$
Thus, we reject the null hypothesis that $\beta_1 = 0$ and conclude that there is a linear relationship between soil water content and photosynthesis output, when adjusting for warming treatment and leaf temperature.
All of the information necessary to conduct this hypothesis test is availble in the standard summaries of an `lm` object in R. For example:
```{r}
tidy(ph_lm)
```
::::
## Confidence Intervals for $\beta_j$
Confidence intervals for the $\beta$ parameters have the same form as in SLR:
\begin{equation}
(\hat\beta_j - t_{1-\alpha/2}\widehat{se}(\hat\beta_j), \hat\beta_j + t_{1-\alpha/2}\widehat{se}(\hat\beta_j))
(\#eq:mlrCI)
\end{equation}
where $t_{1-\alpha/2}$ is such that $P(T_{n-p} < t_{1-\alpha/2}) = 1-\alpha/2$.
The interval in \@ref(eq:mlrCI) is a random interval that, assuming the model is correct, includes the true value of the parameter $\beta_j$ with probability (1-$\alpha$).
The same functions that provide confidence intervals in R for SLR (`confint()` and `broom::tidy()`) provide them for models with multiple variables. If desired, the intervals can also be constructed from the individual components in \@ref(eq:mlrCI).
:::: {.examplebox}
```{example}
To construct a confidence interval for $\beta_1$ in \@ref(eq:photofitted), we first compute the necessary elements:
```
* $\hat\beta_1 = 40.5$
* $t_{1 - \alpha/2} = 1.96$
* $\widehat{se}(\hat\beta_1) = 2.84$
We then compute the interval:
$$(40.5 - 1.96*2.84, 40.5 + 1.96*2.84) = (34.9, 46.1)$$
In R, we could accomplish this using:
```{r eval=FALSE, echo=TRUE}
t_alphaOver2 <- qt(p=0.975, df=nobs(ph_lm))
b1hat <- coef(ph_lm)[2]
seb1hat <- sqrt(diag(vcov(ph_lm)))[2]
c(Lower=b1hat - t_alphaOver2*seb1hat,
Upper=b1hat + t_alphaOver2*seb1hat)
```
But in practice, it is much faster and simpler to use `tidy()`:
```{r eval=TRUE, echo=TRUE}
tidy(ph_lm, conf.int=TRUE, conf.level=0.95)
```
::::
## Testing for Significance of Regression (Global F-Test) {#mlrftest}
### Scientific vs. Statistical Question
A different **scientific question** we could ask about model \@ref(eq:photofitted) is:
Is there any linear relationship between all of these predictor variables (soil water content ratio, tree warming status, and leaf temperature) and photosynthesis output?
Our corresponding **statistical question** is:
Is $\beta_1 \ne 0 \text{ and/or }\beta_2 \ne 0 \text{ and/or } \beta_3 \ne 0$?
In words, this corresponds to:
$H_0$: There is no linear relationship between average photosynthesis output and soil water content ratio, tree warming status, and leaf temperature.
$H_A$: There is a linear relationship between average photosynthesis output and soil water content ratio, tree warming status, and leaf temperature.
### Global F-test
<!-- This question is a **global F-test**. We can formalize this as the null and alternative hypotheses: -->
<!-- \begin{align*} -->
<!-- H_0:& \beta_1 = \beta_2 = \beta_3 = 0 \\ -->
<!-- H_A:& \beta_1 \ne 0 \text{ and/or }\beta_2 \ne 0 \text{ and/or } \beta_3 \ne 0 -->
<!-- \end{align*} -->
The general form of this set of hypotheses, called a **Global F-Test**, is:
\begin{align*}
H_0:& \beta_1 = \beta_2 = \cdots = \beta_k = 0 \\
H_A:& \beta_j \ne 0 \text{ for at least one \textit{j}}
\end{align*}
An important limitation to the Global F-test is that $H_A$ does not specify *which* coefficient is non-zero, only that *at least 1* is non-zero. In many cases, we might follow-up a Global F-test with an additional t-test for a specific parameter.^[However, this leads to issues of multiple comparisons that need to be addressed.]
In MLR, we use the same approach to testing $H_0$ that we used in SLR (Section \@ref(slrftest)). The F statistic is
$$f = \frac{SS_{Reg} / df_{Reg}}{SS_{Res}/ df_{Res}}.$$
As with SLR, the sum of squares decomposition is the same:
\begin{align*}
SS_{Tot} &= \sum_{i=1}^n (y_i - \overline{y})^2\\
SS_{Reg} &= \sum_{i=1}^n (\hat y_i - \overline{y})^2\\
SS_{Res} &= \sum_{i=1}^n (y_i - \hat y_i)^2\\
SS_{Tot} &= SS_{Reg} + SS_{Res}
\end{align*}
The degrees of freedom scale up and down according to the number of predictor variables. With $k$ predictor variables, $df_{Reg} = k$. Note that this is not $p=k+1$, since we do not account for the intercept here. The denominator degrees of freedom is $df_{Res} = n - p = n - (k + 1) = n - k - 1$. When $H_0$ is true (and assuming approximate normality), $f$ follows an $F_{df_{Reg}, df_{Res}}$ distribution. We reject $H_0$ is $f$ is large enough (using the test level $\alpha$).
:::: {.examplebox}
```{example}
In the photosynthesis data, is there evidence of a linear relationship between any of the predictor variables and the average photosynthesis output? To answer this, we compute the $f$ statistic:
```
$$f = \frac{5863/3}{36803/(1315 - 4)} = 69.6$$
The corresponding $p$-value if $P(F_{3,1311} > 69.6) < 0.0001$. We reject the null hypothesis that there is no linear relationship between average photosynthesis output and the soil water content ratio, tree warming status, and leaf temperature.
R provides the $F$ statistic and $p$-value for a test of significance of regression. This is provided at the bottom of `summary()` output:
```{r echo=TRUE}
summary(ph_lm)
```
```{r echo=FALSE, eval=F}
SSres <- sum(residuals(ph_lm)^2)
SStot <- sum((photo$photosyn- mean(photo$photosyn))^2)
SSreg <- SStot - SSres
f <- (SSreg/3)/(SSres/(nobs(ph_lm)-4))
f
pf(f, df1=3, df2=nobs(ph_lm)-4, lower=F)
```
::::
### ANOVA Table
The information for the global F-test can be summarized in an ANOVA table:
| Source of Variation | Sum of Squares | Degrees of Freedom | MS | F |
|:-----|:------|:--:|:--:|:--:|
|Regression | $SS_{reg}$ | $k$ | $MS_{reg}$ | $MS_{reg}/MS_{res}$|
|Residual | $SS_{res}$ | $n-(k+1) = n- p$ | $MS_{res}$ | -- |
|Total | $SS_{tot}$ | $n-1$ | -- | -- |
## Testing for Subsets of Coefficients {#mlrpartialftest}
### Scientific vs. Statistical Question
A different **scientific question** we could ask about model \@ref(eq:photofitted) is:
Is there any linear relationship between average photosynthesis output and warming treatment and leaf temperature in a model that already contains soil water content?
Our corresponding **statistical question** is:
Is $\beta_1 \ne 0 \text{ and/or }\beta_2 \ne 0$?
Compared to the question for the Global F-test, we are now posing a question about a *subset* of the predictor variables. This is the same as comparing the "full" model:
$$Y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \epsilon_i$$
and the "reduced" model
$$Y_i = \beta_0 + \beta_1x_{i1} + \epsilon_i.$$
This corresponds to the following null and alternative hypotheses:
\begin{align}
H_0:& \beta_2 = \beta_3 = 0\\
H_A:& \beta_2 \ne 0 \text{ and/or } \beta_3 \ne 0
(\#eq:mlrpartialftest23)
\end{align}
<!-- In words, this corresponds to: -->
<!-- $H_0$: There is no linear relationship between average photosynthesis output and soil water content ratio, tree warming status, and leaf temperature. -->
<!-- $H_A$: There is a linear relationship between average photosynthesis output and soil water content ratio, tree warming status, and leaf temperature. -->
<!-- Let's now do a formal statistical test to compare: -->
<!-- \begin{align*} -->
<!-- \text{Photosyn. Output} &= \text{Soil WC}\\ -->
<!-- \text{Photosyn. Output} &= \text{Soil WC + Warming Treatment + Leaf Temp.} -->
<!-- \end{align*} -->
### Partial F-test
The hypotheses in \@ref(eq:mlrpartialftest23) can be tested using a **Partial F-test**, which calculates an $f$ statistic comparing a "full" model with all predictor variables to the "reduced" model with a subset of parameters.
<!-- The general form of this set of hypotheses, called a **Partial F-Test**, is: -->
<!-- \begin{align*} -->
<!-- H_0:& \beta_2 = \beta_3 = 0\\ -->
<!-- H_A:& \beta_2 \ne 0 \text{ and/or } \beta_3 \ne 0 -->
<!-- \end{align*} -->
The corresponding test statistic is:
\begin{equation}
f = \dfrac{\left(SS_{reg}^{Full} - SS_{reg}^{Reduced}\right)/ r}{SS_{Res}/(n - p)}
(\#eq:mlrpartialf)
\end{equation}
The numerator of $f$ is the difference in variation explained by the full model ($SS_{reg}^{Full}$) and the variation explained by the reduced model ($SS_{reg}^{Reduced}$).
This difference is scaled by the number of variables ($r$) that are being set to zero in reduced model. The denominator is, like with the Global F-test, the average amount of variability unexplained by the full model. The $f$ statistic in \@ref(eq:mlrpartialf) is compared against an $F_{r, n-p}$ distribution to obtain the $p$-value.
Sometimes it is convenient to represent the full and reduced models compactly using vector and matrix notation. First, partition the full vector of coefficients $\boldsymbol\beta$ into two subvectors: $$\boldsymbol\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{k-r} \\ \beta_{k-r+1} \\ \vdots \\ \beta_k\end{bmatrix} = \begin{bmatrix} \boldsymbol\beta_A \\ \boldsymbol\beta_B \end{bmatrix}$$
Similarly, partition $\bmX$ into two parts:
$$\bmX = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1,k-r} & x_{1, k-r + 1} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{n,k-r} & x_{n, k-r + 1} & \cdots & x_{nk} \end{bmatrix} = \begin{bmatrix} \bmX_A & \bmX_B \end{bmatrix}$$
Without loss of generality, we assume that the reduced model corresponds to setting $\boldsymbol\beta_B = \mathbf{0}$.^[If we want to set a different subset equal to zero, just rearrange the vector.] The Full Model is:
$$\bmy = \bmX\boldsymbol\beta + \boldsymbol\epsilon = \bmX_A\boldsymbol\beta_A + \bmX_B\boldsymbol\beta_B + \boldsymbol\epsilon$$
The Reduced Model is:
$$\bmy = \bmX_A\boldsymbol\beta_A + \boldsymbol\epsilon$$
This gives us the general form of the null and alternative hypotheses for a partial F-test:
$$H_0: \boldsymbol\beta_B = \mathbf{0} \text{ vs. } H_A: \boldsymbol\beta_B \ne \mathbf{0}$$
Let $SS_{Reg}(\boldsymbol\beta)$ denote the ususal regression sum of squares from the full model and $SS_{Reg}(\boldsymbol\beta_A)$ denote the regression sum of squares from the reduced model. Then $SS_{Reg}(\boldsymbol\beta) - SS_{Reg}(\boldsymbol\beta_A)$ is the "extra sum of squares" due to $\boldsymbol\beta_B$ and we can write the $f$ statistic from \@ref(eq:mlrpartialf) as
$$f = \dfrac{(SS_{Reg}(\boldsymbol\beta) - SS_{Reg}(\boldsymbol\beta_A)) / r}{SS_{Res}/(n - p)}$$
Again, we complete the one-sided test by comparing to an $F_{r, n-p}$ distribution to obtain the $p$-value.
<!-- ### Partial F-Test -- Matrix form -->
<!-- \vspace{0.5cm} -->
<!-- \footnotesize -->
<!-- \begin{tabular}{l c c} -->
<!-- \hline -->
<!-- & Full Model & Reduced Model \\ -->
<!-- \hline -->
<!-- Model & $\bmy = \bmX\boldsymbol\beta + \boldsymbol\epsilon$ & $\bmy = \bmX_A\boldsymbol\beta_A + \boldsymbol\epsilon$ \\ -->
<!-- $\hat{\boldsymbol\beta}$ & $\hat{\boldsymbol\beta} = (\bmX^\mT\bmX)^{-1}\bmX^\mT\bmy$ & $\hat{\boldsymbol\beta}_A = (\bmX_A^\mT\bmX_A)^{-1}\bmX_A^\mT\bmy$ \\ -->
<!-- $SS_{Res}$ & $SS_{Res}(\boldsymbol\beta) = (\bmy - \bmX\hat{\boldsymbol\beta})^\mT(\bmy - \bmX\hat{\boldsymbol\beta})$ & $SS_{Res}(\boldsymbol\beta_A) = (\bmy - \bmX_A\hat{\boldsymbol\beta}_A)^\mT(\bmy - \bmX_A\hat{\boldsymbol\beta}_A)$ \\ -->
<!-- $SS_{Reg}$ & $SS_{Reg}(\boldsymbol\beta) = SS_{Tot} - SS_{Res}(\boldsymbol\beta)$ & $SS_{Reg}(\boldsymbol\beta_A) = SS_{Tot} - SS_{Res}(\boldsymbol\beta_A)$\\ -->
<!-- \hline -->
<!-- \end{tabular} -->
<!-- \vspace{0.5cm} -->
<!-- \normalsize -->
<!-- * If $H_0$ is true ($\boldsymbol\beta_B = \mathbf{0}$), then $f \sim F_{r,n-p}$. -->
<!-- * Reject $H_0$ if $f$ is too large. -->
To perform partial F-test in R:
* Fit the full model
* Fit the reduced model
* Use `anova(reduced_mod, full_mod)`
:::: {.examplebox}
```{example}
Is there any linear relationship between average photosynthesis output and warming treatment and leaf temperature in a model that already contains soil water content?
```
We first fit the full model:
```{r echo=TRUE}
ph_lm <- lm(photosyn~soil_water + warming_treatment + tleaf,
data=photo)
```
and then the reduced model:
```{r echo=TRUE}
ph_lm_reduced <- lm(photosyn~soil_water,
data=photo)
```
For comparing the models, we use the `anova()` command:
```{r echo=TRUE}
anova(ph_lm_reduced, ph_lm)
```
The column `RSS` gives $SS_{res}^{Red}$ and $SS_{res}^{Full}$. Since $SS_{res}^{Red} - SS_{res}^{Full} = SS_{reg}^{Full} - SS_{reg}^{Red}$, we can use these values to get the
extra sum of squares, which is provided in the `Sum of Sq` column. The right two columns provide the value of $f$ and the $p$-value.
We reject $H_0$ and conclude that there is a linear relationship between average photosynthesis output and warming treatment and leaf temperature in a model that already contains soil water content ($p < 0.0001$).
::::
:::: {.examplebox}
```{example}
Is there any linear relationship between average photosynthesis output and leaf temperature in a model that already contains soil water content and an indicator of warming treatment?
```
For this example, our null hypothesis is $H_0: \beta_3 = 0$. If we use the partial F-test for this hypothesis, we obtain:
```{r echo=TRUE, size="footnotesize", output.lines=-1:-2}
ph_lm_reduced2 <- lm(photosyn~soil_water + warming_treatment, data=photo)
anova(ph_lm_reduced2, ph_lm)
```
We would thus fail to reject the null hypothesis. This is the exact same result we would have obtained if we had used a $t$-test. In fact, we once again have the relationship $f = t^2$, when the partial $f$-test is for setting *one* (and only one) coefficient equal to zero.
::::
## Adjusted $R^2$ {#adjr2}
In Section \@ref(r2), we introduced the coefficient of determination, $R^2$, as measure of model fit. It might be nice to use $R^2$ as a way to compare different models, but it suffers from one important drawback: adding a variable to a model cannot decrease $R^2$. In fact, $R^2$ will almost always go up when a variable is added. This is a mathematical fact, and it doesn't matter if the new variable is unrelated to the outcome or not.
To get around this, we can compare model using **adjusted $R^2$**:
\begin{align*}
R^2_{Adj} &= 1 - \frac{SS_{Res}/(n - p)}{SS_{Tot}/(n-1)}
\end{align*}
To identify the differences, it helps to compare $R^2_{Adj}$ to the regular version of $R^2$:
\begin{align*}
R^2 &= 1 - \frac{SS_{Res}}{SS_{Tot}}\\
\end{align*}
In $R^2_{Adj}$, the value of $SS_{Res}$ is divided by the residual degrees of freedom and $SS_{tot}$ is divided by the total degrees of freedom (minus one). If we were to add a variable that did not help explain the outcome variable, then $SS_{Res}$ would stay the same but $SS_{Res}/(n- p)$ would increase slightly. This results in a smaller value of $R^2_{Adj}$. As the sample size gets larger (i.e., as $n \to \infty$), then $R^2_{Adj}$ approaches the value of $R^2$.
It's important to note that while we can use $R_{Adj}^2$ for model comparison, it no longer represents the percentage of variability in the outcome explained by variation in the predictors. That interpretation still belongs to $R^2$.
## Estimation and Prediction of the Mean
To complete our description of inference for the MLR model, we now turn to estimation and prediction of the mean.
Let $\bmx_0^\mT = \begin{bmatrix} 1 & x_{01} & x_{02} & \cdots & x_{0k}\end{bmatrix}$ denote a specific set of predictor variables.
The estimated mean corresponding to these values is
$$\hat\mu_{0} = \bmx_0^\mT\hat{\boldsymbol\beta} = \hat\beta_0 + x_{01}\hat\beta_1 + \dots + x_{0k}\hat\beta_k$$
As with SLR, this is the same value as a predicted observation, $\hat y_0 = \bmx_0^\mT\boldsymbol\beta$.
### Confidence Intervals for the Mean
To represent uncertainty, we use a confidence interval for the mean. This requires calculating the standard error of $\hat\mu_0$. Using matrix notation, calculating the variance of $\hat\mu_0$ is straightforward:
\begin{align*}
\Var(\hat \mu_0) & = \Var(\bmx_0^\mT\hat{\boldsymbol\beta}) \\
&= \bmx_0^\mT \Var(\hat{\boldsymbol\beta})\bmx_0 \\
&= \bmx_0^\mT\left(\sigma^2 (\bmX^\mT\bmX)^{-1}\right)\bmx_0 \\
\end{align*}
We estimate $\Var(\hat \mu_0)$ with:
$$\widehat{\Var}(\hat \mu_0) = \bmx_0^\mT\left(\hat\sigma^2 (\bmX^\mT\bmX)^{-1}\right)\bmx_0.$$
and so the standard error is:
$$\hat{se}(\hat\mu_0) = \sqrt{\bmx_0^\mT\left(\hat\sigma^2 (\bmX^\mT\bmX)^{-1}\right)\bmx_0.}$$
This gives us the following form for the CI for the mean:
$$(\hat\mu_0 - t_{1-\alpha/2}\hat{se}(\hat\mu_0), \hat\mu_0 + t_{1-\alpha/2}\hat{se}(\hat\mu_0))$$
As before, the degrees of freedom used to calculate $t_{1 - \alpha/2}$ is $n-p$.
In `R`, we again use the `predict()` command to calculate the mean (or a prediction) and associated intervals. For the `newdata=` argument, we must now provide a data frame with the complete set of predictor variable values. Setting `interval="confidence"` will give the CI as output.
:::: {.examplebox}
```{example photo-estmean1}
What is the estimated mean photosynthesis output for trees that had a soil water content ratio of 0.15, was not warmed (so `warming_treatment="ambient"`) and had a leaf temperature of 22 degrees?
```
We can find this easily using R:
```{r echo=TRUE}
pred_data <- data.frame(soil_water=0.15,
warming_treatment="ambient",
tleaf=22)
predict(ph_lm, newdata=pred_data,
interval="confidence")
```
The estimated mean photosynthesis output for trees with a soil water content ratio of 0.15, that are not warmed, and have a leaf temperature of 22 degrees is 9.48 (95\% CI 9.00, 9.96).
::::
:::: {.examplebox}
```{example}
What is the estimated mean photosynthesis output for trees that had a soil water content ratio of 0.20, were warmed (so `warming_treatment="warmed"`) and had a leaf temperature of 19 degrees?
```
We can follow the same procedure as Example \@ref(exm:photo-estmean1), and in fact compute both answers at the same time by passing a data frame with multiple rows to `newdata`:
```{r echo=TRUE}
pred_data2 <- data.frame(soil_water=c(0.15, 0.2),
warming_treatment=c("ambient",
"warmed"),
tleaf=c(22, 19))
predict(ph_lm, newdata=pred_data2,
interval="confidence")
```
::::
### Prediction Intervals for New Observations
To make a prediction intervals, we extend the CI for the mean in the same way as SLR (Section \@ref(slrPI)).
The prediction interval is:
$$(\hat y_0 - t_{1-\alpha/2}\sqrt{\sigma^2 \left[\bmx_0^\mT\left( (\bmX^\mT\bmX)^{-1}\right)\bmx_0 + 1\right]}, \hat y_0 + t_{1-\alpha/2}\sqrt{\sigma^2 \left[\bmx_0^\mT\left( (\bmX^\mT\bmX)^{-1}\right)\bmx_0 + 1\right]})$$
The additional uncertainty in predicting a new observation, compared to a mean, can lead to quite wide prediction intervals, depending on the value of $\hat\sigma^2$.
In `R`, these can be computed by setting `interval="prediction"` inside `predict()`.
:::: {.examplebox}
```{example photo-predmean1}
What is the predicted photosynthesis output for a tree that had a soil water content ratio of 0.15, was not warmed (so `warming_treatment="ambient"`) and had a leaf temperature of 22 degrees?
```
```{r echo=TRUE}
predict(ph_lm, newdata=pred_data,
interval="prediction")
```
The predicted photosynthesis output for a tree with a soil water content ratio of 0.15, that was not warmed, and had a leaf temperature of 22 degrees is 9.48 (95\% PI -0.92, 19.89).
::::
## Exercises
```{exercise photo-beta2-inference}
In model \@ref(eq:photofitted), state in words what the null hypothesis $H_0: \beta_2 = 0$ means. Conduct the test at the $\alpha=0.05$ level and summarize your conclusiosn in one or two sentences.
```
```{exercise}
Calculate a 95\% confidence interval for $\beta_3$ in \@ref(eq:photofitted). Use the interval to test $H_0: \beta_3 = 0$ at the $\alpha=0.05$ level.
```