10-MLR-Inference.Rmd

# Inference in MLR {#mlrinference}

<!--- For HTML Only --->
`r if (!knitr:::is_latex_output()) '
$\\newcommand{\\E}{\\mathrm{E}}$
$\\newcommand{\\Var}{\\mathrm{Var}}$
$\\newcommand{\\bmx}{\\mathbf{x}}$
$\\newcommand{\\bmH}{\\mathbf{H}}$
$\\newcommand{\\bmI}{\\mathbf{I}}$
$\\newcommand{\\bmX}{\\mathbf{X}}$
$\\newcommand{\\bmy}{\\mathbf{y}}$
$\\newcommand{\\bmY}{\\mathbf{Y}}$
$\\newcommand{\\bmbeta}{\\bm{\\beta}}$
$\\newcommand{\\XtX}{\\bmX^\\mT\\bmX}$
$\\newcommand{\\mT}{\\mathsf{T}}$
$\\newcommand{\\XtXinv}{(\\bmX^\\mT\\bmX)^{-1}}$
'`

```{r include=FALSE}
library(tidyverse)
library(broom)
```

## What kind of hypothesis test?

In multiple linear regression, there are three types of questions we could ask about the importance of the predictor variables. They differ by whether they involve one predictor, all predictors, or a subset of predictors.


**Single Variable**  

* What is the importance of a specific predictor?  
* This question can be addressed with a T-test (Section \@ref(mlrttest)).
* Examples 
  * What is the relationship, if any, between average air pollution levels and cardiovascular mortality rates, after controlling for temperature?
  * What is the relationship, if any, between gender and salary, controlling for years of experience and position title?
  * What is the relationship, if any, between marijuana use and opioid overdoses, controlling for socioeconomic status?
  * What is the relationship, if any, between hours spent on homework and final exam score, controlling for class level and department?


**All Variables** 

* What is the overall importance of the model?   
* This question can be addressed with a global F-test (Section \@ref(mlrftest)).
* Examples:
  * How well do temperature and air pollution explain variation in cardiovascular mortality rates?
  * Do gender, years of experience, and position title explain differences in salary?
  * Can rates of opioid overdoses be explained by rates of marijuana use and socioeconomic status?
  * Can we predict final exam score using time spent on homework, class level, and department?

**Subset of Variables** 

* Which subsets of predictors are important?
* This question can be addressed with a partial F-test  (Section \@ref(mlrpartialftest)). 
* Examples:
    * How well do temperature and air pollution explain variation in cardiovascular mortality rates, when accounting for underlying health status?
    * Do gender and years of experience explain differences in salary after adjusting for position title?
    * Can rates of opioid overdoses be explained by rates of marijuana use and socioeconomic status, when accounting for healthcare provider networks?
    * Can we predict final exam score using time spent on homework and year in school, when accounting for subject area?

<!-- Questions we can ask about our model: -->

<!-- 1. What is the importance of a specific predictor? (T-test) -->
<!-- 2. What is the overall importance of the model? (F-test) -->
<!-- 3. Which subsets of predictors are important? (Partial F-tests) -->

<!-- Question 1 Examples:  -->
<!-- Question 2 Examples: -->
<!-- We will first consider Question 1 (specific predictors), and then turn to Question 2 (overall significance) and Question 3 (groups of predictors). -->


## Photosynthesis Data

For examples in this chapter, we will use data on photosynthesis output in trees from Reich *et al.*, *Nature*, 2018.^[Reich, P.B., A. Stefanski, K.M. Sendall, and R.L. Rich. 2018. Photosynthetic data on experimentally warmed tree species in northern Minnesota, 2009-2011, used in the paper Reich et al Nature 2018. ver 2. Environmental Data Initiative. https://doi.org/10.6073/pasta/258239f68244c959de0f97c922ac313f] They measured photosynthesis output under different conditions, including variations in the amount of water in the soil and temperature in the surrounding air.

We can fit a multiple linear regression model for photosynthesis output ($Y$) that adjusts for soil moisture content ($x_1$), an indicator of whether the tree was artificially warmed ($x_2$),  and leaf temperature ($x_3$).^[For this example, we are ignoring some features of the design such as correlation by plot.] 


```{r eval=TRUE, echo=FALSE,  include=FALSE, message="hide", size="footnotesize"}
photo <- read_csv("data/photo.csv", col_types = cols())
photo <- subset(photo,
                   !is.na(soil_water) & !is.na(warming_treatment) &
                  !is.na(tleaf))
head(photo)
```

```{r eval=FALSE, echo=F, message="hide", warning=FALSE, fig.height=6, fig.width=8}
g1 <- ggplot(photo) +
  geom_point(aes(x=soil_water, y=photosyn, col=warming_treatment)) + 
  scale_color_discrete(name="Warming Treatment: ") + xlab("Soil Water Content Ratio") + ylab(expression("Photosynthesis Output ("~mu~"mol"/m^2/s~")"))
g2 <- ggplot(photo) +
  geom_point(aes(x=tleaf, y=photosyn, col=warming_treatment)) + 
  xlab("Leaf Temperature (deg C)") + ylab(expression("Photosynthesis Output ("~mu~"mol"/m^2/s~")"))
g12 <- plot_grid(g1 + theme(legend.position="none"),
                          g2 + theme(legend.position="none"),
                          labels = c("", ""),
                 ncol=2)
legend_b <- get_legend(g1 + theme(legend.position="bottom",legend.justification="center"))
plot_grid( g12, legend_b, nrow = 2, rel_heights = c(1, .2), align="hv")
```


```{r eval=TRUE, echo=TRUE}
ph_lm <- lm(photosyn~soil_water + warming_treatment  + tleaf,
            data=photo)
```

We obtain the fitted model:

\begin{equation}
\hat y = 3.89 + 40.5x_{1} + 1.4x_{2} -0.022x_3
(\#eq:photofitted)
\end{equation}


```{r eval=FALSE, echo=FALSE}
ph_lm <- lm(photosyn~soil_water + warming_treatment  + tleaf,
            data=photo)
summary(ph_lm)
```


<!-- ### Hypothesis Tests for $\beta_1$ in Photosynthesis data -->


## Hypothesis Tests for $\beta_j$ {#mlrttest}


### Scientific vs. Statistical Question
Using this model \@ref(eq:photofitted),  we could ask the **scientific question:**

Is there a relationship between soil water content ratio and photosynthesis output, *after adjusting for leaf temperature and warming treatment*?

To translate this into a statistical question, we need to isolate what represents the relationship between soil water content ratio and photosynthesis output. This is precisely $\beta_1$, since it represents the slope of the relationship between soil water content ($x_1)$ and average photosynthesis outtput ($E[Y]$), for trees with the same value of leaf temperature and warming treatment.
Thus, our corresponding **statistical question** is:

Is $\beta_{1}\ne 0$ in this model?


### General Form of $H_0$
The standard setup for a hypothesis test of a single coefficient parameter in MLR is similar to SLR. We consider null and alternative hypotheses of the form
\begin{equation}
H_0: \beta_j = \beta_{j0} \quad \text{vs.} \quad H_A: \beta_j \ne \beta_{j0}
(\#eq:mlrH0)
\end{equation}
One-sided hypotheses are also possible, although less common than the two-sided versions.
To test the null hypothesis in \@ref(eq:mlrH0), we use the $t$-statistic that has the same form as the one from SLR:
$$t = \frac{\hat\beta_j - \beta_{j0}}{\widehat{se}(\hat\beta_j)}$$
However unlike in SLR, the standard error in the denominator,  $se(\hat\beta_j) = \sqrt{\hat\sigma^2(\XtX)^{-1}_{jj}}$,  depends on $x_j$ and all of the other $x$'s. This means that the correlation between predictor variables can impact the results of the hypothesis test (and the width of confidence intervals).

When $H_0$ is true, $t$ follows a T-distribution with $n-p$ degrees of freedom. The corresponding $p$-value is computed in the usual way: $p = P(T> |t|)$. The $t$ statistic and $p$-value provided by R in standard output correspond to $\beta_{j0}=0$.

:::: {.examplebox}

```{example}
In the photosynthesis data, is there evidence of a relationship between soil water content ratio and average photosynthesis output, *after adjusting for leaf temperature and warming treatment*?
```

To answer this, we compute the test statistic and obtain:
$$t = \frac{\hat\beta_1 - \beta_{10}}{se(\hat\beta_1)} = \frac{40.5 - 0}{2.84} = 14.24$$
The corresponding $p$-value is:
$$P(|T_{1311}| > |14.24|) < 0.0001$$
Thus, we reject the null hypothesis that $\beta_1 = 0$ and conclude that there is a linear relationship between soil water content and photosynthesis output, when adjusting for warming treatment and leaf temperature.

All of the information necessary to conduct this hypothesis test is availble in the standard summaries of an `lm` object in R. For example:

```{r}
tidy(ph_lm)
```


::::


## Confidence Intervals for $\beta_j$


Confidence intervals for the $\beta$ parameters have the same form as in SLR:
\begin{equation}
(\hat\beta_j - t_{1-\alpha/2}\widehat{se}(\hat\beta_j), \hat\beta_j + t_{1-\alpha/2}\widehat{se}(\hat\beta_j))
(\#eq:mlrCI)
\end{equation}
where $t_{1-\alpha/2}$ is such that $P(T_{n-p} < t_{1-\alpha/2}) = 1-\alpha/2$.
The interval in \@ref(eq:mlrCI) is a random interval that, assuming the model is correct, includes the true value of the parameter $\beta_j$ with probability (1-$\alpha$).

The same functions that provide confidence intervals in R for SLR (`confint()` and `broom::tidy()`) provide them for models with multiple variables. If desired, the intervals can also be constructed from the individual components in  \@ref(eq:mlrCI).

:::: {.examplebox}

```{example}
To construct a confidence interval for $\beta_1$ in \@ref(eq:photofitted), we first compute the necessary elements:
```

* $\hat\beta_1 = 40.5$
* $t_{1 - \alpha/2} = 1.96$
* $\widehat{se}(\hat\beta_1) = 2.84$

We then compute the interval:

$$(40.5 - 1.96*2.84, 40.5 + 1.96*2.84) = (34.9, 46.1)$$

In R, we could accomplish this using:

```{r eval=FALSE, echo=TRUE}
t_alphaOver2 <- qt(p=0.975, df=nobs(ph_lm))
b1hat <- coef(ph_lm)[2]
seb1hat <- sqrt(diag(vcov(ph_lm)))[2]
c(Lower=b1hat - t_alphaOver2*seb1hat,
  Upper=b1hat + t_alphaOver2*seb1hat)
```


But in practice, it is much faster and simpler to use `tidy()`:
```{r eval=TRUE, echo=TRUE}
tidy(ph_lm, conf.int=TRUE, conf.level=0.95)
```

::::


## Testing for Significance of Regression (Global F-Test) {#mlrftest}


### Scientific vs. Statistical Question
A different **scientific question** we could ask about model \@ref(eq:photofitted) is:

Is there any linear relationship between all of these predictor variables (soil water content ratio, tree warming status, and leaf temperature) and photosynthesis output?

Our corresponding **statistical question** is:

Is $\beta_1 \ne 0 \text{ and/or }\beta_2 \ne 0 \text{ and/or } \beta_3 \ne 0$?


In words, this corresponds to:

$H_0$: There is no linear relationship between average photosynthesis output and soil water content ratio, tree warming status, and leaf temperature.  
$H_A$: There is a linear relationship between average photosynthesis output and soil water content ratio, tree warming status, and leaf temperature.

### Global F-test
<!-- This question is a **global F-test**. We can formalize this as the null and alternative hypotheses: -->
<!-- \begin{align*} -->
<!-- H_0:& \beta_1 = \beta_2 = \beta_3 = 0 \\ -->
<!-- H_A:& \beta_1 \ne 0 \text{ and/or }\beta_2 \ne 0 \text{ and/or } \beta_3 \ne 0 -->
<!-- \end{align*} -->

The general form of this set of hypotheses, called a **Global F-Test**, is:
\begin{align*}
H_0:& \beta_1 = \beta_2 = \cdots = \beta_k = 0 \\
H_A:& \beta_j \ne 0 \text{ for at least one \textit{j}}
\end{align*}


An important limitation to the Global F-test is that $H_A$ does not specify *which* coefficient is non-zero, only that *at least 1* is non-zero. In many cases, we might follow-up a Global F-test with an additional t-test for a specific parameter.^[However, this leads to issues of multiple comparisons that need to be addressed.]

In MLR, we use the same approach to testing $H_0$ that we used in SLR (Section \@ref(slrftest)). The F statistic is 
$$f = \frac{SS_{Reg} / df_{Reg}}{SS_{Res}/ df_{Res}}.$$
As with SLR, the sum of squares decomposition is the same:
\begin{align*}
SS_{Tot} &= \sum_{i=1}^n (y_i - \overline{y})^2\\
SS_{Reg} &= \sum_{i=1}^n (\hat y_i - \overline{y})^2\\
SS_{Res} &= \sum_{i=1}^n (y_i - \hat y_i)^2\\
SS_{Tot} &= SS_{Reg} + SS_{Res}
\end{align*}
The degrees of freedom scale up and down according to the number of predictor variables. With $k$ predictor variables, $df_{Reg} = k$. Note that this is not $p=k+1$, since we do not account for the intercept here. The denominator degrees of freedom is  $df_{Res} = n - p = n - (k + 1) = n - k - 1$. When $H_0$ is true (and assuming approximate normality), $f$ follows an $F_{df_{Reg}, df_{Res}}$ distribution. We reject $H_0$ is $f$ is large enough (using the test level $\alpha$).


:::: {.examplebox}

```{example}
In the photosynthesis data, is there evidence of a linear relationship between any of the predictor variables and the average photosynthesis output? To answer this, we compute the $f$ statistic:
```

$$f = \frac{5863/3}{36803/(1315 - 4)} = 69.6$$
The corresponding $p$-value if $P(F_{3,1311} > 69.6) < 0.0001$. We reject the null hypothesis that there is no linear relationship between average photosynthesis output and the soil water content ratio, tree warming status, and leaf temperature.

R provides the $F$ statistic and $p$-value for a test of significance of regression. This is provided at the bottom of `summary()` output:
```{r echo=TRUE}
summary(ph_lm)
```


```{r echo=FALSE, eval=F}
SSres <- sum(residuals(ph_lm)^2)
SStot <- sum((photo$photosyn- mean(photo$photosyn))^2)
SSreg <- SStot - SSres
f <- (SSreg/3)/(SSres/(nobs(ph_lm)-4))
f
pf(f, df1=3, df2=nobs(ph_lm)-4, lower=F)
```

::::

### ANOVA Table

The information for the global F-test can be summarized in an ANOVA table:

| Source of Variation | Sum of Squares | Degrees of  Freedom | MS | F |
|:-----|:------|:--:|:--:|:--:|
|Regression | $SS_{reg}$ | $k$ | $MS_{reg}$ | $MS_{reg}/MS_{res}$|
|Residual | $SS_{res}$ | $n-(k+1) = n- p$ | $MS_{res}$ | -- |
|Total | $SS_{tot}$ | $n-1$ | -- | -- |


## Testing for Subsets of Coefficients {#mlrpartialftest}

### Scientific vs. Statistical Question
A different **scientific question** we could ask about model \@ref(eq:photofitted) is:

Is there any linear relationship between average photosynthesis output and warming treatment and leaf temperature  in a model that already contains soil water content?

Our corresponding **statistical question** is:

Is $\beta_1 \ne 0 \text{ and/or }\beta_2 \ne 0$?

Compared to the question for the Global F-test, we are now posing a question about a *subset* of the predictor variables. This is the same as comparing the "full" model:
$$Y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \epsilon_i$$
and the "reduced" model
$$Y_i = \beta_0 + \beta_1x_{i1} + \epsilon_i.$$

This corresponds to the following null and alternative hypotheses:
\begin{align}
H_0:& \beta_2 = \beta_3 = 0\\
H_A:& \beta_2 \ne 0 \text{ and/or } \beta_3 \ne 0
(\#eq:mlrpartialftest23)
\end{align}


<!-- In words, this corresponds to: -->

<!-- $H_0$: There is no linear relationship between average photosynthesis output and soil water content ratio, tree warming status, and leaf temperature.   -->
<!-- $H_A$: There is a linear relationship between average photosynthesis output and soil water content ratio, tree warming status, and leaf temperature. -->


<!-- Let's now do a formal statistical test to compare: -->

<!-- \begin{align*} -->
<!-- \text{Photosyn. Output} &= \text{Soil WC}\\ -->
<!-- \text{Photosyn. Output} &= \text{Soil WC + Warming Treatment + Leaf Temp.} -->
<!-- \end{align*} -->


### Partial F-test

The hypotheses in \@ref(eq:mlrpartialftest23) can be tested using a **Partial F-test**, which calculates an $f$ statistic comparing a "full" model with all predictor variables to the "reduced" model with a subset of parameters.

<!-- The general form of this set of hypotheses, called a **Partial F-Test**, is: -->
<!-- \begin{align*} -->
<!-- H_0:& \beta_2 = \beta_3 = 0\\ -->
<!-- H_A:& \beta_2 \ne 0 \text{ and/or } \beta_3 \ne 0 -->
<!-- \end{align*} -->

The corresponding test statistic is:
\begin{equation}
f = \dfrac{\left(SS_{reg}^{Full} - SS_{reg}^{Reduced}\right)/ r}{SS_{Res}/(n - p)}
(\#eq:mlrpartialf)
\end{equation}
The numerator of $f$ is the difference in variation explained by the full model ($SS_{reg}^{Full}$) and the variation explained by the reduced model ($SS_{reg}^{Reduced}$).
This difference is scaled by the number of variables ($r$) that are being set to zero in reduced model. The denominator is, like with the Global F-test, the average amount of variability unexplained by the full model.  The $f$ statistic in \@ref(eq:mlrpartialf) is compared against an $F_{r, n-p}$ distribution to obtain the $p$-value.


Sometimes it is convenient to represent the full and reduced models compactly using vector and matrix notation. First, partition the full vector of coefficients $\boldsymbol\beta$ into two subvectors: $$\boldsymbol\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{k-r} \\ \beta_{k-r+1} \\ \vdots \\ \beta_k\end{bmatrix} = \begin{bmatrix} \boldsymbol\beta_A \\ \boldsymbol\beta_B \end{bmatrix}$$
Similarly, partition $\bmX$ into two parts:
$$\bmX = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1,k-r} & x_{1, k-r + 1} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{n,k-r} & x_{n, k-r + 1} & \cdots & x_{nk} \end{bmatrix} = \begin{bmatrix} \bmX_A & \bmX_B \end{bmatrix}$$

Without loss of generality, we assume that the reduced model corresponds to setting $\boldsymbol\beta_B = \mathbf{0}$.^[If we want to set a different subset equal to zero, just rearrange the vector.] The Full Model is: 
$$\bmy = \bmX\boldsymbol\beta + \boldsymbol\epsilon = \bmX_A\boldsymbol\beta_A + \bmX_B\boldsymbol\beta_B + \boldsymbol\epsilon$$  
The Reduced Model is: 
$$\bmy = \bmX_A\boldsymbol\beta_A + \boldsymbol\epsilon$$
This gives us the general form of the null and alternative hypotheses for a partial F-test:
$$H_0: \boldsymbol\beta_B = \mathbf{0} \text{ vs. } H_A: \boldsymbol\beta_B \ne \mathbf{0}$$
Let $SS_{Reg}(\boldsymbol\beta)$ denote the ususal regression sum of squares from the full model and $SS_{Reg}(\boldsymbol\beta_A)$ denote the regression sum of squares from the reduced model. Then $SS_{Reg}(\boldsymbol\beta) - SS_{Reg}(\boldsymbol\beta_A)$ is the "extra sum of squares" due to $\boldsymbol\beta_B$ and we can write the $f$ statistic from \@ref(eq:mlrpartialf) as
$$f = \dfrac{(SS_{Reg}(\boldsymbol\beta) - SS_{Reg}(\boldsymbol\beta_A)) / r}{SS_{Res}/(n - p)}$$
Again, we complete the one-sided test by comparing to an $F_{r, n-p}$ distribution to obtain the $p$-value.

<!-- ### Partial F-Test -- Matrix form -->


<!-- \vspace{0.5cm} -->
<!-- \footnotesize -->
<!-- \begin{tabular}{l  c c} -->
<!-- \hline -->
<!--  & Full Model & Reduced Model \\ -->
<!--  \hline -->
<!-- Model & $\bmy = \bmX\boldsymbol\beta + \boldsymbol\epsilon$ & $\bmy = \bmX_A\boldsymbol\beta_A + \boldsymbol\epsilon$ \\ -->
<!-- $\hat{\boldsymbol\beta}$ & $\hat{\boldsymbol\beta} = (\bmX^\mT\bmX)^{-1}\bmX^\mT\bmy$ & $\hat{\boldsymbol\beta}_A = (\bmX_A^\mT\bmX_A)^{-1}\bmX_A^\mT\bmy$ \\ -->
<!-- $SS_{Res}$ & $SS_{Res}(\boldsymbol\beta) = (\bmy - \bmX\hat{\boldsymbol\beta})^\mT(\bmy - \bmX\hat{\boldsymbol\beta})$ & $SS_{Res}(\boldsymbol\beta_A) = (\bmy - \bmX_A\hat{\boldsymbol\beta}_A)^\mT(\bmy - \bmX_A\hat{\boldsymbol\beta}_A)$ \\ -->
<!-- $SS_{Reg}$ & $SS_{Reg}(\boldsymbol\beta) = SS_{Tot} - SS_{Res}(\boldsymbol\beta)$ & $SS_{Reg}(\boldsymbol\beta_A) = SS_{Tot} - SS_{Res}(\boldsymbol\beta_A)$\\ -->
<!-- \hline -->
<!-- \end{tabular} -->
<!-- \vspace{0.5cm} -->

<!-- \normalsize -->


<!-- * If $H_0$ is true ($\boldsymbol\beta_B = \mathbf{0}$), then $f \sim F_{r,n-p}$.  -->
<!-- * Reject $H_0$ if $f$ is too large. -->


To perform partial F-test in R:

* Fit the full model
* Fit the reduced model
* Use `anova(reduced_mod, full_mod)`

:::: {.examplebox}

```{example}
Is there any linear relationship between average photosynthesis output and warming treatment and leaf temperature  in a model that already contains soil water content?
```

We first fit the full model:
```{r echo=TRUE}
ph_lm <- lm(photosyn~soil_water + warming_treatment  + tleaf,
            data=photo)
```
and then the reduced model:
```{r echo=TRUE}
ph_lm_reduced <- lm(photosyn~soil_water,
            data=photo)
```

For comparing the models, we use the `anova()` command:
```{r echo=TRUE}
anova(ph_lm_reduced, ph_lm)
```
The column `RSS` gives $SS_{res}^{Red}$ and $SS_{res}^{Full}$. Since $SS_{res}^{Red} - SS_{res}^{Full} = SS_{reg}^{Full} - SS_{reg}^{Red}$, we can use these values to get the 
extra sum of squares, which is provided in the `Sum of Sq` column. The right two columns provide the value of $f$ and the $p$-value.

We reject $H_0$ and conclude that there is a linear relationship between average photosynthesis output and warming treatment and leaf temperature  in a model that already contains soil water content ($p < 0.0001$).

:::: 


:::: {.examplebox}

```{example}
Is there any linear relationship between average photosynthesis output and leaf temperature  in a model that already contains soil water content and an indicator of warming treatment?
```

For this example, our null hypothesis is $H_0: \beta_3 = 0$. If we use the partial F-test for this hypothesis, we obtain:

```{r echo=TRUE, size="footnotesize", output.lines=-1:-2}
ph_lm_reduced2 <- lm(photosyn~soil_water + warming_treatment, data=photo)
anova(ph_lm_reduced2, ph_lm)
```
We would thus fail to reject the null hypothesis. This is the exact same result we would have obtained if we had used a $t$-test. In fact, we once again have the relationship $f = t^2$, when the partial $f$-test is for setting *one* (and only one) coefficient equal to zero.

::::


## Adjusted $R^2$ {#adjr2}

In Section \@ref(r2), we introduced the coefficient of determination, $R^2$, as measure of model fit. It might be nice to use $R^2$ as a way to compare different models, but it suffers from one important drawback: adding a variable to a model  cannot decrease $R^2$.  In fact, $R^2$ will almost always go up when a variable is added. This is a mathematical fact, and it doesn't matter if the new variable is unrelated to the outcome or not.

To get around this, we can compare model using **adjusted $R^2$**:
\begin{align*}
R^2_{Adj} &= 1 - \frac{SS_{Res}/(n - p)}{SS_{Tot}/(n-1)}
\end{align*}
To identify the differences, it helps to compare $R^2_{Adj}$ to the regular version of $R^2$:
\begin{align*}
R^2 &= 1 - \frac{SS_{Res}}{SS_{Tot}}\\
\end{align*}
In $R^2_{Adj}$, the value of $SS_{Res}$ is divided by the residual degrees of freedom and $SS_{tot}$ is divided by the total degrees of freedom (minus one). If we were to add a variable that did not help explain the outcome variable, then $SS_{Res}$ would stay the same but $SS_{Res}/(n- p)$ would increase slightly. This results in a smaller value of $R^2_{Adj}$. As the sample size gets larger (i.e., as $n \to \infty$), then $R^2_{Adj}$ approaches the value of $R^2$.

It's important to note that while we can use $R_{Adj}^2$ for model comparison, it no longer represents the percentage of variability in the outcome explained by variation in the predictors. That interpretation still belongs to $R^2$.


## Estimation and Prediction of the Mean

To complete our description of inference for the MLR model, we now turn to estimation and prediction of the mean. 

Let $\bmx_0^\mT = \begin{bmatrix} 1 & x_{01} & x_{02} & \cdots & x_{0k}\end{bmatrix}$ denote a specific set of predictor variables.  
The estimated mean corresponding to these values is 
$$\hat\mu_{0} = \bmx_0^\mT\hat{\boldsymbol\beta} = \hat\beta_0 + x_{01}\hat\beta_1  + \dots + x_{0k}\hat\beta_k$$
As with SLR, this is the same value as a predicted observation, $\hat y_0 = \bmx_0^\mT\boldsymbol\beta$.


### Confidence Intervals for the Mean
To represent uncertainty, we use a confidence interval for the mean.  This requires calculating the standard error of $\hat\mu_0$. Using matrix notation, calculating the variance of $\hat\mu_0$ is straightforward:
\begin{align*}
\Var(\hat \mu_0) & = \Var(\bmx_0^\mT\hat{\boldsymbol\beta}) \\
&= \bmx_0^\mT \Var(\hat{\boldsymbol\beta})\bmx_0 \\
&= \bmx_0^\mT\left(\sigma^2 (\bmX^\mT\bmX)^{-1}\right)\bmx_0 \\
\end{align*}
We estimate $\Var(\hat \mu_0)$ with: 
$$\widehat{\Var}(\hat \mu_0) = \bmx_0^\mT\left(\hat\sigma^2 (\bmX^\mT\bmX)^{-1}\right)\bmx_0.$$
and so the standard error is:
$$\hat{se}(\hat\mu_0) = \sqrt{\bmx_0^\mT\left(\hat\sigma^2 (\bmX^\mT\bmX)^{-1}\right)\bmx_0.}$$
This gives us the following form for the CI for the mean:
$$(\hat\mu_0 - t_{1-\alpha/2}\hat{se}(\hat\mu_0), \hat\mu_0 + t_{1-\alpha/2}\hat{se}(\hat\mu_0))$$
As before, the degrees of freedom used to calculate $t_{1 - \alpha/2}$ is $n-p$.


In `R`, we again use the `predict()` command to calculate the mean (or a prediction) and associated intervals. For the `newdata=` argument, we must now provide a data frame with the complete set of predictor variable values. Setting `interval="confidence"` will give the CI as output.

:::: {.examplebox}

```{example photo-estmean1}
What is the estimated mean photosynthesis output for trees that had a soil water content ratio of 0.15, was not warmed (so `warming_treatment="ambient"`) and had a leaf temperature of 22 degrees?
```

We can find this easily using R:
```{r echo=TRUE}
pred_data <- data.frame(soil_water=0.15,
                        warming_treatment="ambient",
                        tleaf=22)
predict(ph_lm, newdata=pred_data,
        interval="confidence")
```
The estimated mean photosynthesis output for trees with  a soil water content ratio of 0.15, that are not warmed, and have a leaf temperature of 22 degrees is 9.48 (95\% CI 9.00, 9.96).

:::: 

:::: {.examplebox}

```{example}
What is the estimated mean photosynthesis output for trees that had a soil water content ratio of 0.20, were warmed (so `warming_treatment="warmed"`) and had a leaf temperature of 19 degrees?
```

We can follow the same procedure as Example \@ref(exm:photo-estmean1), and in fact compute both answers at the same time by passing a data frame with multiple rows to `newdata`:
```{r echo=TRUE}
pred_data2 <- data.frame(soil_water=c(0.15, 0.2),
                        warming_treatment=c("ambient",
                                            "warmed"),
                        tleaf=c(22, 19))
predict(ph_lm, newdata=pred_data2,
        interval="confidence")
```


::::


### Prediction Intervals for New Observations
To make a prediction intervals, we extend the CI for the mean in the same way as SLR (Section \@ref(slrPI)).
The prediction interval is:
$$(\hat y_0 - t_{1-\alpha/2}\sqrt{\sigma^2 \left[\bmx_0^\mT\left( (\bmX^\mT\bmX)^{-1}\right)\bmx_0 + 1\right]}, \hat y_0 + t_{1-\alpha/2}\sqrt{\sigma^2 \left[\bmx_0^\mT\left( (\bmX^\mT\bmX)^{-1}\right)\bmx_0 + 1\right]})$$
The additional uncertainty in predicting a new observation, compared to a mean, can lead to quite wide prediction intervals, depending on the value of $\hat\sigma^2$.

In `R`, these can be computed by setting `interval="prediction"` inside `predict()`.


:::: {.examplebox}

```{example photo-predmean1}
What is the predicted photosynthesis output for a tree that had a soil water content ratio of 0.15, was not warmed (so `warming_treatment="ambient"`) and had a leaf temperature of 22 degrees?
```


```{r echo=TRUE}
predict(ph_lm, newdata=pred_data,
        interval="prediction")
```

The predicted photosynthesis output for a tree with  a soil water content ratio of 0.15, that was not warmed, and had a leaf temperature of 22 degrees is 9.48 (95\% PI -0.92, 19.89).

::::

## Exercises

```{exercise photo-beta2-inference}
In model \@ref(eq:photofitted), state in words what the null hypothesis $H_0: \beta_2 = 0$ means. Conduct the test at the $\alpha=0.05$ level and summarize your conclusiosn in one or two sentences.
```

```{exercise}
Calculate a 95\% confidence interval for $\beta_3$ in \@ref(eq:photofitted). Use the interval to test $H_0: \beta_3 = 0$ at the $\alpha=0.05$ level.
```