-
Notifications
You must be signed in to change notification settings - Fork 0
/
lessr_wellbeing_out.Rmd
225 lines (157 loc) · 18.6 KB
/
lessr_wellbeing_out.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
---
output:
html_document:
fig_height: 4.5
fig_width: 5.5
title: "Regression of Well.being on Working.hours"
---
***
_Sun Dec 19, 2021 at 12:11 with lessR version 4.1.4_
_Output Options: explain=TRUE, interpret=TRUE, results=TRUE, document=TRUE, code=TRUE_
***
```{r echo=FALSE}
suppressPackageStartupMessages(library(lessR)) # lessR
```
The variable of primary interest is the _response variable_, Well.being. The purpose of this analysis is to account for the values of Well.being in terms of the values of the _predictor variable_ Working.hours.
## The Data
Read the data with the `lessR` function `Read`.
The corresponding data values for the variables in the model comprise the _training data_, from which the model is estimated.
```{r}
d <- Read(from = "./data/wellbeing.xlsx")
```
Data from the following variables are available for analysis: `r xAnd(names(d))`.
## The Model
### Specified Model
Express Well.being as a linear function of one predictor variable: Working.hours.
Within the context of the model, indicate the response variable with a Y, subscripted by the variable name, $Y_{Well.being}$. Write each predictor variable as a subscript to an X. From the training data compute $\hat Y_{Well.being}$, the _fitted value_ of the response variable from the model for a specific set of values for Working.hours.
$$\hat Y_{Well.being} = b_0 + b_1 X_{Working.hours}$$
The _intercept_, $b_0$, indicates the fitted value of Well.being, for values of Working.hours all equal to zero.
The _slope coefficient_, $b_1$ , is the average change in the value of response variable, Well.being, for a one-unit increase in the value of the corresponding predictor variable. The values of these estimated coefficients only apply to the interpretation of the training data from which they were estimated.
To compute $\hat Y_{Well.being}$ from a specific set of values for Working.hours requires the estimated values of the coefficients of the model, the values of each regression coefficient, $b_j$. This estimation procedure depends on the _residual_, the difference between the actual value of Well.being for each row of data and the corresponding value fitted by the model. Define the residual as a variable across all rows of data. Use the subscript _i_ to indicate the $i^{th}$ row of data to emphasize that the expression applies to _each_ row of training data. The name of the response variable in this notation is understood and so is omitted for simplicity.
$$e_i = Y_i - \hat Y_i$$
Estimate the coefficients with ordinary least squares (OLS), which provides the one set of estimates that minimize the sum of the squared residuals, $\sum e^2_i$, across all the rows of the training data.
Accomplish the estimation and related computations with the `lessR` function `Regression`, abbreviated as `reg`. Keep graphics separate, so generate these later.
```{r}
r <- Regression(my_formula = Well.being ~ Working.hours, digits_d = 2, graphics = FALSE)
```
The output begins with a specification of the variables in the model and a brief description of the data.
```{r}
r$out_background
```
Of the `r r$n.obs` cases presented for analysis, `r r$n.keep` are retained, so the number of deleted cases due to missing data is `r r$n.obs - r$n.keep`.
### Estimated Model
The analysis of the model begins with the estimation of each sample regression coefficient, $b_j$, from the training data. Of greater interest is each corresponding population value, $\beta_j$, in the _population model_.
$$\hat Y_{Well.being} = \beta_0 + \beta_1 X_{Working.hours}$$
The associated inferential analyses for each estimate are the hypothesis test and confidence interval.
Each _t_-test evaluates the _null hypothesis_ that the corresponding _individual_ population regression coefficient is 0, here for the $j^{th}$ coefficient.
$$H_0: \beta_j=0$$
$$H_1: \beta_j \ne 0$$
The confidence interval provides the range of likely values of the corresponding $\beta_j$. Each 95% confidence interval is the margin of error on either side of the corresponding estimated intercept or slope coefficient, $b_j$.
```{r}
r$out_estimates
```
This estimated model is the linear function with estimated numeric coefficients that yield a fitted value of Well.being from the provided data value of Working.hours.
$$\hat Y_{Well.being} = `r xP(r$coefficients[1],2)` `r ifelse(sign(r$coefficients)==1, "+", "-")[2]` `r xP(abs(r$coefficients[2]),2)` X_{Working.hours}$$
This predictor variable has a _p_-value less than or equal to $\alpha$ = 0.05: _`r xAnd(names(which(r$pvalues[2:length(r$pvalues)] <= 0.05)))`_.
To extend the results beyond this sample to the population from which the sample was obtained, interpret the meaning of this corresponding coefficient in terms of its confidence interval.
With 95% confidence, for each additional unit of Working.hours, on average, Well.being changes somewhere between `r xP(r$cilb[2],2,)` to `r xP(r$ciub[2],2,)`.
## Model Fit
An estimated model is not necessarily useful. A first consideration of usefulness is that the model fits the data from which it is estimated, the training data. To what extent do the values fitted by the model, $\hat Y_{Well.being}$, match the corresponding training data values $Y_{Well.being}$? Are the individual residuals, $e_i$, typically close to their mean of zero, or are they scattered about the regression line with relatively large positive values and negative values? There are more considerations of usefulness, but a model that cannot fit its own training data is generally not useful.
### Partitioning of Variance
The analysis of fit depends on the adequacy of the model to account for the variability of the data values of Well.being, expressed in model notation as $Y_{Well.being}$. The core component of variability is the _sum of squares_, short for the sum of some type of squared deviations. The _total variability_ of $Y_{Well.being}$ depends on the deviations of its data values from its mean, $Y_{Well.being} - \bar Y_{Well.being}$, and then the resulting sums of squares, $SS_{Well.being}$.
The analysis of the residuals, $e = Y_{Well.being} - \hat Y_{Well.being}$, follows from the corresponding sum of squares, the value minimized by the least squares estimation procedure, $\sum e^2_i$ = $SS_{Residual}$. This residual sum of squares represents variation of $Y_{Well.being}$ _not_ accounted for by $\hat Y_{Well.being}$. The complement to the residual sum of squares is the Model (or Regression) sum of squares, the deviations of the fitted values about the mean, $\hat Y_{Well.being} - \bar Y_{Well.being}$.
The analysis of variance (ANOVA) partitions this total sum of squares into the residual variability, $\sum e^2_i$, and the Model sum of squares, $SS_{Model}$. The ANOVA table displays these various sources of variation.
```{r}
r$out_anova
```
$$SS_{Well.being} = SS_{Model} + SS_{Residual} = `r xP(r$anova_model["ss"],2)` + `r xP(r$anova_residual["ss"],2)` = `r xP(r$anova_total["ss"],2,, semi=TRUE)` $$
This decomposition of the sums of squares of Well.being into what is explained by the model, and what is not explained, is fundamental to assessing the fit of the model.
### Fit Indices
From the ANOVA two types of primary indicators of fit are derived: standard deviation of the residuals and several $R^2$ statistics.
```{r}
r$out_fit
```
The _standard deviation of the residuals_, $s_e$, directly assesses the variability of the data values of Well.being about the corresponding fitted values for the training data, the particular data set from which the model was estimated. Each mean square in the ANOVA table is a variance, a sum of squares divided by the corresponding degrees of freedom, _df_. By definition, the standard deviation, $s_e$ is the square root of the mean square of the residuals.
$$s_e = \sqrt{MS_{Residual}} = \sqrt{`r xP(r$anova_residual["ms"],2)`} = `r xP(r$se,2,, semi=TRUE)`$$
To interpret $s_e$ = `r xP(r$se,2,)`, consider the estimated range of 95% of the values of a normally distributed variable, which depends on the corresponding 2.5% cutoff from the $t$-distribution for df=`r r$anova_residual["df"]`: `r xP(-qt(0.025, df=r$anova_residual["df"]),3)`.
$$95\% \; Range: 2 * t_{cutoff} * s_e = 2 * `r xP(-qt(0.025, df=r$anova_residual["df"]),3)` * `r xP(r$se,2)` =
`r xP(r$resid_range,2,, semi=TRUE)`$$
This range of the residuals for the fitted values is the lower limit of the range of prediction error presented later.
A second type of fit index is $R^2$, the proportion of the overall variability of response variable Well.being that is accounted for by the model, applied to the training data, expressed either in terms of $SS_{Residual}$ or $SS_{Model}$.
$$R^2 = 1 - \frac{SS_{Residual}}{SS_{Well.being}} = \frac{SS_{Model}}{SS_{Well.being}} = \frac{`r xP(r$anova_model["ss"],2)`} {`r xP(r$anova_total["ss"],2)`} = `r xP(r$Rsq,3)` $$
Unfortunately when any new predictor variable is added to a model, useful or not, $R^2$ necessarily increases. Use the adjusted version, $R^2_{adj}$, to more appropriately compare models estimated from the same training data with different numbers of predictors. $R^2_{adj}$ helps to avoid overfitting a model because it only increases if a new predictor variable added to the model improves the fit more than would be expected by chance. The adjustment considers the number of predictor variables relative to the number of rows of data (cases). Accomplish this adjustment with the degrees of freedom, to transform each Sum of Squares to the corresponding Mean Squares_
$$R^2_{adj} = 1 - \frac{SS_{Residual} \; / \; `r r$anova_residual["df"]`}
{SS_{Well.being} \; / \; `r r$anova_total["df"]`} = 1 - \frac{MS_{Residual}}{MS_{Well.being}} = 1 - \frac{`r xP(r$anova_residual["ms"],2)`} {`r xP(r$anova_total["ms"],2)`} = `r xP(r$Rsqadj,3)`$$
From this analysis compare $R^2$ = `r xP(r$Rsq,3)` to the adjusted value of $R^2_{adj}$ = `r xP(r$Rsqadj,3)`, a difference of `r xP((r$Rsq-r$Rsqadj), 3)`. A large difference indicates that too many predictor variables in the model for the available data yielded an overfitted model.
Both $R^2$ and $R^2_{adj}$ describe the fit of the model to the training data. To generalize to prediction accuracy on _new_ data, evaluate the fit of the model to predict using the _predictive residual_ (PRE). To calculate the predictive residual for a row of data (case), first estimate the model with that case deleted, that is, from all the remaining cases in the training data, an example of a _case-deletion_ statistic. Repeat for all rows of data. $SS_{PRE}$, or PRESS, is the sum of squares of all the predictive residuals in a data set. From $SS_{PRE}$ define the predictive $R^2$, $R^2_{PRESS}$.
$$R^2_{PRESS} = 1 - \frac{SS_{PRE}}{SS_{Well.being}} = 1 - \frac{`r xP(r$PRESS,2)`} {`r xP(r$anova_total["ss"],2)`} = `r xP(r$RsqPRESS,3)` $$
Because an estimated model at least to some extent overfits the training data, $R^2_{PRESS}$ = `r xP(r$RsqPRESS,3)` is lower than both $R^2$ and $R^2_{adj}$. The value is lower, but is the more appropriate value to understand how well the model predicts new values beyond the training from which it was estimated.
Compare two nested models with the `lessR` function `Nest`. Specify the response variable Well.being, the variables in the reduced model, and then the additional variables in the full model. `Nest` also ensures to compare the same data values when there is missing data that might otherwise leave more data in the analysis of the reduced model.
`r reject <- "Reject the null hypothesis of the tested regression coefficients equal to 0 because of the small _p_-value of"`
`r accept <- "No difference of the coefficients from zero detected because of the relatively large _p_-value of"`
## Relation Between Well.being and Working.hours
How do the variables in the model relate to each other? The correlation of response variable Well.being with predictor variable Working.hours should be high.
The correlation of Well.being with Working.hours in the training data is $r$ = `r xP(r$cor[2,1],3)`.
Visually summarize the relationship of Well.being and Working.hours in the model with the scatterplot.
Plot the scatter plot separately with the `lessR` function `regPlot`. Specify option 1 to indicate this specific plot.
```{r}
regPlot(r, 1, pred.intervals=FALSE) # 1: scatter plot
```
## Analysis of Residuals and Influence
Values of Well.being fitted by the estimated model do not generally equal the corresponding data values_ Which cases (rows of data) contribute the most to this lack of fit?
The identification of cases that have a large residual and/or undue influence on the estimation of the model helps detect potential outliers. For each case, in addition to the data values, fitted value and corresponding residual, the analysis provides the following values:
* _residual_: Value of the response variable Well.being minus its fitted value, $e = Y_{Well.being} - \hat Y_{Well.being}$
* _rstudent_: Externally Studentized residual, standardized value of the residual from a model estimated without the case present
* _dffits_: Standardized difference between a fitted value with and without the case present
* _cooks_: Cook's Distance, the aggregate influence of the case on all the fitted values with each fitted value calculated with the case deleted
```{r}
r$out_residuals
```
From this analysis the five largest Cook's distances: `r xP(r$resid.max[1],2)`, `r xP(r$resid.max[2],2)`, `r xP(r$resid.max[3],2)`, `r xP(r$resid.max[4],2)` and `r xP(r$resid.max[5],2)`.
An informal criterion for evaluating the size of Cook's distance is a cutoff value of 1 to indicate too large of a large size.
No cases have a Cook's distance larger than 1 in this analysis.
## Prediction Intervals
Ultimately the analysis moves beyond the training sample. Prediction is from _new_ data values of Working.hours, what may be called the _prediction sample_. Applying these new data values to the estimated model yields the predicted values. For data values from the training sample, the fitted value and predicted value are the same, calculated from the same estimated model, but are different concepts with different interpretations.
Unfortunately, prediction is not perfect. The range of values likely to contain the actual data value for Well.being predicted from specific values of Working.hours quantifies the _prediction error_. The standard deviation of the residuals, $s_e$, assumed to be the same for all sets of values of the predictor variables, specifies the _modeling error_ of the fitted values from the training data, error due to imperfections in the model. However, for predictions of future values of Well.being, new data are collected. So sampling error of a value on the regression line, indicated with $s_{\hat Y}$, must also be considered in the assessment of prediction error. Consideration of both sources of error results in the _standard error of prediction (or forecast)
$$s_f = \sqrt{s^2_e + s^2_{\hat Y}}$$
Unlike modeling error, the amount of sampling error varies depending on the values of the predictor variables, so each row of data has its own value, $s_f$. Each prediction interval is the margin of error, the _t_-cutoff multiplied by the corresponding $s_f$, added and subtracted on either side of $\hat Y_{Well.being}$.
### Prediction Intervals from the Training Data
The analysis provides each row of data values, _as if_ they were new data, with a predicted value based on the model estimated from the training data, as well as the standard error of prediction. From these values obtain the lower and upper bounds of the corresponding 95% prediction interval. By default, only the first three, middle three and last three rows of data are presented, sufficient to indicate the ranges of prediction error encountered throughout the ranges of data values
```{r}
r$out_predict
```
The size of the prediction intervals for the range of data found in the input data table vary from a minimum of `r xP(r$pred_min_max[1], 2,)` for `r xAnd(xRow(r$pred_min_max[1]))` to a maximum of `r xP(r$pred_min_max[2], 2,)` for `r xAnd(xRow(r$pred_min_max[2]))`.
The confidence intervals for the points on the regression line, and the much larger prediction intervals for the individual data points, are illustrated with an enhancement of the original scatter plot.
Plot the scatter plot with prediction intervals separately with the `lessR` function `regPlot`. Specify option 1 to indicate this specific plot.
```{r}
regPlot(r, 1) # 1: scatter plot with prediction intervals
```
### Prediction Intervals from New Data
New data values from which to obtain a prediction, different from the training data, can be entered with the options X1_new, X2_new, up to X6_new, where each option name refers to the position of the corresponding predictor variable in the specification of the regression model. Any number of values can be specified for each predictor variable. Suppose, for example, that there are two values of interest for the predictor variable from which to make a prediction, listed below.
Working.hours: 30, 33
Re-run the analysis to obtain the prediction intervals with these specified values_
```{r}
r <- Regression(my_formula = Well.being ~ Working.hours, digits_d = 2,
X1_new=c(30,33),
graphics = FALSE)
```
Calculate the piction intervals only for the new data values_
```{r}
r$out_predict
```
The size of the prediction intervals for the range of data found in the newly specified values vary from a minimum of `r xP(r$pred_min_max[1], 2,)` for `r xAnd(xRow(r$pred_min_max[1]))` to a maximum of `r xP(r$pred_min_max[2], 2,)` for `r xAnd(xRow(r$pred_min_max[2]))`. The rows in the output display, however, are re-ordered according to the combinations of the ordered values of the predictor variables.
## Model Validity
The residuals should be independent, normal random variables with a mean of zero and constant variance.
### Distribution of Residuals
For the inferential analyses to be valid, the residuals should be normally distributed.
Violation of normality does not bias the estimates, but it does render the inferential tests invalid.
Plot the distribution of residuals separately with the `lessR` function `regPlot`. Specify option 2 to indicate this specific plot.
```{r}
regPlot(r, 2) # 2: distribution of residuals
```
### Fitted Values vs Residuals
The residuals should represent random variation, free of any pattern or structure. They should satisfy the _homoscedasticity_ assumption, randomly scattered about 0, with approximately the same level of variability across the range of the fitted values within a horizontal band around the zero-line. Otherwise they demonstrate _heteroskedasticity_.
Plot the scatter plot of fitted values with residuals separately with the `lessR` function `regPlot`. Specify option 3 to indicate this specific plot.
```{r}
regPlot(r, 3) # 3: scatter plot of fitted with residuals
```