
Streamline model diagnostics plots #45

Open
mine-cetinkaya-rundel opened this issue Mar 2, 2021 · 3 comments

@mine-cetinkaya-rundel (Contributor)

In IMS we're using residuals vs. predicted. In multiple regression chapters of this book we're also using residuals vs. predicted. We should use those (as opposed to residuals vs. x) in the simple linear chapter of this book too.

OpenIntroStat/ims#61 mentions it would be nice to keep this consistent across books.

@DavidDiez (Collaborator)

How I see these plots:

  1. Residuals vs fitted is best for revealing heteroscedasticity.
  2. Residuals vs x is best for revealing nonlinearity and other underlying relationships.
  3. In single-variable models, the methods communicate nearly identical information.
  4. In models with few variables, residuals vs x remains useful for heteroscedasticity, but residuals vs fitted is not obviously useful for identifying data structures not captured in the model.
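Points 1 and 2 can be seen in a minimal simulated sketch (base R; the data are made up purely for illustration, not from either book): when the noise grows with the predictor, a fan shape appears in the residual plot, and in a one-predictor model it shows up equally in residuals vs fitted and residuals vs x.

```r
# Simulated illustration (hypothetical data): noise sd grows with x,
# so the residuals fan out -- classic heteroscedasticity.
set.seed(1)
x <- runif(200, 0, 10)
y <- 2 + 3 * x + rnorm(200, sd = 0.5 * x)
mod <- lm(y ~ x)

# In a one-predictor model these two plots carry the same information:
plot(fitted(mod), resid(mod), xlab = "fitted", ylab = "residual")
plot(x, resid(mod), xlab = "x", ylab = "residual")
```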

I'm open to discussing including residuals vs fitted as well, though given the above, the motivation for it doesn't seem to arise until the multiple/logistic regression chapter, because its benefits (that I'm aware of; please let me know if I'm missing something) are only evident for models with several variables.

However, I'm very apprehensive about removing residuals vs x, which I think is the stronger diagnostic. I also think residuals vs x is conceptually easier, making it preferable for a first course in statistics: students are already familiar with seeing the predictor along the x-axis, and this plot swaps out only one variable (y --> residuals), so it shouldn't feel as new as residuals vs fitted.

@mine-cetinkaya-rundel (Contributor, Author)

In practice, the switch from res vs. x to res vs. predicted feels abrupt to students; they get confused and try to come up with rules like "for SLR you must use res vs. x, while for MLR you must use res vs. predicted." So the conceptual ease (which I agree with) tends to come at the cost of a cognitive burden later.

That being said, I'm happy to keep this conversation open for OS and reflect on our experience with how we're framing things in IMS as we update OS to the next edition (which won't happen very soon anyway). I mostly wanted to file this here to not lose the thread.

@nicholasjhorton

I think that this is a great question and I'm glad that you are discussing it. Given the increasing importance of multivariate thinking I suspect that it will be something that merits your continued attention.

I spent a fair amount of time thinking about this as I approached teaching my intro stats class this January. As you know, I dive fairly deep into descriptive multiple regression early in the course, then return to inferential multiple regression at the end of the course (with students undertaking projects where they analyze and interpret data from a multiple regression model).

My prior approach was to have students plot k+1 scatterplots (with superimposed line and smoother) when they had k quantitative predictors:

resid vs fitted
resid vs x_1
...
resid vs x_k

plus a histogram (with superimposed normal) of the residuals.

My experience is that they would get lost in a sea of plots and lose the forest for the trees. It was very common for students to say "my regression assumptions aren't met, so what's the point?"

I would encourage you to focus their attention on the residuals vs. fitted plot in multiple regression land to avoid a profusion of diagnostic plots.

To help prepare them for this, I'd encourage you to note that for single-variable models, the methods communicate nearly identical information (@DavidDiez point 3 above).

Here's an example that I used for exactly that purpose.

suppressPackageStartupMessages(library(mosaic))

# Augment the fitted model with residuals (.resid) and fitted values (.fitted)
mod1 <- lm(cesd ~ mcs, data = HELPrct) %>%
  broom::augment()

# Plot 1: data with superimposed least-squares line and smoother
gf_point(cesd ~ mcs, data = mod1) %>%
  gf_smooth() %>%
  gf_lm()
#> `geom_smooth()` using method = 'loess'

# Plot 2: residuals vs the predictor
gf_point(.resid ~ mcs, data = mod1) %>%
  gf_smooth() %>%
  gf_lm() %>%
  gf_labs(y = "residual")
#> `geom_smooth()` using method = 'loess'

# Plot 3: residuals vs the fitted values
gf_point(.resid ~ .fitted, data = mod1) %>%
  gf_smooth() %>%
  gf_lm() %>%
  gf_labs(y = "residual", x = "fitted")
#> `geom_smooth()` using method = 'loess'

Created on 2021-03-03 by the reprex package (v1.0.0)

Sample description: here we demonstrate three ways to explore the linearity and equal-variance assumptions for our model with a single quantitative predictor. Note that plots 1 and 2 are quite similar, the only difference being that the negative slope has been regressed out, so the best-fitting straight line for the residuals as a function of the predictor is horizontal. Plot 3 replaces the predictor value with the fitted (predicted) value from the model. Since the slope is negative, this flips the values (see, for example, how the three points on the right of plot 2 are now on the left-hand side of plot 3). Plots 2 and 3 also communicate very similar information because there is only one predictor in the model.

Then add a note that for models with more than one predictor, one can also generate plots of the residuals vs. the individual predictors.
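For instance, a sketch under the same setup as the example above (mosaic/ggformula/broom; the two-predictor formula cesd ~ mcs + pcs is my illustrative choice, not something from the thread):

```r
suppressPackageStartupMessages(library(mosaic))

# Hypothetical two-predictor extension of the model above
mod2 <- lm(cesd ~ mcs + pcs, data = HELPrct) %>%
  broom::augment()

# Primary diagnostic: residuals vs fitted
gf_point(.resid ~ .fitted, data = mod2) %>%
  gf_smooth() %>%
  gf_labs(y = "residual", x = "fitted")

# Optional follow-ups: residuals vs each individual predictor
gf_point(.resid ~ mcs, data = mod2) %>%
  gf_smooth() %>%
  gf_labs(y = "residual")
gf_point(.resid ~ pcs, data = mod2) %>%
  gf_smooth() %>%
  gf_labs(y = "residual")
```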

Perhaps close with a reminder that we are dealing in a multivariate world and that the model won't be perfect but we want to detect important deviations from the assumptions if we are to trust its results.

Thanks as always for your efforts on this project: it's enormously valuable.
