
glm() in R #203

Open
DOC-fau opened this issue May 22, 2020 · 7 comments


DOC-fau commented May 22, 2020

This is my first issue on GitHub - hopefully I'm doing this right. :)

I'm currently working on my first R project (an analysis of people who didn't vote in 2017). Coming from SPSS, I'm still a little spoiled and inexperienced. My glm() gives me this output (some variables are omitted for simplicity):

Call:
glm(formula = didVote ~ dutyVote + knowPol + trustBT + suppDemo + 
    finSit + reli + polDontCare + trustTV, family = binomial(link = "logit"), 
    data = main_df_logReg)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9645  -0.3386  -0.2142  -0.1442   3.2620  

Coefficients:
                                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)                      -2.21417    0.58345  -3.795 0.000148 ***
dutyVoteSTIMME EHER ZU            1.06863    0.22404   4.770 1.84e-06 ***
dutyVoteSTIMME EHER NICHT ZU      2.10750    0.26583   7.928 2.23e-15 ***
dutyVoteSTIMME GAR NICHT ZU       2.97954    0.37621   7.920 2.38e-15 ***
knowPolSTIMME EHER ZU            -0.66593    0.30260  -2.201 0.027758 *  
knowPolSTIMME EHER NICHT ZU      -1.20141    0.29884  -4.020 5.81e-05 ***
knowPolSTIMME GAR NICHT ZU       -1.74414    0.35273  -4.945 7.62e-07 ***
...
polDontCareSTIMME EHER ZU        -0.50314    0.22466  -2.240 0.025123 *  
polDontCareSTIMME EHER NICHT ZU  -0.65473    0.28136  -2.327 0.019965 *  
polDontCareSTIMME GAR NICHT ZU    0.66562    0.48578   1.370 0.170620    
trustTV2                         -0.62537    0.32471  -1.926 0.054111 .  
trustTV3                         -0.72035    0.32317  -2.229 0.025815 *  
trustTV4                         -0.43365    0.31576  -1.373 0.169648    
trustTV5                         -0.57380    0.39109  -1.467 0.142327    
trustTV6                         -0.80410    0.60836  -1.322 0.186247    
trustTVGROßES VERTRAUEN           1.16441    0.61639   1.889 0.058882 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1227.12  on 2232  degrees of freedom
Residual deviance:  882.73  on 2197  degrees of freedom
AIC: 954.73

Number of Fisher Scoring iterations: 13

Analysis of Deviance Table (Type II tests)

Response: didVote
            Df   Chisq Pr(>Chisq)    
dutyVote     3 103.182  < 2.2e-16 ***
knowPol      3  29.710  1.588e-06 ***
trustBT      6  19.173   0.003882 ** 
suppDemo     5  19.231   0.001741 ** 
finSit       4  13.648   0.008510 ** 
reli         5  20.030   0.001234 ** 
polDontCare  3  11.585   0.008947 ** 
trustTV      6  14.170   0.027794 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

With the dependent variable didVote (NAs are dropped), which indicates whether someone voted in the 2017 Bundestag election (BTW 2017):

JA (Yes): 2951
NEIN (No): 518
NA's: 8

and an example from one of my independent variables, dutyVote, which captures approval of the statement "In a democracy it is everyone's duty to vote regularly." (I'm assuming that dutyVote is metric (Likert scale).):

STIMME VOLL ZU (agree totally):     2555
STIMME EHER ZU (agree partly):     598
STIMME EHER NICHT ZU (disagree partly):     205
STIMME GAR NICHT ZU (disagree totally):     93
NA's:     26 

It might be very difficult to interpret this output without context, but my question is more theoretical.
First, I'd like to know why my glm() doesn't print every item, but only the three you can see above. Is it due to missing significance of some items?
Second, I'd like to know if the following interpretation would be correct:
"The higher the approval of 'In a democracy it is everyone's duty to vote regularly.', the more likely it is that one voted in 2017."
How would you interpret this relationship if some items are not significant?
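
One way to put these log-odds coefficients on a more readable scale is to exponentiate them into odds ratios; a minimal sketch, assuming the fitted glm object from above is stored in a variable called model:

model <- glm(didVote ~ dutyVote + knowPol + trustBT + suppDemo +
             finSit + reli + polDontCare + trustTV,
             family = binomial(link = "logit"), data = main_df_logReg)
exp(coef(model))     # odds ratios: values > 1 raise the odds of the modelled outcome
exp(confint(model))  # profile-likelihood confidence intervals on the odds-ratio scale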

This is only an example of my glm(). I will probably recode some variables, because I'm not positive about their scales of measurement (e.g. dutyVote).

Many results from my previous searches don't use a metric level of measurement. If you want to reproduce my glm(), I could attach my R script and the data frame.
Thanks for any answers! :)

@FWisniewski44

Correct me if I'm wrong, but I think the reason R displays all but one of the levels of each variable in your output is that, for a factor (e.g. with ordinal scaling), R automatically picks the first level as the reference category by default.

You can change this category with the contrasts() function, and I remember trying it with mutate() too, so it should also be possible via the tidyverse. I did this in my seminar paper for the "Introduction to R" course - I'm going to look it up and get back to you!

@FWisniewski44

Alright, I've found the code I was looking for:

contrasts(df$var) <- contr.treatment(nlevels(df$var), base = 2)  # base = index of the level to use as reference

This should also work (at least, that was what I was hoping for when writing the seminar paper), as at that time it seemed more intuitive to me than contrasts():

df <- df %>% mutate(newVar = relevel(oldVar, ref = 1))  # ref = index (or name) of the reference level

This should enable you to choose the reference categories for your variables yourself!
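
For illustration, a minimal self-contained sketch with a made-up factor, showing how relevel() moves the reference level:

f <- factor(c("A", "B", "C", "B"))
levels(f)                    # "A" "B" "C" -- "A" is the default reference
f2 <- relevel(f, ref = "B")
levels(f2)                   # "B" "A" "C" -- "B" would now be the reference in glm() output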

As far as interpretation is concerned: for the MASS::polr model in my seminar paper, I read a lot about how extremely difficult it is to interpret this kind of regression through the numbers alone - maybe it would make things clearer if you saw the effect of the independent variables on your dependent variable via some graphics. I also think I have the code for some useful graphics in my project - again, I'll get back to you asap! Hope this helps so far.

Author

DOC-fau commented May 24, 2020

Hey Florian,
thanks for your reply! This idea of a reference category is genius - I'd completely forgotten about it! XD
But I think I will keep the reference category suggested by R. I believe R already picks the category that fits best, usually one that is very homogeneous and contains a large share of the cases. That said, thank you again for your code! It could come in handy in future analyses.

About interpretation through plots: a scatterplot is undoubtedly convenient to interpret, but I did a stepwise logistic regression with my glm, so R prints only the results of the glm and no plot. You could of course plot it afterwards, but you still inevitably end up interpreting something like an R-squared value.
Do you have a different point of view?
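
For reference, stepwise selection on a fitted glm is usually done with step() or MASS::stepAIC(); a minimal sketch, assuming model holds the fitted glm object from above:

library(MASS)
model_step <- stepAIC(model, direction = "both", trace = FALSE)  # AIC-based stepwise selection
summary(model_step)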

P.S.: Are you by any chance familiar with PCA/EFA (German: Faktorenanalyse)? I've run into another, smaller issue there.

@maxheld83
Member

Thanks for the thoroughly described issue @DOC-fau and for the generous suggestions @FWisniewski44!
This is how this is supposed to work ♥️!

I think @FWisniewski44 hit the nail on the head with the reference category.

Two more pointers that you might find interesting:

  • stats::glm(), like the rest of the default statistics functions, has pretty arcane and inconsistent output. You might want to check out the tidymodels ecosystem, which wraps modelling functions and provides a streamlined interface (see the sketch after this list). This wouldn't necessarily give you different statistics, but it might be technically interesting and a great toolset (which is what we focus on in this class).
  • I get a little worried when I see models (glm or otherwise) with that many independent variables. Such models can have all sorts of problems (overfitting, multicollinearity, p-hacking, ...). This is a bit more of a statistics question, which we're (unfortunately) not focused on in this class. But there are some things you could do to get the statistical fundamentals right:
    • do lots of plots, descriptives, models, residuals - odd things tend to show up.
    • go through all of the model assumptions (normality, heteroscedasticity, etc.) and test/plot them with tidyverse tools.
    • use some of the "fancier" ways to guard against overfitting (robustness checks, cross-validation, training vs. validation splits, etc.).
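
A minimal sketch of the streamlined interface via broom (part of the tidymodels ecosystem), reusing the formula and data names from the original post:

library(broom)
fit <- glm(didVote ~ dutyVote + knowPol, family = binomial(link = "logit"),
           data = main_df_logReg)
tidy(fit, exponentiate = TRUE, conf.int = TRUE)  # coefficients as odds ratios, in a tibble
glance(fit)                                      # one-row model summary: AIC, deviance, df, ...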

The statistics stuff, obviously, is a huge rabbit hole and a bit out of scope for this class.
That said, I also urge you to really understand any model that you ever use. The kind of cookbook statistics that is encouraged in a lot of classes (click here, do that, check that, done) is really damaging for science.

If you want to dive all in, I can't recommend Richard McElreath's Statistical Rethinking enough. There's also a free YouTube lecture series where he goes through the book.

@maxheld83
Member

And yes @DOC-fau I actually do PCA/EFA all the time, so hit me up with any questions! In some ways, it's easier than glm, because it's not inferential statistics.

@FWisniewski44

Hey @DOC-fau, I found the code I was looking for - it did the job for me in my seminar paper and helped me a lot with interpreting my MASS::polr output:

library(effects)  # provides Effect()
plot(Effect(focal.predictors = "yourVar", mod = yourModel), main = "name of your plot")

I think this could work for you too, if you want to visualise your findings in a cool, simple, yet clear way. As far as numbers are concerned, I used the r2_nagelkerke() function to get at least some sort of R-squared. But maybe that was just because of my limited statistical knowledge... 😅
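
r2_nagelkerke() is provided, for example, by the performance package; a minimal sketch, assuming model is a fitted glm object:

library(performance)
r2_nagelkerke(model)  # Nagelkerke's pseudo-R-squared for the logistic model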

And thanks also @maxheld83 for the advice on tidymodels and R. McElreath's book (and on building/interpreting models in general!). If the content is on YouTube too, that's always a big plus... 😄

Author

DOC-fau commented May 26, 2020

Thank you for your input @maxheld83 and @FWisniewski44!
I will try some of the tidymodels functions! Currently I'm trying to get used to doing analyses in R and am searching for a kind of guideline to work with in future projects.
To that end, I already tried to reduce my independent variables via PCA a few days ago, but it is challenging to keep every relevant aspect of your analysis in view (and to understand how those things actually work).

In my case, I ran a PCA prior to my regression to reduce the number of independent variables that I suspected had an influence on my dependent variable (didVote). I assumed that some variables have similar effects on didVote, like "trust in parliament" and "trust in government". After extracting some factors through the PCA, I thought about building indices like "trust in the political system" and inserting them into my glm(). Some indices and variables were significant and had good VIF values afterwards.
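
A rough sketch of that workflow, assuming the trust items are stored as numeric scores (the column names here are hypothetical):

library(car)                              # for vif()
trust_items <- main_df_logReg[, c("trust_parliament", "trust_government")]  # hypothetical numeric items
pca <- prcomp(trust_items, scale. = TRUE) # PCA on the standardised items
main_df_logReg$trustIndex <- pca$x[, 1]   # first principal component as a "trust" index
fit <- glm(didVote ~ trustIndex + dutyVote, family = binomial, data = main_df_logReg)
vif(fit)                                  # (generalized) variance inflation factors to check multicollinearity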

So I'm wondering whether my thinking and approach are right, especially because I occasionally got an error like "glm.fit: fitted probabilities numerically 0 or 1 occurred" that vanished after I excluded a variable (age).
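
That warning usually points to (quasi-)separation: some predictor value or level perfectly predicts the outcome, so its coefficient runs off towards infinity. A quick diagnostic sketch, assuming age was still a column of main_df_logReg at that point:

with(main_df_logReg, table(cut(age, breaks = 5), didVote))  # cross-tabulate age bands vs. outcome
# Age bands with a zero count in one outcome column hint at separation.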
