Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparison of Mixed Models Packages #9

Open
billdenney opened this issue Mar 21, 2021 · 40 comments
Open

Comparison of Mixed Models Packages #9

billdenney opened this issue Mar 21, 2021 · 40 comments

Comments

@billdenney
Copy link

Thank you so much for all your work on mixed models throughout the R ecosystem!

I found these pages as a great summary of how mixed models are available within R. One thing that I have not found is a resource comparing the features of the various R packages for fitting mixed models. I wonder if I'm missing it somewhere that you're aware of or if it may be a good fit here?

As a specific example, I need to fit a linear mixed model with different factor levels and heteroskedastic residual error by crossing the factor levels. (To ensure that I'm being clear here, I have treatment effects for treatments A and B measured daily on days 1 to 5 and treatment effects for treatments C and D measured daily on days 1 to 7; days are treated categorically; and I need to estimate unstructured covariance with treatment by day and a random effect on the intercept.)

I know that I can do unstructured residual error with nlme::lme() and that I can have the treament*day interaction for random effects with lme4::lmer(), but I don't see an easy way to do both. Looking at various resources that you have helpfully compiled over time, I see that I can beat lme4::lmer() into submission (😄) for the unstructured residual covariance matrix (though it's not immediately clear how to use the result) that you wrote at https://bbolker.github.io/mixedmodels-misc/notes/varmats.html.

So, the question/ask is: Do you know of a table that identifies the features of mixed effects modeling packages for R?

@bbolker
Copy link
Owner

bbolker commented Mar 21, 2021

No, although I've had a student work on something like this in the past (having trouble locating it at the moment). Maybe I should set up a google sheet so this can be a living/crowdsourced document? There is a comparison table in the glmmTMB JSS paper, but it focuses on zero-inflation.

@billdenney
Copy link
Author

It would be very helpful for me, and I assume others, too. Croudsourcing it seems like a good solution, if it doesn't already exist.

@jebyrnes
Copy link

I would contribute! And I bet the #rstats twitterverse would be all over this.

@bbolker
Copy link
Owner

bbolker commented Mar 21, 2021

OK, next question. (1) format? (github + pull requests would be fun, but a google sheet would be much more accessible and probably easier to edit - GH diffs on a CSV file might not be all that useful ...) (2) Once that's settled, I might put up a preliminary list of categories for review. There's a very out-of-date table from Bolker et al. 2009 MEE and a zero-inflation-focused one in Brooks et al 2017:
Bolker et al 2009
Brooks et al 2017

The possibilities are endless: LMM vs GLMM, REML or not, approx integration (PQL/Laplace/AGQ/Monte Carlo), deterministic or stochastic, Bayesian vs frequentist, choice of response distributions, choice of link functions, zero (and zero-one) inflation, covariance structures, crossed REs, dispersion models, multi-membership, ... (maybe we should just look at brms and list all of the things it can do ...)

@bbolker
Copy link
Owner

bbolker commented Mar 21, 2021

By the way @billdenney , the dispersion-modeling feature of glmmTMB (dispformula=...) might do what you want. It's less flexible than nlme's heteroscedasticity-handling with the weights argument/varFunc classes (which allow you to fit models for dispersion as functions of the predicted mean), but can handle any linear model with continuous or categorical predictors.

@billdenney
Copy link
Author

I do think of brms as a pretty inclusive list. My thought is to start simple-ish with the list we can readily come up with (and, I'm being generous to include myself in "we"). And then, to encourage others to add to the list where a new category makes sense adding all they know across all of the packages. And, I would assume that reaching out to the package authors would also be productive.

But, having been party to similar efforts in the past, I think that the most important parts are to make it simple and low-effort to contribute. So, I think that pull requests will likely get complicated quickly because the contributions won't be rows but will be spread throughout.

I like the Google sheets idea (though I've not administered one with open contributions before).

@jebyrnes
Copy link

Heck, this is an amazing start! Google doc-ing this and letting people add more packages, features, and something more extensive than a check, maybe, would be a great start! I'm sure the community will take it from there in wild ways.

@billdenney
Copy link
Author

billdenney commented Mar 22, 2021

I agree that adding more packages, features, and details would be useful. My initial thought of how to sketch it out would be:

  • Take some of the packages that we know better (again, including myself in "we" is very generous to myself)
    • Link to CRAN
    • Link to the website for the package (e.g. GitHub pages)
  • Make a list of the features by group
    • Estimation method: Frequentist or Bayesian, and allowing both of those, and maybe something else?
    • Linear, nonlinear (and these would be separate since some packages allow only linear, some allow both, and I guess there may be a package that only focuses on nonlinear)
    • Supported response variable distributions
    • Multivariate response variable support
      • (To me, this seems like it would be a subset of heteroskedastic residual variable support, but it may be important enough to include on its own.)
    • Residual variable distribution support
      • Extensible response variable distribution support (i.e. can the user easily write their own distribution function for the response variable)

For all of these, either indicate "no" or indicate how to use it (e.g. "See the 'foo' argument in the function 'bar()'" or "See ?foo" or "See the page https://example.com/details"), and blank would indicate that the answer is unknown.

Contributors should be able to add new packages, new feature groups (but hopefully that would be rare), and new features. The feature list would likely get pretty long as there will likely be a lot of features that are important but only supported by a single package.

And (getting way ahead of myself), if we really take this all the way, then it eventually becomes a JSS article and a "Mixed/Hierarchical Models" CRAN task view (edit: and I see that Ben has a ~6-year-old version of the CRAN task view here: https://bbolker.github.io/mixedmodels-misc/MixedModels.html).

@bbolker
Copy link
Owner

bbolker commented Mar 23, 2021

OK, one more (last?) question: should the primary format (for now) be

  • a table-like format (i.e. like the examples above: Google sheet, rows represent packages, columns represent features/feature sets. Cells could contain either yes/no check marks (possibly with footnotes), or lists of values from a semi-controlled vocabulary
  • a tag-list like (YAMLish or JSONish or something) format, e.g.
package: lme4
   links:
       CRAN: ...
       GitHub: ...
   model_types: LMM, GLMM
   estimation:
        framework: frequentist
        approach: deterministic
        GLMM_approx: Laplace, AGQ
   response_distributions:
         family: Poisson, binomial, NB2, Gamma
         links: identity, cloglog, logit, probit

(I'm just a little bit scared/paralyzed by the thought of developing this hierarchy/vocabulary completely, and I'm not crazy about YAML)

  • something more free-form?

@billdenney
Copy link
Author

billdenney commented Mar 23, 2021

To me, a table is the easiest way to receive contributions from others, so I pretty strongly lean that way.

And, it is readily extensible for new structure of input.

@billdenney
Copy link
Author

Related to glmmTMB/glmmTMB#522, it would be useful if a category were "Automatically handles rank-deficient fixed effects?"

@bbolker
Copy link
Owner

bbolker commented Mar 26, 2021

https://docs.google.com/spreadsheets/d/19itelYaVW0U0gtNtRfqh76ZGt1awlamNcJwT71u_5Uk/edit?usp=sharing

@billdenney
Copy link
Author

I made some tweaks, most notably adding a "Column Descriptions" worksheet so that what the column contains is clear to people entering data.

@jebyrnes
Copy link

This is great! Let me know when you want me to unleash the Twitterverse on it!

@billdenney
Copy link
Author

I added an instructions worksheet. Is this ready for twitterverse unleashing?

@bbolker
Copy link
Owner

bbolker commented Apr 22, 2021

Thanks for the bump/reminder. I'm feeling a little chicken/that a little bit more time spent fleshing this out/organizing this ourselves, or with a more limited group of contributors, would be good before we open it to the whole world. I will try to spend a little time on this today ...

What do you think about caution vs boldness?

@billdenney
Copy link
Author

I tend to be of the opinion that steps taken are better than steps planned.

If it will get out there with a bit more planning, then that is good. If the planning will get a better response, then that is also good. (For instance, I think that providing a completely blank worksheet and saying "twitter, go for it!" probably would not succeed.) And, if getting a few contributors on board would help flesh it out and get some initial momentum, then go for it.

But, if the outcome is likely the same, then I'd say push the baby out of the nest earlier so that it is a step taken.

@billdenney
Copy link
Author

Addendum: I'm also not likely the right person to make the push which means my opinion is less important. I've actually used twitter about 5 times ever, and I don't have a notable name in mixed models like you two do.

@bbolker
Copy link
Owner

bbolker commented Apr 23, 2021

I said I would get this today, but I didn't. Tomorrow! (I'll push it by the end of the day even if I don't do any more. I take your points about getting it done rather than getting it perfect.)

@bbolker
Copy link
Owner

bbolker commented Apr 23, 2021

@billdenney , is there a difference between my "dispersion_model" column and your "variance_heteroskedasticity" model???

The more I look at this the more I realize why I'm daunted/have never gotten around to it ...

@billdenney
Copy link
Author

@bbolker , now that you ask... probably not. Fair to remove that column.

It is definitely not a simple thing to do.

@bbolker
Copy link
Owner

bbolker commented Apr 26, 2021

One more question (I'm not stalling or anything :-) ) What do you think about adding the column descriptions/dictionary as a big (wrapped) cell at the top of the column? I know this violates the principle of keeping the column types homogeneous, but it seems it might be more accessible and convenient than having the definitions on a separate sheet (this occurred to me when I was adding & subtracting columns and having to go to two different sheets to keep everything synchronized ...)

@billdenney
Copy link
Author

Adding the description seems simpler to use, so I like it!

@bbolker
Copy link
Owner

bbolker commented May 18, 2021

still working .. .ugh

@bbolker
Copy link
Owner

bbolker commented Jun 3, 2021

OK, I'm finally going to go for it. Feel free to jump in and start populating.

@billdenney
Copy link
Author

Thanks!!!

@bbolker
Copy link
Owner

bbolker commented Jun 3, 2021

yeah, f***ing finally. Having opened a link for comments only (and telling people to request edit access), I'm now second-guessing myself and thinking I should just open it all the way up (anyone can edit with link) ... It would be nice to have people edit in the "suggestion mode" from Google docs but AFAICT that's not possible.

@jebyrnes
Copy link

jebyrnes commented Jun 3, 2021

YES!

@rogerssam
Copy link

Hi @bbolker et al., just came across this twitter thread. I would be keen to contribute to this and will send an edit request for the document. Just wondering what the criteria are for whether a package should be included or not? I.e. it's focused on R packages, so that obviously excludes things like SAS - but what of packages not on CRAN? What of R packages that require a licence fee? Is this limited to FOSS R packages?

@bbolker
Copy link
Owner

bbolker commented Jun 7, 2021

I don't think it should be limited to FOSS (we already have AS-REML in there, I think). We might want to add a column for license/FOSS-ness.

@bbolker
Copy link
Owner

bbolker commented Jun 7, 2021

@rogerssam
Copy link

Thank you. Asreml-R was exactly what I had in mind as that is what I'm most familiar with. It wasn't in the list that I could see, and so I was going to add it along with a couple of others. I think I have sent an edit request.

I think a "licence" or even just "FOSS (Y/N)" column would be really handy as it can be an important consideration. I agree with your comment about not wanting 200 columns, but what's the best way to go about suggesting/requesting extra columns in general? Discuss here or just go for it?

@billdenney
Copy link
Author

I like the FOSS (Y/N) options as I think that is what will matter to most users. If they want more detail, I think it's reasonable that they would go to the specific package's website to understand specifics of the license.

@rogerssam
Copy link

I also wonder if a "Last updated" column would be useful, mainly to indicate projects/packages that are stagnant or dead, but the downside is that it would obviously be fairly quickly out of date. Perhaps some indication of life rather than last updated? E.g. "project currently active" or similar?

@bbolker
Copy link
Owner

bbolker commented Jun 8, 2021

maybe last updated → "maturity" (I know there's standard terminology/badges for that: experimental, maturing, ???) but don't know if that's exactly the same thing ...

@rogerssam
Copy link

rogerssam commented Jun 10, 2021

Yeah, the lifecycle package is commonly used within tidyverse and those packages following tv style. This article mentions that the stages can be used to describe packages, functions or even arguments, however they are most commonly applied to functions.

Using that maturity terminology would be familiar to many users, but would probably require regular updating. I was trying to think of an objective way to indicate life of a package without requiring regular checking and updating. E.g. pedigreemm is one I've looked at in the past and hasn't been updated since 2014. It works, but should it be classed as stable, deprecated or superseded?

I wonder if there's a way to grab the last published date on CRAN automatically in the sheet? Would that be useful if doable?


Edit: FWIW I managed to get a Google Sheets formula to grab this info from CRAN packages if that's useful:

image

This is done in a test version of my own, not the master copy.

@billdenney
Copy link
Author

Thanks for making this happen! I used it today to find a package that supported censoring.

@bbolker
Copy link
Owner

bbolker commented Jul 4, 2021

@rogerssam "last published" sounds like a good idea. How does it work? Seems fancy for a google sheets formula (web lookup & scrape/text processing)?

@rogerssam
Copy link

rogerssam commented Jul 4, 2021

@bbolker yeah it grabs the URL, and then the CRAN details are just a table, so can be imported and I realised it could be parsed using XPath to return just the published date, or an Excel #N/A when not a CRAN URL.

The formula is just =INDEX(IMPORTXML(C25,"//table[1]//tr[td//text()[contains(., 'Published:')]]"), 2), where C25 is the URL.

I don't know if it's dynamic (i.e. it updates automatically on sheet load), or if it requires some intervention, but I will investigate some more, and see what I can do to make it dynamic if needed.

I'll add it to the sheet, but feel free to move as appropriate.

@bbolker
Copy link
Owner

bbolker commented May 13, 2024

Note that this is way out of date now. There is now a mixed models task view which provides much of this information, although not in a tabular format and not necessarily with quite as much detail ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants