Modern scientific research often aims to quantify the importance of multiple processes in natural or human systems. Some examples from my own work in ecology and evolution consider the effects of herbivory and fertilization on standing biomass [@Gruner+2008]; the effects of bark, wood density, and fire on tree mortality [@brando_fire-induced_2012]; or the effects of taxonomic and genomic position on evolutionary rates [@ghenu_multicopy_2016]. This multifactorial approach [@mcgill_why_2016] complements, rather than replaces, the traditional hypothesis-testing or strong-inference framework [@platt_strong_1964;@fox_why_2016;@betini_why_2017].^[While there is much interesting debate over the best methods for gathering evidence to distinguish among two or more particular, intrinsically discrete hypotheses [@taper_evidential_2015], that is not the focus of this paper.]

A standard approach to analyzing multifactorial systems, particularly common in ecology, goes as follows: (1) Construct a full model that encompasses as many of the processes (and their interactions) as is feasible. (2) Fit the full model and make sure that it describes the data reasonably well (e.g. by computing $R^2$ values or estimating the degree of overdispersion). (3) Construct possible submodels of the full model by setting subsets of parameters to zero. (4) Compute information-theoretic measures of quality, such as the Akaike or Bayesian/Schwarz information criteria, for every submodel. (5) Use multi-model averaging (MMA) to estimate model-averaged parameters and confidence intervals (CIs); possibly draw conclusions about the importance of different processes by summing the information-theoretic weights [@burnham_model_2002]. I argue that this approach, even when used sensibly as its proponents advise (e.g. with a reasonable number of candidate submodels), is a poor way to address multifactorial problems.
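To make the workflow concrete, here is a minimal R sketch using the MuMIn package; the data frame `dat` and its variables are hypothetical placeholders rather than data from any of the studies cited above.

```r
## Sketch of the standard workflow; 'dat', 'biomass', 'herbivory', and
## 'fertilization' are hypothetical placeholders.
library(MuMIn)

## (1) full model including both factors and their interaction
full <- lm(biomass ~ herbivory * fertilization, data = dat,
           na.action = na.fail)     # na.fail is required by dredge()

## (2) crude adequacy check
summary(full)$r.squared

## (3)-(4) fit every submodel and rank them by AICc
subs <- dredge(full)

## (5) multimodel-averaged coefficients, confidence intervals,
##     and summed Akaike weights per term
avg <- model.avg(subs)
summary(avg)
confint(avg)
sw(subs)
```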

The goal of a multifactorial analysis is to tease apart the contributions of many processes, all of which we believe are affecting our study system to some degree. If our scientific questions are (something like) "How important is this factor, in an absolute sense or relative to other factors?", rather than "Which of these factors are having any effect at all on my system?", why are we working so hard to fit many models of which only one (the full model) incorporates all of the factors? If we do not have particular, a priori discrete hypotheses about our system (such as "process $A$ influences the outcome but process $B$ has no effect at all"), why does so much of our data-analytic effort go into various ways to test between, or combine and reconcile, multiple discrete models? In software development, this is called an "XY problem"^[http://www.perlmonks.org/?node=XY+Problem]: rather than thinking about the best way to solve our real problem $X$ (understanding multifactorial systems), we have gotten bogged down in the details of how to make a particular tool, $Y$ (multimodel approaches), provide the answers we need. Most critiques of MMA address technical concerns such as the influence of unobserved heterogeneity [@brewer_relative_2016], or criticize the misuse of information-theoretic methods by researchers [@cade_model_2015], but do not ask why we are comparing discrete models in the first place.

In contrast to averaging across discrete hypotheses, or treating a choice among discrete hypotheses as an end goal, fitting multiple models as a step in a null-hypothesis significance testing (NHST) procedure is defensible. While much maligned, NHSTs are a useful part of data analysis --- not to decide whether we really think a null hypothesis is false (they almost always are), but to see if we can distinguish signal from noise. Another interpretation is that NHSTs can test whether we can reliably determine the direction of effects --- that is, not whether the effect of a predictor on some process is zero, but whether we can tell unequivocally that it has a particular sign, positive or negative [@jones_sensible_2000; @dushoff_i_2019]. We perform these tests by statistically comparing a full model to a reduced model that pretends the effect is exactly zero.
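In a minimal R sketch (again with the hypothetical data frame `dat` and variable names), such a comparison might look like this:

```r
## Compare the full model to a reduced model that sets the herbivory
## effect to exactly zero; 'dat' and the variable names are hypothetical.
full    <- lm(biomass ~ herbivory + fertilization, data = dat)
reduced <- update(full, . ~ . - herbivory)

anova(reduced, full)         # F test of the nested comparison
confint(full, "herbivory")   # can we pin down the sign of the effect?
```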

However, researchers using multimodel approaches are not fitting one-step-reduced models to test hypotheses; rather, they are fitting a wide range of submodels, typically in the hope that model choice or multimodel averaging will help them deal with insufficient data in a multifactorial world. If we had enough information (even Big Data doesn't always provide the information we need: @mengStatistical2018), we could fit just the full model, drawing our conclusions from the estimates and CIs with all of the factors considered simultaneously. But we nearly always have too many predictors, and not enough data; we don't want to overfit (which will inflate our CIs and $p$-values to the point where we can't tell anything for sure), but at the same time we are afraid of neglecting potentially important effects.

Stepwise regression, the original strategy for separating signals from noise, is now widely deprecated because it interferes with correct statistical inference [@harrell_regression_2001;@whittingham_why_2006]. Information-theoretic tools mitigate the instability of stepwise approaches, allow simultaneous comparison of many, non-nested models, and avoid the stigma of NHST. A further step forward, multi-model averaging [@burnham_model_2002], accounts for model uncertainty and avoids focusing on a single best model. Some forms of model averaging act as simple shrinkage estimators: averaging a predictor's estimated effect across models that include it and models that exclude it adjusts the estimate toward zero [@cade_model_2015]. More recently, model averaging has experienced a backlash, as studies point out that multimodel averaging may run into trouble when variables are collinear (@freckleton_dealing_2011 [but cf. @walker_defense_2017]); when we are careless about the meaning of main effects in the presence of interactions; when we average model parameters rather than model predictions [@cade_model_2015]; or when we use summed model weights to assess the relative importance of predictors (@galipaud_ecologists_2014 [but cf. @zhang_model_2015]).

In ecology, information criteria were introduced by applied ecologists who were primarily interested in making the best possible predictions to inform conservation and management; they were less concerned with inference or quantifying the strength of underlying processes [@BurnAnde98;@burnham_model_2002;@johnsonModel2004a]. Rather than using information criteria as tools to identify the best predictive model, or to obtain the best overall (model-averaged) predictions, most current users of information-theoretic methods use them either to quantify variable importance, or, by multimodel averaging, to have their cake and eat it too --- to avoid either over- or underfitting while quantifying effects in multifactorial systems. These researchers encounter two problems, one conceptual and one practical.

The conceptual problem with model averaging reflects the original sin of unnecessarily discretizing a continuous world. Suppose we want to understand the effects of temperature and precipitation on biodiversity. The model-comparison or model-averaging approach would construct five models (enumerated in the code sketch below): a null model with no effects of either temperature or precipitation, two single-factor models, an additive model, and a full model allowing for interactions between temperature and precipitation. We would then fit all (or many) of these models and model-average their parameters. We might be doing this in an effort to get good predictions, or to test our confidence that we know the signs of particular effects (measured in the context of whatever processes are included in the reduced and the full models), but these are only means to an end, and we shouldn't fool ourselves into thinking that we are using the method of multiple working hypotheses. For example, Chamberlin (1897, reprinted as @raup_method_1995) argued that in teaching about the origin of the Great Lakes we should urge students "to conceive of three or more great agencies [pre-glacial erosion, glacial erosion, crust deformation] working successively or simultaneously, and to estimate how much was accomplished by each of these agencies." Chamberlin was not suggesting that we test which individual mechanism or combination of mechanisms fits the data best (in whatever sense), but instead that we acknowledge that the world is multifactorial.
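Writing the candidate set out explicitly makes its discreteness obvious; in this sketch the data frame `div` and its variables are hypothetical.

```r
## The five discrete candidate models implied by the temperature/precipitation
## example; 'div', 'richness', 'temperature', and 'precipitation' are hypothetical.
m0   <- lm(richness ~ 1,                           data = div)  # null
mT   <- lm(richness ~ temperature,                 data = div)  # temperature only
mP   <- lm(richness ~ precipitation,               data = div)  # precipitation only
mTP  <- lm(richness ~ temperature + precipitation, data = div)  # additive
mTxP <- lm(richness ~ temperature * precipitation, data = div)  # with interaction

## the candidate set is then compared or averaged, even though only the last
## model represents what we actually believe about the system
AIC(m0, mT, mP, mTP, mTxP)
```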

The technical problem with model averaging is its computational inefficiency. Individual models can take minutes or hours to fit, and we may have to fit dozens or scores of sub-models in the multi-model averaging process. There are efficient tools available for fitting "right-sized" models that avoid many of the technical problems of model averaging. Penalized methods such as ridge and lasso regression [@dahlgren_alternative_2010] are well known in some scientific fields; in a Bayesian setting, informative priors centered at zero have the same regularizing effect --- pushing weak effects toward zero and controlling model complexity (more or less synonymous with the shrinkage of estimates described above) [@lemoineMoving2019a]. Developed for optimal (predictive) fitting in models with many parameters, penalized models have well-understood statistical properties; they avoid the pitfalls of model-averaging correlated or nonlinear parameters; and, by avoiding the need to fit many sub-models in the model-averaging process, they are much faster.^[Although they often require a computationally expensive cross-validation step in order to choose the degree of penalization.]
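As a rough sketch of the alternative, continuing the hypothetical temperature/precipitation example, a single penalized fit with the glmnet package replaces the whole set of submodels:

```r
## Fit one penalized model instead of averaging many discrete submodels;
## 'div', 'richness', 'temperature', and 'precipitation' are hypothetical.
library(glmnet)

X <- model.matrix(~ temperature * precipitation, data = div)[, -1]  # drop intercept column
y <- div$richness

ridge <- glmnet(X, y, alpha = 0)  # ridge: shrinks all coefficients toward zero
lasso <- glmnet(X, y, alpha = 1)  # lasso: can set some coefficients exactly to zero

coef(ridge, s = 0.1)              # coefficients at an arbitrary penalty (lambda = 0.1)
```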

Here I am not concerned with whether 'truth' is included in our model set (it isn't), or with how this matters to our inference [@bernardoBayesian1994; @barker_truth_2015]. I am claiming the opposite: our full model is usually as close to truth as we can get; we don't really believe any of the less complex models. If we are trying to get the best predictions, or to compare the strength of various processes in a multifactorial context, there may be better ways to do it. In situations where we really want to compare qualitatively different, non-nested hypotheses [@luttbeg_comparing_2004], AIC or BIC or any appropriate model-comparison tool is fine; however, if the models are really qualitatively different, perhaps we shouldn't be trying to merge them by averaging, unless prediction is our only goal.

Penalized models have their own challenges. A big advantage of information-theoretic methods is that, like wrapper methods for feature selection in machine learning [@chandrashekarSurvey2014], we can use model averaging as long as we can fit the component models and extract their log-likelihoods and numbers of parameters --- we never need to build new software. Although powerful computational tools exist for fitting penalized versions of linear and generalized linear models (e.g. the glmnet package for R) and mixed models (glmmLasso), software for some more exotic models (e.g. zero-inflated models, quantile regressions, models for censored data) may not be readily available. Penalized models also require the user to choose the degree of penalization. This process is conveniently automated in tools like glmnet, but correctly assessing out-of-sample accuracy (and hence the correct level of penalization) is tricky for data that are correlated in space or time [@wenger_assessing_2012; @robertsCrossvalidation2016].
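For example, here is a minimal sketch of choosing the penalty by cross-validation, crudely blocking the folds by a hypothetical `site` grouping variable to reduce the effect of spatial non-independence:

```r
## Choose the penalty strength by cross-validation, keeping observations from
## the same (hypothetical) site together in one fold as a crude guard against
## spatial non-independence.
library(glmnet)

cvfit <- cv.glmnet(X, y, alpha = 1,
                   foldid = as.integer(factor(div$site)))
cvfit$lambda.1se                  # a conservative choice of penalty
coef(cvfit, s = "lambda.1se")     # coefficients at that penalty
```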

Finally, frequentist inference (computing $p$-values and CIs) for parameters in penalized models --- one of the basic outputs we want from a statistical analysis of a multifactorial system --- is a current research problem; statisticians have proposed a variety of methods [@potscherConfidence2010a;@javanmard_confidence_2014;@lockhart_significance_2014;@taylorPostselection2018], but they are far from being standard options in software. Scientists should encourage their friends in statistics and computer science to build tools that make penalized approaches easier to use.

Statisticians derived confidence intervals for ridge regression long ago [@obenchain_classical_1977] --- surprisingly, they are identical to the confidence intervals one would have gotten from the full model without penalization! @wangInterval2013b similarly proved that model-averaging CIs derived as suggested by @hjortFrequentist2003 are asymptotically (i.e. for arbitrarily large data sets) equivalent to the CIs from the full model. Analytical and simulation studies [@turek2012model;@fletcher2012model;@turek2013frequentist;@turek2015comparison;@kabaila_model-averaged_2016;@dormann_model_2018] have shown that a variety of alternative methods for constructing CIs are overoptimistic, i.e. that they generate too-narrow confidence intervals with coverage lower than the nominal level. Simulations from several of the studies above show that MMA confidence intervals constructed according to the best-known procedures typically include the true parameter values only about 80% or 90% of the time. In particular, @kabaila_model-averaged_2016 argue that it will be very difficult to construct model-averaged CIs that take advantage of shrinkage but still achieve correct coverage. (The only examples I have been able to find of MMA confidence intervals with close to nominal coverage are from Chapter 5 of @burnham_model_2002.) In short, it seems difficult to find model-averaged confidence intervals that compete successfully with the standard confidence interval based on the full model.
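A toy simulation along these lines is easy to run; the sketch below (all settings arbitrary) estimates how often 95% CIs from the full model, and from MuMIn-style model averaging over all submodels, cover a known coefficient:

```r
## Toy coverage check: simulate data from a known two-predictor model, then
## count how often the 95% CI for beta1 (from the full model, and from model
## averaging over all submodels) contains the true value.
library(MuMIn)
set.seed(1)
nsim <- 200; n <- 50; beta1 <- 0.3

cover_full <- cover_avg <- logical(nsim)
for (i in seq_len(nsim)) {
    x1 <- rnorm(n); x2 <- rnorm(n)
    y  <- beta1 * x1 + 0.2 * x2 + rnorm(n)
    full <- lm(y ~ x1 + x2, na.action = na.fail)
    ci_f <- confint(full)["x1", ]
    ci_a <- confint(model.avg(dredge(full)))["x1", ]
    cover_full[i] <- ci_f[1] < beta1 && beta1 < ci_f[2]
    cover_avg[i]  <- ci_a[1] < beta1 && beta1 < ci_a[2]
}
mean(cover_full)   # should be close to the nominal 0.95
mean(cover_avg)    # compare with the coverage values reported above
```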

Free lunches do not exist in statistics, any more than anywhere else. We can use penalized approaches to improve prediction accuracy without having to sacrifice any input variables (by trading bias for variance), but the only known ways to gain statistical power for testing hypotheses, or to narrow our uncertainty about our predictions, are to limit the scope of our models a priori [@harrell_regression_2001], to add information from pre-specified Bayesian priors (or equivalent regularization procedures), or to collect more data. @burnhamMultimodel2004b defined a "savvy" prior that reproduces the results of AIC-based multimodel averaging in a Bayesian framework, but it is a weak conceptual foundation for understanding multifactorial systems. Because it is a prior on discrete models, rather than on the magnitude of continuous parameters that describe the strength of different processes, it induces a spike-and-slab type marginal prior on parameters that assigns a positive probability to the unrealistic case of a parameter being exactly zero; furthermore, the prior will depend on the particular set of models being considered.
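Schematically (the notation here is mine, not from @burnhamMultimodel2004b), a spike-and-slab marginal prior on a coefficient $\beta_j$ has the form

$$p(\beta_j) = \pi_0\,\delta_0(\beta_j) + (1-\pi_0)\,g(\beta_j),$$

where $\delta_0$ is a point mass at zero, $g$ is a continuous ("slab") distribution, and $\pi_0$ --- determined implicitly by the candidate model set and its prior weights --- is the prior probability that the effect is exactly zero.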

Multimodel averaging is probably most popular in ecology (Google Scholar returns $\approx$ 60,000 hits for "multimodel averaging" alone and 30,000 for "multimodel averaging ecology"). However, multifactorial systems --- and the problems of approaching inference through comparing and combining discrete models that consider artificially limited subsets of the processes we know are operating --- occur throughout the sciences of complexity, those involving biological and human processes. In psychology, economics, sociology, epidemiology, ecology, and evolution, every process that we can imagine has some influence on the outcomes that we observe. Pretending that some of these processes are completely absent can be a useful means to an inferential or computational end, but it is (almost) never what we actually believe about the system. We should not let this useful pretense become our primary statistical focus.

If we have sensible scientific questions and good experimental designs, muddling through with existing techniques will often give reasonable results [@murtaugh_performance_2009]. But researchers should at least be aware that the roundabout statistical methods they currently use to understand multifactorial systems were designed for prediction, or for the comparison of discrete hypotheses, rather than for quantifying the relative strength of simultaneously operating processes. When prediction is the primary goal, penalized methods can work better (faster and with better-understood statistical properties) than multimodel averaging. When estimating the magnitude of effects or judging variable importance, penalized or Bayesian methods may be appropriate --- or we may have to go back to the difficult choice of focusing on a restricted number of variables for which we have enough data to fit and interpret the full model.

Acknowledgements

Thanks to Jonathan Dushoff for conversations on these topics over many years. Dana Karelus, Daniel Turek, and Jeff Walker provided useful input; Noam Ross encouraged me to finally submit the paper; Tara Bolker gave advice on straw men. This work was supported by multiple NSERC Discovery grants.

References