Refactor predictInterval #106

Open
jknowles opened this issue Jun 1, 2019 · 2 comments

jknowles (Owner) commented Jun 1, 2019

The time has come to refactor predictInterval. It should always have been multiple functions, since each piece of the prediction interval can be sampled independently. To make future maintenance easier and unit testing more straightforward, the function should be broken up into the following components:

predict_setup() - does the work of getting the merMod pieces needed to make intervals
sample_re() - creates the random component of the prediction intervals
sample_fe() - creates the fixed component of the prediction intervals
sample_model() - creates the model-fit error component of the prediction intervals (if the user requests it)

predictInterval() is then simply a function that calls these pieces and combines the results into different configurations depending on what the user requests.
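
Roughly, the skeleton I'm picturing is something like this - hypothetical names and simplified bodies, not a final API, and MASS::mvrnorm is just a stand-in for whatever sampler we end up using:

    # Hypothetical skeleton of the split -- a sketch only, not final code
    library(lme4)

    predict_setup <- function(merMod, newdata, n.sims = 1000) {
      # pull out everything the samplers need from the merMod, once
      list(
        fe_est  = fixef(merMod),
        fe_vcov = as.matrix(vcov(merMod)),
        re      = ranef(merMod, condVar = TRUE),
        sigma   = sigma(merMod),
        X       = model.matrix(delete.response(terms(merMod)), newdata),
        n.sims  = n.sims
      )
    }

    sample_fe <- function(setup) {
      # fixed component: coefficient draws from MVN(fe_est, fe_vcov)
      betas <- MASS::mvrnorm(setup$n.sims, setup$fe_est, setup$fe_vcov)
      setup$X %*% t(betas)  # n_obs x n.sims matrix of draws
    }

    sample_re <- function(setup) {
      # random component: draws from the conditional distribution of the
      # random effects; zeros here as a placeholder for the real body
      matrix(0, nrow(setup$X), setup$n.sims)
    }

    sample_model <- function(setup) {
      # model-fit error component, if the user requests it
      matrix(rnorm(nrow(setup$X) * setup$n.sims, 0, setup$sigma),
             nrow(setup$X), setup$n.sims)
    }

    predictInterval <- function(merMod, newdata, n.sims = 1000, ...) {
      setup <- predict_setup(merMod, newdata, n.sims)
      yhat <- sample_fe(setup) + sample_re(setup) + sample_model(setup)
      yhat  # real version would collapse this to fit/upr/lwr per options
    }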

I don't think the sub-functions should be exported; predictInterval() should always be the function the user interacts with, even when they only request part of the prediction interval. But the split will make unit testing easier moving forward, and it should make it easier to adapt to changes in lme4 if they come up, or to incorporate new model types as in issue #101.

What say you @carlbfrederick - does this sound like a good plan? Any feelings on names/conventions/structure? This is mostly off the top of my head.

humana commented Jun 2, 2022

Hi @jknowles,

I have my own version of predictInterval that I butchered so I could use it for online predictions behind an API, so I needed it to be very fast (< 1 second per call). Would the proposed predict_setup() do all the calls to ranef() and store those results, so that predictInterval itself never has to call them again? That's where I found most, if not all, of the time being spent. E.g., I have something like this in my "predict_setup()":

    # cache the ranef() results once so predictInterval never recomputes them
    cache_obj$model.ranef <- ranef(model.obj)
    cache_obj$model.ranef_condvar <- ranef(model.obj, condVar = TRUE)

And I pass that cache_obj back into predictInterval, and wherever it was calling ranef() I just made it reference the cached version.
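
For reference, the call pattern on my side looks roughly like this - cached_predictInterval() stands in for my own modified copy and is not part of merTools:

    library(lme4)
    model.obj <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)

    # build the cache once, at API startup
    cache_obj <- list(
      model.ranef         = ranef(model.obj),
      model.ranef_condvar = ranef(model.obj, condVar = TRUE)
    )

    # every request then reuses the cache instead of re-running ranef();
    # cached_predictInterval() is my own modified copy, not merTools' API
    preds <- cached_predictInterval(model.obj,
                                    newdata = sleepstudy[1:5, ],
                                    cache   = cache_obj)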

Could I make one additional suggestion: allow the level parameter to be a vector, so you can have a different interval for each row. This would also make it work the same as predict.lm and other predict functions in R, where the level can be given as a vector. I have done this in my own version of predictInterval; it takes about 10 lines of code.

So basically predictInterval(model, newdata, level = c(0.8, 0.9, 0.7)), or, more likely in practice, predictInterval(model, newdata, level = newdata$level).

E.g.:

  # extend functionality to support a different level per row of newdata
  if (length(level) == 1) {
    level <- rep(level, nrow(newdata))
  } else if (length(level) != nrow(newdata)) {
    stop("length of level must be 1 or nrow(newdata)")
  }

<snip>

  # Output prediction intervals, now with per-row upCI/loCI vectors
  if (stat.type == "median") {
    # was: outs[, 1:3] <- t(apply(yhat, 1, quantile, prob = c(0.5, upCI, loCI),
    #                             na.rm = TRUE))
    outs[, 1:3] <- t(sapply(seq_len(nrow(yhat)), function(i)
      quantile(yhat[i, ], prob = c(0.5, upCI[i], loCI[i]), na.rm = TRUE)))
  }
  if (stat.type == "mean") {
    outs$fit <- apply(yhat, 1, mean, na.rm = TRUE)
    # was: outs[, 2:3] <- t(apply(yhat, 1, quantile, prob = c(upCI, loCI),
    #                             na.rm = TRUE))
    outs[, 2:3] <- t(sapply(seq_len(nrow(yhat)), function(i)
      quantile(yhat[i, ], prob = c(upCI[i], loCI[i]), na.rm = TRUE)))
  }

<snip>

      if (stat.type == "median") {
        # was: pi.comps[[i]] <- t(apply(pi.comps[[i]], 1, quantile,
        #                               prob = c(0.5, upCI, loCI), na.rm = TRUE))
        tmp <- pi.comps[[i]]
        pi.comps[[i]] <- t(sapply(seq_len(nrow(tmp)), function(x)
          quantile(tmp[x, ], prob = c(0.5, upCI[x], loCI[x]), na.rm = TRUE)))
        pi.comps[[i]] <- as.data.frame(pi.comps[[i]])
        names(pi.comps[[i]]) <- c("fit", "upr", "lwr")
      }
      if (stat.type == "mean") {
        tmp <- pi.comps[[i]]
        pi.comps[[i]] <- data.frame("fit" = rep(NA, N), "upr" = NA,
                                    "lwr" = NA)
        pi.comps[[i]]$fit <- apply(tmp, 1, mean, na.rm = TRUE)
        # was: pi.comps[[i]][, 2:3] <- t(apply(tmp, 1, quantile,
        #                                      prob = c(upCI, loCI), na.rm = TRUE))
        pi.comps[[i]][, 2:3] <- t(sapply(seq_len(nrow(tmp)), function(x)
          quantile(tmp[x, ], prob = c(upCI[x], loCI[x]), na.rm = TRUE)))
      }

jknowles (Owner, Author) commented Jun 2, 2022

This is great - caching the random effects is definitely what I had in mind. Depending on the size of the model, we may even be able to get more speed by caching the simulated matrix instead of generating it each time in sample_re(), for example, but I'd need to do some testing to find the tipping point there.
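
As a rough sketch of what I mean - simulate_re_draws() and the setup list here are stand-ins for the real simulation code, and the cache is just an environment:

    # sketch: stash simulated random-effect draws in an environment so
    # repeat calls can skip re-simulation when the model/n.sims are unchanged
    simulate_re_draws <- function(setup) {
      # stand-in for the real simulation: one row per obs, one column per sim
      matrix(rnorm(setup$n_obs * setup$n.sims), setup$n_obs, setup$n.sims)
    }

    sample_re <- function(setup, cache = NULL) {
      if (!is.null(cache) && !is.null(cache$re_sims)) {
        return(cache$re_sims)  # cache hit: reuse earlier draws
      }
      draws <- simulate_re_draws(setup)
      if (!is.null(cache)) cache$re_sims <- draws  # environments are mutable,
                                                   # so this persists
      draws
    }

    setup <- list(n_obs = 100, n.sims = 1000)
    re_cache <- new.env()
    d1 <- sample_re(setup, re_cache)  # simulates
    d2 <- sample_re(setup, re_cache)  # cache hit, no re-simulation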

I like the idea of making the intervals a vector. The original use case for predictInterval was generating predictions from very large models with millions of rows and hundreds of thousands of levels, so adding per-row interval levels wasn't on my mind. But RAM and CPU speed now mean we don't need to worry about that as much, and I can see the value in many places where people want to compare uncertainty levels or visualize them.

Thanks for the suggestions!
