Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature request – predict.brms: allow new values for random effects fit with splines #1554

Open
kat3stewart opened this issue Oct 27, 2023 · 8 comments
Labels

Comments

@kat3stewart
Copy link

The predict.brmsfit() function includes an argument allow_new_levels, which allows one to predict on new data containing group levels that were not present in the original data used to fit the model. However, this functionality does not extend to models whose random effects are fit with splines (as stated in the comments section of this post: https://discourse.mc-stan.org/t/predict-error-with-nested-random-effect-structure/4817/13).

If I understand correctly, the work-around for this issue is to re-run the model without a spline on the random effect. However, in some cases, this is not an ideal situation (e.g., when someone is explicitly interested in explicitly modelling the non-linear effects of group levels). For me, personally, I would much prefer to keep the splines on my random effects. Thus, is it possible to add this functionality to the predict function?

@pminasandra
Copy link

Definitely. This feature would be extremely helpful!

@paul-buerkner
Copy link
Owner

Can you post a minimal reproducible example here to explain what you have in mind?

@kat3stewart
Copy link
Author

kat3stewart commented Oct 30, 2023

My model takes several weeks to run so it's likely more efficient if I privately send you the fit model as well as the data I would like to predict on. Would this be okay?

But, so others who view this post have an idea of the situation, I have trained the following model with 70% of my data:

m1 <- brm(bf(y|trials(1)~s(x1,k=6,bs="tp", m=2) + 
                s(x2, k=12, bs="cc",m=2) + 
                x3 + s(x2, by=x3, bs="cc", k=12, m=1) + 
                s(x4, bs="re") + s(x4, x2, bs="re")), 
           data=df, family=zero_inflated_binomial(link="logit",link_zi="logit"), 
           knots = list(x2 = c(0.5,12.5)),  prior=prior1, 
           cores=4, chains=4, thin=10, iter=4000, save_pars = save_pars(all = TRUE),
           control = list(adapt_delta=0.99), backend = "cmdstanr")

where y = a binomial response (1,0)
x1 and x2 = continuous predictors
x3 = a factor with 59 levels
x4 = >1000 individuals sampled monthly across the 59 levels of the factor

What I would now like to do is predict on the remaining 30% of data using predict.brms(). However, when I try to do this, I get the error that new levels are not allowed. This is because I have new individuals for my x4 variable. I have tried adding the argument allow_new_levels = TRUE, but I still get the same error. From my understanding, this is because I have fit the random effects with splines.

@paul-buerkner
Copy link
Owner

paul-buerkner commented Oct 30, 2023 via email

@kat3stewart
Copy link
Author

kat3stewart commented Oct 31, 2023

Apologies, it took me some time to figure out how to simulate data that captures the core features of my actual data set, but please find below a minimally reproducible example. The model was run with brms version 2.20.4; cmdstanr version 0.5.3 and rstan version 2.32.3.

### (1) Simulate predictor variables

library(tidyverse)

df <- data.frame(
  x1=rep(c(1,2,3),each=1200), 
  x2=rep(c(1:12),each=100,times=3),
  grp=rep(c(letters[1:10]),each=10,times=36),
  ind=as.character(rep(c(1:100),times=36))
)

# where:
# x1 = year
# x2 = month
# grp = group id
# ind = sampled individuals

#add unique individual id/ per year
df$id <- paste(df$ind, df$x1, sep="_")

### (2) Add binomial response variable with distinct seasonal trend per group

ids <- unique(df$id)     #list of unique individual-by-year ids
my_grps <- letters[1:10] #list of groups

#list of probabilities by group
probs <- data.frame(
  grp=my_grps,
  pr=c(0.2, 0.2, 0.3, 0.2, 0.4, 0.3, 0.2, 0.2, 0.6, 0.3) 
)

newdata=c() #create empty data frame to store loop output
count=1

for(i in 1:length(ids)){
  temp=df[which(df$id==ids[i]),]
  temp$resp <- 0 #add empty response column
  grp=unique(temp$grp)
  pr=probs$pr[which(probs$grp==grp)] #probability of getting a '1' based on group
  m=rbinom(n = 12, size = 1, prob = pr) #randomly draw monthly responses 
  s=as.numeric(sum(m)-1) #number of successes-1
  #create seasonal pattern of successes 
  p=as.numeric(unique(match(grp,my_grps))) #first month according to group letter
  v=rep(0,12) #vector of zeroes (1 per month)
  diff=(12-(p+s)) 
  d=ifelse(diff<0, (abs(diff)),0)
  v[p:((p+s)-d)] <- 1 #replace the 0s with 1s (start with first month 'p' and add 's' months)
  v[0:d] <- 1 #if applicable, add remaining sampled 1s to start of vector
  temp$resp=v #fill response column with values in 'v'
  newdata[count]=list(temp)
  count=count+1
}
df <- do.call(rbind, newdata)

### (3) Split data into training/validation sets
b <- sample(letters[1:10], replace=F, size=3) #randomly sample three grps
df70=df[!(df$grp %in% b),] #training set
df30=df[df$grp %in% b,]    #validation set
  
### (4) Run Model
library(brms)
library(rstan)
library(cmdstanr)

#set prior
prior1 <- c(prior(normal(0, 1.5), class = Intercept),
            prior(normal(0, 1.5), class = b),
            prior(exponential(1), class = sds))

options(mc.cores = parallel::detectCores())

#fit model
m1 <- brm(bf(resp|trials(1)~s(x1,k=3,bs="tp", m=2) + 
                s(x2, k=12, bs="cc",m=2) + 
                grp + s(x2, by=grp, bs="cc", k=12, m=1) + 
                s(ind, bs="re") + s(ind, x2, bs="re")), 
           data=df70, family=zero_inflated_binomial(link="logit",link_zi="logit"), 
           knots = list(x2 = c(0.5,12.5)),  prior=prior1, 
           cores=4, chains=4, thin=10, iter=2000, save_pars = save_pars(all = TRUE),
           control = list(adapt_delta=0.99), backend = "cmdstanr")

#(5) Predict on new data
pred1 <- predict(m1, newdata = df30, allow_new_levels=TRUE)

When I run the prediction, I get the error:
Error: New factor levels are not allowed.
Levels allowed: 'a', 'b', 'c', 'e', 'f', 'g', 'i'
Levels found: 'd', 'h', 'j'

m1.zip

Please find attached a zip file with the model object. Let me know if you have any issues opening it or need anything else. Cheers!

@paul-buerkner
Copy link
Owner

Thank you! This is way for detailed than I expected. I would have been happy with a much simpler example, but this of course works well too. I will take a look.

@paul-buerkner paul-buerkner added this to the 2.21.0 milestone Oct 31, 2023
@kat3stewart
Copy link
Author

Great, thank-you. I really appreciate it.

@sealavi
Copy link

sealavi commented Nov 17, 2023

I have also encountered multiple occasions recently where this feature would be extremely beneficial. Would be much appreciated if it could be implemented in the next release!

@paul-buerkner paul-buerkner removed this from the 2.21.0 milestone Mar 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants