Provide any option for inference? #512

Open
tomwenseleers opened this issue Jun 6, 2023 · 12 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments


tomwenseleers commented Jun 6, 2023

More of a feature request: do you think it might be possible to implement some form of statistical inference in abess? For example, confidence intervals on coefficients via bootstrapping? Another option would be to repeatedly bootstrap the dataset and refit an abess model on each bootstrapped dataset, take the union of selected variables across all fits, repeat until the size of that union no longer grows, and then refit a single regular GLM on this union of selected variables using base R's glm function. (I don't know if something like this has ever been suggested in the literature; I thought I had seen a suggestion along these lines in one of the Tibshirani articles on selective inference, but I can't seem to find it now.)
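A minimal base R sketch of this bootstrap-union idea (here `select_fun` is a hypothetical stand-in for an abess fit, and the naive stopping rule is exactly the one described above):

```r
# Sketch of the bootstrap-union idea (assumption: select_fun stands in for
# an abess fit; here it just keeps the k columns most correlated with y).
set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 2] + rnorm(n)

select_fun <- function(X, y, k = 5) order(-abs(cor(X, y)))[1:k]

union_sel <- integer(0)
repeat {
  idx <- sample(n, replace = TRUE)                   # bootstrap resample
  new_union <- union(union_sel, select_fun(X[idx, ], y[idx]))
  if (length(new_union) == length(union_sel)) break  # union stopped growing
  union_sel <- new_union
}

# Refit a single ordinary GLM on the union of selected variables.
refit <- glm(y ~ X[, union_sel, drop = FALSE])
```

In practice one would want a minimum number of bootstrap replicates before applying the stopping rule, since the union can stall by chance on a single draw.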

Mamba413 added the "enhancement" (New feature or request) label on Jun 6, 2023
Mamba413 (Collaborator) commented Jun 6, 2023

Thank you for your constructive comment!

I have quickly read some statistical inference papers on high-dimensional variable selection. One method is based on the residual bootstrap method (DOI: 10.1198/jasa.2011.tm10159), which can provide a confidence interval for a sparse regression coefficient estimate.

Another method aims to find a collection of support sets that covers the true effective variable set with high confidence (https://arxiv.org/pdf/2209.09299.pdf). The bootstrap method you mentioned is also discussed in Section 4.1 of that paper. If you are interested, we would be happy to share our code with you once we have successfully implemented it.

tomwenseleers (Author) commented

Many thanks! If you would like to offer bootstrap-based inference, one possibility could be to register your package with the marginaleffects package, which already implements bootstrap-based inference (https://vincentarelbundock.github.io/marginaleffects/articles/bootstrap.html#bootstrap) and has an extension mechanism (https://vincentarelbundock.github.io/marginaleffects/articles/extensions.html); its author is always quite happy to implement extra features. That could be nice if you would like to allow users to calculate post-hoc contrasts and the like. If all you want to show is the significance of each individual coefficient relative to zero, then what you suggest above would of course be enough.

Mamba413 (Collaborator) commented Jun 7, 2023

I have read the documentation for the marginaleffects package (https://vincentarelbundock.github.io). It is a great package with detailed tutorials, and I largely agree that we could implement bootstrapping through it. However, I am not entirely certain whether calculating post-hoc contrasts is of great practical help (I apologize if this is due to my lack of familiarity with this field). Our team will discuss this point internally, which may take some time, but we will get back to you as soon as possible.

tomwenseleers (Author) commented Jun 8, 2023

Linking a post that may be relevant: https://stats.stackexchange.com/questions/617990/inference-for-high-dimensional-models-based-on-running-a-glm-on-the-union-of-s?noredirect=1#comment1148780_617990, along with some quick tests of doing inference by fitting a regular GLM on the union of variables selected over many bootstrapped datasets: https://github.com/tomwenseleers/L0glm/blob/master/benchmarks/benchmarks%20multichannel%20deconvolution%20gaussian%20L0glm%20abess%20glmnet.R. In the end, what works best is to select the set of variables that appear in at least 30% of the best-subset fits on bootstrapped data, similar to what is done in the bolasso to achieve variable-selection consistency (https://arxiv.org/abs/0804.1302), though there the threshold is typically taken much higher, e.g. 0.9, leading to slower convergence to the true set. Such an approach might be convenient because it avoids having to interface the package with marginaleffects, emmeans, and the like. And maybe you are right that people fitting models with tens of thousands of variables will typically not be the ones making effect plots of particular factors in their model, so basic inference on the model coefficients is probably sufficient. Things might be different for people doing best-subset selection on models with hundreds of variables and best subsets of fewer than ten contributing variables, or in cases where the solution is extremely sparse: datasets with tens of thousands of variables but only a couple of contributing variables.
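The 30% selection-frequency rule described above can be sketched in base R as follows (the threshold of 0.3 and the simple correlation-based `select_fun` are illustrative assumptions; `select_fun` again stands in for an abess fit):

```r
# Bolasso-style selection-frequency thresholding (sketch, not an abess feature).
set.seed(1)
n <- 100; p <- 50; B <- 50; thresh <- 0.3
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 2] + rnorm(n)

select_fun <- function(X, y, k = 5) order(-abs(cor(X, y)))[1:k]

counts <- integer(p)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)      # bootstrap resample
  sel <- select_fun(X[idx, ], y[idx])
  counts[sel] <- counts[sel] + 1        # tally selection frequency
}
# Keep variables selected in at least 30% of the bootstrap fits,
# then refit a regular GLM on that stable set.
stable_set <- which(counts / B >= thresh)
refit <- glm(y ~ X[, stable_set, drop = FALSE])
```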

AnrWang (Collaborator) commented Jun 16, 2023

Thank you for your suggestion regarding statistical inference in abess. We found the methods you mentioned to be relatively effective. Additionally, we have developed a method that combines abess with the technique proposed in (https://arxiv.org/pdf/2209.09299.pdf), and we have obtained promising results. We have attached a screenshot of the output for your reference.

[screenshot of the example output]

Furthermore, we would like to share our code with you. We hope this will be useful for users who require advanced statistical inference features. Thank you once again for your valuable input.

## library abess
library(abess)

## functions

# The following is a function to determine whether a column vector is in a matrix.
# If the column vector is in the matrix, this function returns its position.

vec_in_matrix <- function(mat, vec) {
  k_max <- ncol(mat)
  flag <- 0
  k <- 1
  while(!flag & k <= k_max) {
    if(!sum(mat[, k] != vec)) flag <- k
    k <- k + 1
  }
  return(flag)
}

# Below is a function that takes input samples and returns a set of candidate
# support sets that, with high probability, contains the actual support set.
# Each candidate support set is stored as a 0/1 indicator column of the output matrix.

Search_Candidate_Models <- function(x, y, cyc=100, tune="cv") {
  n <- length(y)
  p <- ncol(x)
  Sd_abess <- matrix(2, p, 1)  # sentinel column of 2s; dropped before returning
  for(i in 1:cyc) {
    x_rs <- cbind(x, rnorm(n))
    abess_fit <- abess(x_rs, y, always.include = p + 1, tune.type = tune) 
    beta_j <- as.numeric(as.matrix(abess_fit$beta[-(p + 1), paste(abess_fit$best.size)]) != 0) 
    if(!vec_in_matrix(Sd_abess, beta_j)) Sd_abess <- cbind(Sd_abess, beta_j)
  }
  Sd_abess <- Sd_abess[, -1]
  return(as.matrix(Sd_abess))
}

# Below is a function that computes the p-value for each support set in the candidate models.

p_value_of_tua <- function(x, y, tua, cyc_p_value=100, tune="cv") {
  n <- length(y)
  p <- ncol(x)
  if(!sum(tua)) return(0)
  x_tua <- x[, tua != 0]
  H <- x_tua %*% solve(t(x_tua) %*% x_tua) %*% t(x_tua)
  a_obs <- H %*% y
  b_obs <- sqrt(sum(((diag(n) - H) %*% y) ^ 2))
  size_tua <- sum(tua)
  Tua_abess <- matrix(2, p, 1)
  times_Tua_abess <- c()
  abess_tua <- as.numeric(as.vector(abess(x, y, support.size = size_tua, tune.type = tune)$beta) != 0)
  for(j in 1:cyc_p_value) {
    u_j <- (diag(n) - H) %*% rnorm(n)
    y_j <- a_obs + b_obs / sqrt(sum((u_j) ^ 2)) * u_j
    abess_tua_j <- as.numeric(as.vector(abess(x, y_j, support.size = size_tua, tune.type = tune)$beta) != 0)
    where_tua <- vec_in_matrix(Tua_abess, abess_tua_j)
    if(where_tua) {
      times_Tua_abess[where_tua - 1] <- times_Tua_abess[where_tua - 1] + 1
    } else {
      Tua_abess <- cbind(Tua_abess, abess_tua_j)
      times_Tua_abess <- c(times_Tua_abess, 1)
    }
  }
  where_tua <- vec_in_matrix(Tua_abess, abess_tua)
  if(where_tua) {
    p_value_abess <- sum(times_Tua_abess[times_Tua_abess <= times_Tua_abess[where_tua - 1]]) / cyc_p_value
  } else {
    p_value_abess <- 0
  }
  return(p_value_abess)
}

# Below is a function that obtains the confidence set for support sets with confidence level of 1-alpha.
# The support sets are stored in the output matrix as column vectors.

Confidence_set <- function(x, y, Sd, alpha=0.05, cyc_p_value=100, tune="cv") {
  p <- nrow(Sd)  # number of variables (previously taken from the global environment)
  gam <- matrix(2, p, 1)
  for(i in 1:ncol(Sd)) {
    if(p_value_of_tua(x, y, Sd[, i], cyc_p_value = cyc_p_value, tune = tune) >= alpha) {
      gam <- cbind(gam, Sd[, i])
    } 
  }
  gam <- gam[, -1]
  return(as.matrix(gam))
}

# The following is the function to calculate the confidence interval of the i-th variable.

confidence_interval <- function(x, y, Sd, i, alpha=0.05) {
  ci <- NULL
  set <- which(Sd[i,] != 0)
  rname <- paste0("`", i, "`")
  for(j in 1:length(set)) {
    df <- as.data.frame(cbind(y, x[, Sd[, set[j]] != 0]))
    colnames(df) <- c("y", paste(which(Sd[, set[j]] != 0)))
    lm_model <- lm(y ~ ., data = df)
    fit <- confint(lm_model, level = 1 - alpha)
    if(is.null(ci)) {
      ci <- fit[rname, ]
    } else {
      if(fit[rname, 1] < ci[1]) ci[1] <- fit[rname, 1]
      if(fit[rname, 2] > ci[2]) ci[2] <- fit[rname, 2]
    }
  }
  if(is.null(ci)) ci <- c(0, 0)
  return(ci)
}


## Next is an example.

n <- 20
p <- 100
support.size <- 3
alpha <- 0.05

dataset <- generate.data(n, p, support.size)
Sd <- Search_Candidate_Models(dataset[["x"]], dataset[["y"]])
cat("The size of the candidate set:", ncol(Sd), 
    "\nWhether the actual support set is in the candidate set:", vec_in_matrix(Sd, dataset[["beta"]] != 0) != 0)

gam <- Confidence_set(dataset[["x"]], dataset[["y"]], Sd, alpha)
cat("\nThe size of the confidence set:", ncol(gam), 
    "\nWhether the actual support set is in the confidence set:", vec_in_matrix(gam, dataset[["beta"]] != 0) != 0)

active_set <- which(dataset[["beta"]] != 0)
for(i in 1:support.size) {
  ci <- confidence_interval(dataset[["x"]], dataset[["y"]], Sd, active_set[i], alpha = alpha)
  cat("\nThe value of the ", active_set[i], "-th variable:", dataset[["beta"]][active_set[i]], 
      "\tThe confidence interval of this variable:[", ci[1], ",", ci[2], "]")
}

DarioS commented Oct 20, 2023

When will confidence intervals be available? I showed results to a heart disease clinician and she asked me about them.

AnrWang (Collaborator) commented Oct 20, 2023

Thank you for your inquiry; it is an excellent question. In the algorithm above, we first build a candidate set of support sets that likely includes the true support set (the function Search_Candidate_Models in the code above). Then, for each support set in the candidate set, confidence intervals for the parameters are calculated and their union is taken (the functions Confidence_set and confidence_interval in the code above). The resulting confidence intervals can be guaranteed at a certain level of confidence.

tomwenseleers (Author) commented

Do you think it might also be an option to calculate the variance-covariance matrix from the Fisher information matrix of the selected variables, repeat this for fits on bootstrapped data (e.g. 20 times), pad each vcov matrix with zeroes for variables not selected in that particular fit, and then average these sparse vcov matrices? SEs and CIs on the coefficients averaged over all bootstrapped datasets could then be calculated from this average vcov matrix.

Mamba413 (Collaborator) commented

Hi @tomwenseleers, thank you for your input. Could you please provide corresponding statistical references for the method you mentioned above? Perhaps we should first thoroughly review the literature before providing you with a more definitive response.

tomwenseleers (Author) commented

@Mamba413 No, I haven't seen this exact method used, perhaps mainly because until recently we didn't have access to good L0-penalized GLM fitting engines. But it seems like a very simple method that could logically work, though it would need to be verified numerically that the procedure ends up having the correct coverage. The method would entail, for every abess fit on a bootstrapped dataset, calculating the Fisher information matrix for the selected variables as FIM = (1/dispersion) * Xw'Xw, where Xw = sqrt(weights) * X with weights the working weights. The variance-covariance matrix would then be vcov = solve(FIM), and the SEs of the coefficients on the link scale would be sqrt(diag(vcov)). If some variables are not selected in a given fit, I would think their covariances should be estimated as zero. If there were no uncertainty about the true support, one could do this just for the selected variables and get away without averaging the vcov matrices from fits on bootstrapped data; the SEs, CIs, and p-values would then be conditional on the inferred support being correct, and the approach would amount to doing a GLM on the abess-selected subset. If there is uncertainty about the true support, averaging the vcov matrices would seem a logical way to take that uncertainty into account (it assumes that with a certain probability the covariance with a given parameter is estimated as zero, while in other cases it is estimated as positive). The question, of course, is how many bootstrap replicates one would need; perhaps that could be guided by the point at which the size of the union of selected variables starts to reach an asymptote. Plain bootstrapping would also be possible, i.e. just running the abess fit 1000 times, but that would be slow, and with an approach like this I could imagine getting away with maybe just 20 bootstrap replicates.
A quick numeric test could show whether this method gives approximately the correct coverage.
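For the Gaussian case, the proposal can be sketched in base R roughly as follows (the correlation-based selector is a hypothetical stand-in for abess, the selected-set size of 3 is arbitrary, and the coverage of the resulting intervals is unverified):

```r
# Sketch of the vcov-averaging idea (Gaussian case, base R only).
set.seed(1)
n <- 100; p <- 20; B <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - X[, 2] + rnorm(n)

avg_vcov <- matrix(0, p, p)
avg_coef <- numeric(p)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)
  Xb <- X[idx, ]; yb <- y[idx]
  sel <- order(-abs(cor(Xb, yb)))[1:3]   # stand-in for abess selection
  fit <- lm(yb ~ Xb[, sel] - 1)
  s2 <- summary(fit)$sigma^2             # dispersion estimate
  FIM <- crossprod(Xb[, sel]) / s2       # Fisher information on the selected set
  V <- matrix(0, p, p)                   # pad unselected entries with zeros
  V[sel, sel] <- solve(FIM)              # vcov = FIM^{-1}
  avg_vcov <- avg_vcov + V / B
  b_full <- numeric(p); b_full[sel] <- coef(fit)
  avg_coef <- avg_coef + b_full / B
}
se <- sqrt(diag(avg_vcov))               # SEs for the averaged coefficients
ci_lo <- avg_coef - 1.96 * se
ci_hi <- avg_coef + 1.96 * se
```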

Mamba413 (Collaborator) commented

@tomwenseleers Thank you for providing us with a comprehensive explanation of your proposed method. After conducting a thorough search based on your description, we were unable to find any relevant literature discussing this approach. Our team has extensively deliberated on this internally, and ultimately, we have decided not to allocate additional resources for the implementation and maintenance of this feature. This decision is primarily due to the fact that we often find ourselves needing to respond to user inquiries regarding the feasibility of such methods, yet our understanding in this regard remains somewhat unclear.

Considering that the implementation of this method does not pose significant challenges, we wholeheartedly welcome you and your team to build upon our work, conduct experiments or theoretical analyses, and publish your research findings in relevant journals. When the time comes, we would be delighted to explore various ways to seamlessly integrate this method into abess.

Lastly, we would like to express our sincere gratitude for your active usage of abess and your proactive inquiries. Your engagement is greatly appreciated.

Mamba413 added the "help wanted" (Extra attention is needed) label on Oct 24, 2023
tomwenseleers (Author) commented Oct 24, 2023

No worries - I have some bioinformatics master students working on this exact topic to figure out what approach would or would not work... I'll let you know once we find out... In general, I think having provisions for inference, including for high dimensional inference, would be a very worthwhile and much requested feature. But it is true it would ideally need some theory...
