
glmGamPoi scales cubically with the number of design matrix columns #41

Open
frederikziebell opened this issue Nov 28, 2022 · 5 comments


@frederikziebell

frederikziebell commented Nov 28, 2022

Hi Constantin,

when the number of design matrix columns increases (and with it the number of samples, for a fixed number of replicates per condition), glmGamPoi's runtime scales cubically. Consider the following code, where 3 genes are fitted with 3 replicates per condition and a varying number of conditions:

library("glmGamPoi")
library("DESeq2")
library("tidyverse")

# adapted from DESeq2
make_example_dds <- function(n_genes, n_replicates, n_conditions){

  dispMeanRel <-  function(x) 4/x + 0.1
  
  beta <- matrix(rnorm(n_genes*n_conditions, mean = 4, sd = 2), ncol = n_conditions)
  dispersion <- dispMeanRel(2^(beta[, 1]))
  colData <- DataFrame(condition = factor(rep(paste0("cond", 1:n_conditions), n_replicates)))
  x <- model.matrix.default(~colData$condition)
  mu <- t(2^(x %*% t(beta)))
  
  countData <- matrix(rnbinom(mu, mu = mu, size = 1/dispersion), ncol = ncol(mu))
  mode(countData) <- "integer"

  design <- as.formula("~ condition", env = .GlobalEnv)
  object <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = design)
  object
}

calc_runtime <- function(n_conditions){
  set.seed(1)
  n_replicates <- 3
  dds <- make_example_dds(n_genes = 3, n_replicates = n_replicates, n_conditions = n_conditions)
  time <- system.time(
    fit <- glm_gp(assay(dds), design = design(dds), col_data = colData(dds), size_factors = rep(1, n_replicates*n_conditions))
  )
  as.double(time["elapsed"])
}

n_conditions <- c(10,25,seq(50,600,50))
res_glm_gp <- tibble(
  n_conditions = n_conditions,
  runtime = map_dbl(n_conditions, calc_runtime)
)

# if the scaling is cubic, the cube root of the runtime should be linear in n_conditions
ggplot(res_glm_gp, aes(n_conditions, runtime^(1/3))) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)

[Figure: runtime^(1/3) versus n_conditions with a linear fit]

Could this be related to the QR decomposition of the design matrix and therefore not something that can be improved?
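To see how much of this comes from the factorisation alone, here is a minimal sketch (timing only base R's qr(), not glm_gp itself; the qr_time helper is just for illustration): with 3 replicates per condition, the design matrix has 3p rows and p columns, so a dense QR factorisation already costs on the order of p^3.

qr_time <- function(n_conditions, n_replicates = 3) {
  # same design as in make_example_dds: a single factor with n_conditions levels
  condition <- factor(rep(paste0("cond", 1:n_conditions), n_replicates))
  X <- model.matrix(~ condition)
  as.double(system.time(qr(X))["elapsed"])
}
map_dbl(c(100, 200, 400, 800), qr_time)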

@frederikziebell frederikziebell changed the title glmGamPoi scales cubically with the number of design columns glmGamPoi scales cubically with the number of design matrix columns Nov 28, 2022
@const-ae
Owner

Hey, thanks for documenting this. I am not terribly surprised, though, that glmGamPoi has cubic runtime complexity in the number of covariates, because the same is true for regular linear regression (https://math.stackexchange.com/a/84503/492945), and internally I use an iteratively reweighted least squares algorithm.

I would be curious if there is a clever trick to speed this up. I could imagine there are interesting pointers in the GWAS literature, because they have to fit gigantic linear models all the time :)
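To make the cubic step concrete, here is a minimal sketch of a single iteratively reweighted least squares update for a Gamma-Poisson GLM with log link (my own illustration of the general scheme, not glmGamPoi's internal code); forming and solving the p x p weighted normal equations is the part that scales cubically in the number of coefficients.

# one IRLS update for a Gamma-Poisson GLM with log link (illustration only)
irls_step <- function(X, y, beta, overdispersion) {
  eta <- drop(X %*% beta)
  mu  <- exp(eta)
  w   <- mu / (1 + overdispersion * mu)   # working weights
  z   <- eta + (y - mu) / mu              # working response
  XtWX <- crossprod(X, w * X)             # t(X) %*% diag(w) %*% X: O(n p^2)
  XtWz <- crossprod(X, w * z)
  drop(solve(XtWX, XtWz))                 # p x p solve: O(p^3)
}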

@frederikziebell
Author

From a practical standpoint, if there are many conditions, maybe one could subdivide the dataset (perhaps in an overlapping way, so that the control condition is in every subset) and hope for very similar dispersion estimates? That way, the linear runtime of fitting all the subsets would beat the cubic runtime of fitting the whole dataset.

@const-ae
Owner

Sorry, I'm a bit confused, and that's probably simply because I worked too long today 🙈
I estimate a single overdispersion per gene, so that part is independent of the number of covariates. The cubic scaling comes from the coefficient estimates, and I don't understand how subdividing helps there.
Best, Constantin

@frederikziebell
Author

frederikziebell commented Nov 30, 2022

Here is an example, using the make_example_dds function from above. A dataset with 201 conditions is fitted once with all conditions simultaneously, and once in 10 chunks that each contain the samples of condition 1 (so every chunk shares the same intercept) plus 20 additional conditions. Averaging the overdispersion estimates over the 10 chunks gives a good approximation of the overdispersion estimates of the full model:

set.seed(1)
n_genes <- 10
n_replicates <- 3
n_conditions <- 201
dds <- make_example_dds(n_genes, n_replicates, n_conditions)

fit <- glm_gp(assay(dds), design = design(dds), col_data = colData(dds), size_factors = rep(1, ncol(dds)))

beta_full <- fit$Beta
overdispersions_full <- fit$overdispersions

# split the samples of conditions 2 to 201 into 10 chunks of 20 conditions each
# and estimate parameters for each chunk separately (always adding condition 1),
# i.e. chunk 1: samples of conditions 1, 2, ..., 21
#      chunk 2: samples of conditions 1, 22, ..., 41
chunk_size <- 20

fit_chunk <- 2:201 %>% 
  split(ceiling(seq_along(.) / chunk_size)) %>% 
  map(function(conditions){
    dds_chunk <- dds[, dds$condition %in% c("cond1", str_c("cond", conditions))]
    glm_gp(assay(dds_chunk), design = design(dds_chunk), col_data = colData(dds_chunk), size_factors = rep(1, ncol(dds_chunk)))
  })

# matrix of Beta estimates
beta_chunk <- cbind(fit_chunk[[1]]$Beta[,1,drop=F],do.call("cbind",map(fit_chunk,~.x$Beta[,-1])))

# overdispersions by averaging estimates for each gene
# over all chunks
overdispersions_chunk <- map(fit_chunk,~.x$overdispersions) %>% 
  do.call("cbind",.) %>% 
  rowMeans()

# overdispersions of full vs chunked model
tibble(
  full = overdispersions_full,
  chunk = overdispersions_chunk
) %>% 
  ggplot(aes(full, chunk)) +
    geom_point() +
    geom_abline() +
    labs(title = "overdispersion")

# beta estimates
reshape2::melt(beta_full, value.name="beta_full") %>% 
  left_join(reshape2::melt(beta_chunk, value.name="beta_chunk"), by=c("Var1","Var2")) %>% 
  filter(beta_full > -1e-7) %>% 
  ggplot(aes(beta_full, beta_chunk)) +
    geom_point() +
    geom_abline() +
    labs(title = "Beta")

[Figure: overdispersion estimates, full model vs. chunked model]

[Figure: Beta estimates, full model vs. chunked model]

The full model takes 10 seconds on my machine, the chunked version 0.4 seconds.

I followed up on your hint and looked briefly into the GWAS literature. One strategy is variable selection, so that not all covariates have to be estimated (see the time and memory requirements section here, and the paper referenced therein).

@const-ae
Owner

Hey Frederik,

I am very sorry for the late reply; there was too much else going on. You can make the regular function fast by setting do_cox_reid_adjustment = FALSE. The following takes 0.135 seconds on my computer (i.e., a 75x speed-up):

system.time(
  fit <- glm_gp(assay(dds), design = design(dds), col_data = colData(dds), size_factors = rep(1, ncol(dds)), 
                do_cox_reid_adjustment = FALSE)
)

The problem was that I assumed the issue was in estimating the beta coefficients, where I cannot beat the cubic scaling. I did not realize that your design matrix has a convenient form where we know the coefficients are independent, so the fit can scale linearly with the number of coefficients. In fact, estimating the beta coefficients was already fast, but the overdispersion estimation (which I thought was linear) scaled cubically, because the Cox-Reid (McCarthy 2012) adjustment is implemented as a plain matrix product ($X^T \operatorname{diag}(w)\, X$).
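For context, a minimal sketch of that adjustment term (an illustration under the same weight definition as above, not the package's actual code): the Cox-Reid adjusted profile likelihood adds $-\tfrac{1}{2}\log\det(X^T \operatorname{diag}(w)\, X)$, and forming and factorising that p x p matrix is the cubic step.

# Cox-Reid adjustment term, computed densely (illustration only)
cox_reid_adjustment <- function(X, mu, overdispersion) {
  w <- mu / (1 + overdispersion * mu)   # working weights
  info <- crossprod(X, w * X)           # t(X) %*% diag(w) %*% X
  -sum(log(diag(chol(info))))           # equals -0.5 * log det(info)
}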

I could implement a faster version of the Cox-Reid adjustment that exploits the known sparsity of the design matrix. However, that would complicate the methods significantly, so I am hesitant. There is a difference between the overdispersion estimates, but it is not too big.
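To sketch what such a shortcut could look like (a hypothetical approach, not an implemented glmGamPoi feature): with a cell-means parameterisation such as ~ 0 + condition, every sample loads on exactly one column, so $X^T \operatorname{diag}(w)\, X$ is diagonal and the adjustment reduces to per-condition sums of weights, which costs linear time.

# hypothetical linear-time Cox-Reid term for a one-hot design ~ 0 + condition
cox_reid_adjustment_onehot <- function(condition, mu, overdispersion) {
  w <- mu / (1 + overdispersion * mu)
  diag_entries <- tapply(w, condition, sum)  # diagonal of t(X) diag(w) X
  -0.5 * sum(log(diag_entries))
}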
