glum LASSO fit not finishing when same problem fits within 1 second in glmnet #709

tomwenseleers opened this issue Oct 11, 2023 · 4 comments



tomwenseleers commented Oct 11, 2023

I was just testing glum, but ran into a problem where a glum LASSO fit does not appear to finish (or, if it does, it finishes extremely slowly). The data I simulated is a spike train blurred with a Gaussian point spread function, with some Gaussian noise added, with n = p = 10000; the sparse covariate matrix X consists of 10000 time-shifted copies of a Gaussian point spread function (i.e. it is a banded matrix), and 1000 out of the 10000 coefficients are nonzero (all positive, i.e. using nonnegativity constraints):
https://github.com/tomwenseleers/glmnet-benchmark/tree/master/data
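
(For reference, a minimal R sketch of how data of this type could be simulated; the parameter values below are my own illustrative choices, not necessarily those used to generate the linked files.)

library(Matrix)
set.seed(1)
n <- p <- 1000                      # the benchmark itself uses n = p = 10000
sigma_psf <- 3                      # assumed width of the Gaussian point spread function
grid <- seq_len(n)
cols <- sapply(seq_len(p), function(j) {
  v <- dnorm(grid, mean = j, sd = sigma_psf)  # time-shifted copy of the PSF
  v[v < 1e-8] <- 0                            # truncate the tails -> banded, sparse column
  v
})
X <- Matrix(cols, sparse = TRUE)    # banded sparse design matrix
beta_true <- numeric(p)
nonzero <- sample.int(p, p / 10)    # 10% of coefficients nonzero, all positive
beta_true[nonzero] <- rexp(length(nonzero))
y <- as.vector(X %*% beta_true) + rnorm(n, sd = 0.1)  # add Gaussian noise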

However, when I try to fit a LASSO model using glum, it never appears to finish (or, if it does, it does so extremely slowly). The syntax I used is the following - am I doing something wrong here?
https://github.com/tomwenseleers/glmnet-benchmark/blob/master/bench_glum.py

import numpy as np
from scipy.sparse import csc_matrix
from scipy.io import mmread
import os
import time
from glum import GeneralizedLinearRegressorCV
from glum import GeneralizedLinearRegressor

def get_data(n, p, data_prefix):
    # Create file paths
    X_file = os.path.join(data_prefix, f"{n}_{p}_X.mtx")
    y_file = os.path.join(data_prefix, f"{n}_{p}_y.csv")
    beta_true_file = os.path.join(data_prefix, f"{n}_{p}_beta_true.csv")
    
    # Read the data
    X = csc_matrix(mmread(X_file))  # Read .mtx file and convert to CSC format
    y = np.genfromtxt(y_file, delimiter=',', skip_header=1)
    beta_true = np.genfromtxt(beta_true_file, delimiter=',', skip_header=1)
    
    return X, y, beta_true

def timer_glr(X, y, glr):
    start = time.time()
    glr_fit = glr.fit(X, y)
    end = time.time()
    return glr_fit, end-start

# load simulated data (spike train convolved with a Gaussian point spread function,
# with 1000 / 10000 coefficients being nonzero (all positive) and Gaussian noise
# added)
n = 10000
p = 10000
X_sparse, y, beta_true = get_data(n, p, 'data')

# determine optimal alpha using cross validation
glrcv = GeneralizedLinearRegressorCV(l1_ratio=1, # lasso
          family='normal',
          fit_intercept=False,
          # gradient_tol=1e-7,
          scale_predictors=False,
          lower_bounds = np.zeros(p), # nonnegativity constraints
          min_alpha_ratio=0.01 if n < p else 1e-4,
          solver='irls-cd',
          cv=10, # use 10-fold CV as in glmnet
          max_iter=10000,
          n_alphas=100 # as in glmnet
          )
glrcv_fit, elapsed = timer_glr(X_sparse, y, glrcv)
print("Coef:\n", glrcv_fit.coef_)
print("Intercept: ", glrcv_fit.intercept_)
print("N_iter: ", glrcv_fit.n_iter_)
print("Elapsed: ", elapsed)
# now refit on complete dataset using optimal alpha
glr = GeneralizedLinearRegressor(l1_ratio=1, # lasso
          family='normal',
          fit_intercept=False,
          # gradient_tol=1e-7,
          scale_predictors=False,
          lower_bounds = np.zeros(p), # nonnegativity constraints
          min_alpha_ratio=0.01 if n < p else 1e-4,
          solver='irls-cd',
          max_iter=10000,
          alpha=glrcv_fit.alpha_
          #n_alphas=100,
          #alpha_search=True
          )
glr_fit, elapsed = timer_glr(X_sparse, y, glr)
print("Coef:\n", glr_fit.coef_)
print("Intercept: ", glr_fit.intercept_)
print("N_iter: ", glr_fit.n_iter_)
print("Elapsed: ", elapsed)

If I fit the same problem with glmnet (version glmnet_4.1-8, the recent version redone in Rcpp), it finishes in less than a second:
https://github.com/tomwenseleers/glmnet-benchmark/blob/master/bench_glmnet.R

library(glmnet)
library(microbenchmark)
library(Matrix)
library(L0glm) # my own experimental GLM package to fit L0 penalized GLMs
library(export)

setwd("~/Github/glmnet-benchmark")

data_prefix = 'data'
n = 10000L
p = 10000L
 
get_data = function(n, p) {
  list(X=as(as(readMM(paste0(data_prefix, '/', n, '_', p, '_X.mtx')), "CsparseMatrix"), "dgCMatrix"), 
       y=read.csv(paste0(data_prefix, '/', n, '_', p, '_y.csv'), header=T),
       beta_true=read.csv(paste0(data_prefix, '/', n, '_', p, '_beta_true.csv'), header=T))
}

timer_glmnet = function(X, y) {
  time.out = microbenchmark({  cvfit <- cv.glmnet(X, y, family='gaussian', 
                                                  tol=1e-14, standardize=F, 
                                                  alpha=1, # LASSO
                                                  lower.limits=0, # nonnegativity constraints
                                                  intercept=FALSE, # no intercept
                                                  nlambda = 100) 
  fit <- glmnet(X, y, family='gaussian', 
                tol=1e-14, standardize=F, 
                alpha=1, # LASSO
                lower.limits=0, # nonnegativity constraints
                intercept=FALSE, # no intercept
                nlambda = 100) }, 
  times=1L, 
  unit='s')
  coefs = coef(fit, s=cvfit$lambda.min)
  
  list(fit=fit, coefs=coefs, elapsed=summary(time.out)$mean)
}


dat = get_data(n, p)
X_dense = as.matrix(dat$X)
X_sparse = dat$X
y = dat$y[,1]
coefs_true = dat$beta_true[,1]

## timings for glmnet when fit as sparse system ####
out_sparse = timer_glmnet(X_sparse, y)
R = cor(as.vector(out_sparse$coefs)[-1], coefs_true)
print("Correlation between estimated & true coefs:\n")
print(round(R, 4)) 
print(paste("N_iter:", out_sparse$fit$npasses)) 
print(paste("Elapsed:", out_sparse$elapsed))
# [1] "Correlation between estimated & true coefs:\n"
# > print(round(R, 4)) 
# [1] 0.9092
# > print(paste("N_iter:", out_sparse$fit$npasses)) 
# [1] "N_iter: 762"
# > print(paste("Elapsed:", out_dense$elapsed))
# [1] "Elapsed: 0.9451085"

plot(x=as.vector(out_sparse$coefs)[-1], y=coefs_true, pch=16, col='steelblue',
     xlab="estimated coefficients", ylab="true coefficients",
     main=paste0("glmnet nonnegative LASSO regression\n(n=", n,", p=", p,", R=", round(R,4), ")"))
dir.create("./plots/")
graph2png(file="./plots/glmnet_LASSO_true_vs_estimated_coefs.png", width=7, height=5)

table(as.vector(out_sparse$coefs)[-1]!=0, 
      coefs_true!=0, dnn=c("estimated beta nonzero", "true beta nonzero"))
#                 true beta nonzero
# estimated beta nonzero FALSE TRUE
#                  FALSE  7929  195
#                  TRUE   1071  805

[Plot: glmnet nonnegative LASSO regression, estimated vs. true coefficients]

Fitting it with the covariate matrix coded as dense is slower, but still doable:

## timings for glmnet when fit as dense system ####
out_dense = timer_glmnet(X_dense, y)
R = cor(as.vector(out_dense$coefs)[-1], coefs_true)
print("Correlation between estimated & true coefs:\n")
print(round(R, 4)) 
print(paste("N_iter:", out_dense$fit$npasses)) 
print(paste("Elapsed:", out_dense$elapsed))
# [1] "Correlation between estimated & true coefs:\n"
# > print(round(R, 4)) 
# [1] 0.9063
# > print(paste("N_iter:", out_dense$fit$npasses)) 
# [1] "N_iter: 762"
# > print(paste("Elapsed:", out_dense$elapsed))
# [1] "Elapsed: 86.74697"

And incidentally, my own R package in development (L0glm), which fits L0-penalized GLMs using an iterative adaptive ridge procedure, can fit this problem 4x faster still than glmnet, and it also returns a much better quality solution, with far fewer false positives, more true positives and better predictive performance (it uses a very simple algorithm in which the weighted least squares step of the GLM IRLS algorithm is replaced by a ridge regression with adaptive penalties, here using the Eigen Cholesky LLT solver):

system.time(L0glm_sparse_chol <- L0glm.fit(X_sparse, y, 
                                      family = gaussian(identity), 
                                      lambda = "gic", # options "aic", "bic", "gic", "ebic", "mbic"
                                      nonnegative = TRUE,
                                      solver = "eigen",
                                      method = 7L) # Cholesky LLT
) # 0.18s, i.e. 4x faster than sparse glmnet LASSO fit

R = cor(as.vector(L0glm_sparse_chol$coefficients), coefs_true)
print("Correlation between estimated & true coefs:\n")
print(round(R, 4)) # 0.9929 - much better solution than LASSO

plot(x=as.vector(L0glm_sparse_chol$coefficients), y=coefs_true, pch=16, col='steelblue',
     xlab="estimated coefficients", ylab="true coefficients",
     main=paste0("L0glm regression\n(n=", n,", p=", p, ", R=", round(R,4), ")"))
graph2png(file="./plots/L0glm_true_vs_estimated_coefs.png", width=7, height=5)

table(as.vector(L0glm_sparse_chol$coefficients)!=0, 
      coefs_true!=0, dnn=c("estimated beta nonzero", "true beta nonzero"));
# better true positive rate & much lower false positive rate than LASSO
#                 true beta nonzero
# estimated beta nonzero FALSE TRUE
#                  FALSE  8919   45
#                  TRUE     81  955

[Plot: L0glm regression, estimated vs. true coefficients]

Any thoughts on what I might be doing wrong in my usage of glum? Or is this rather a bug?

Also, would it be possible to update the included benchmarks with a wider range of simulated data, e.g. generated using abess's make_glm_data function in
https://github.com/abess-team/abess/blob/master/python/abess/datasets.py
or, in R,
https://github.com/abess-team/abess/blob/master/R-package/R/generate.data.R ? For example, for varying n and p and different correlation structures and error families?
Right now I feel the included benchmarks are out of date relative to developments in other packages...

@MarcAntoineSchmidtQC
Member

The short answer to this is that glum is not designed to solve that type of problem. From the documentation (in the benchmark section): "Note that glum was originally developed to solve problems where N >> K (number of observations is larger than the number of predictors), which is the case for the following benchmarks."

I know that there is a large subset of problems where the number of coefficients is close to or larger than the number of observations, but that is simply not what glum was designed for. The algorithm currently materializes the full Hessian matrix, which in your case would be a 10k x 10k matrix (and would be very inefficient). There has been some effort to make glum work well in this type of situation, but it is not there yet.
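
For a rough sense of scale (my own back-of-envelope arithmetic, not a figure from the glum documentation), just materializing that Hessian as dense doubles already takes roughly 0.75 GiB:

# dense p x p Hessian of doubles for p = 10000
p <- 10000
p^2 * 8 / 1024^3  # ~0.75 GiB per materialized Hessian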

We should maybe make this more explicit in the documentation, and as you suggested, we should update our benchmarks.


tomwenseleers commented Oct 11, 2023

Ha, that's a shame... I should say that even for the n > p case I mostly found glmnet to outperform glum, though I haven't tried with super huge n and very small p, and this was just in some quick tests... So I think the benchmarks definitely need updating... For very tall problems with n >> p you might also like to look at the oem package and compare its performance with what you have.

In case you would be interested in eventually also covering the p >= n case and allowing for good sparse GLM fits there: what I did was simply to take the plain R glm IRLS algorithm, which more or less comes down to

glm_irls =
function(X, # design matrix
         y, # response
         weights = rep(1, nrow(X)), # prior observation weights
                    # (e.g. total nr. of trials for proportions with family=binomial)
         start = rep(1, ncol(X)), # coefficient starting values
         offset = rep(0, nrow(X)), # model offset 
         family = gaussian(identity), # distribution & link function
         maxit = 25, 
         tol = 1e-08)
{
  beta = start
  nobs = nrow(X)
  nvars = ncol(X)
  etastart = mustart = NULL # needed by family$initialize
  eval(family$initialize) # initializes n and mustart
  eta = family$linkfun(mustart) # initialize η = g(µ)
  mu = family$linkinv(eta) # predictions on response scale µ

  for (i in 1:maxit)
  {
    var      = family$variance(mu) # variance in function of the mean µ
    gprime   = family$mu.eta(eta) # derivative of the inverse link function w.r.t. η = d(g-1)/dη = dμ/dη
    gradient = y - mu # gradient of log-likelihood with respect to η = ∂ℓ/∂η = deviance residual
    z        = eta - offset + gradient / gprime 
             # adjusted response 
             # = linearised version of log-likelihood function ℓ around η
    W        = weights * as.vector(gprime^2 / var) 
             # W = diagonal of expected Hessian = reliability of observations
    betaold  = beta
    beta     = solve(crossprod(X,W*X), crossprod(X,W*z)) 
    # coefficient update based on quadratic approximation of log likelihood 
    # using weighted least square regression
    eta    = offset + X %*% beta # linear predictor, i.e. predictions on link scale
    mu     = family$linkinv(eta) # predictions on response scale µ = g-1(η)
       
    if(sqrt(crossprod(beta-betaold)) < tol) break
  }
  return(list(coefficients=beta, iterations=i))
}

and replace the weighted least squares step beta = solve(crossprod(X,W*X), crossprod(X,W*z)) with a weighted ridge regression step, with ridge penalties = (1/4) * lambda * estimated dispersion * penalty weights, where the penalty weights = 1/beta^2 and coefficients beta < thresh = 1E-10 are taken out of the model in each step (to avoid infinite penalties). In this way one can asymptotically approximate an L0-penalized GLM. If one then sets lambda to 2, log(n), log(p) * log(log(n)), (log(n) + 2 * (1 - log(n) / (2 * log(p))) * log(p)) or log(n * (p ^ 2) / 16), the final model at convergence is expected to be optimal in terms of AIC, BIC, GIC, EBIC or MBIC respectively (i.e. no cross validation is required to tune the penalty, which is a great advantage). The estimated dispersion is initialized at 1 but is updated once the fit starts to converge, for those families that have a dispersion parameter (e.g. gaussian, Gamma, etc.; for binomial and poisson it is fixed at 1). The 1/4 factor was proven in Frommlet & Nuel 2016 to be needed to approximate an L0 penalty (strictly speaking only for the Gaussian case and for orthogonal designs, but in practice it works well for other families and for non-orthogonal designs too). I added the estimated dispersion because Frommlet & Nuel seemed to assume it was known to be 1, which of course is never the case with Gaussian errors. So in each IRLS iteration I update not only the predicted values, the quadratic approximation to the log likelihood and the observation weights, but also the estimated dispersion (except for binomial and poisson, where it is 1) and the adaptive ridge penalties; a sketch of the modified coefficient update is given below.
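
To make this concrete, here is a minimal sketch (not the actual L0glm code) of what the coefficient update inside the IRLS loop above then becomes, assuming W, z, beta and an estimated dispersion are available from the current iteration:

# adaptive-ridge replacement for the WLS update in the IRLS loop above
# (sketch, not the L0glm source; W, z, beta and dispersion come from the current iteration)
lambda <- log(nrow(X))                  # BIC; use 2 for AIC, log(p)*log(log(n)) for GIC, etc.
thresh <- 1e-10
beta   <- drop(beta)                    # make sure beta is a plain vector
active <- abs(beta) > thresh            # take effectively-zero coefficients out of the model
pen    <- 0.25 * lambda * dispersion / beta[active]^2  # adaptive ridge penalties, weights = 1/beta^2
Xa     <- X[, active, drop = FALSE]
beta[!active] <- 0
beta[active]  <- solve(crossprod(Xa, W * Xa) + diag(pen, nrow = length(pen)),
                       crossprod(Xa, W * z))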

Given that the fit uses ridge penalties, this procedure also works very well for high-dimensional cases, and for the p >> n case one can use the Woodbury identity to solve the ridge regression more efficiently (requiring one to solve just an n x n system as opposed to a p x p one), as in Liu et al. 2017. But it is also very efficient for very tall problems with n >> p or with n = p; problems with millions of observations or variables are no problem. Usually fewer than 20 IRLS iterations are needed for the algorithm to converge, and it does not need cross validation to tune the level of regularisation. The ridge regression can in general be solved with any least squares solver (sparse or dense, e.g. the Eigen Cholesky ones, a coordinate descent solver or the Eigen least squares conjugate gradient solver), which makes the algorithm quite easy to implement: one simple option is to augment the covariate matrix with a diagonal matrix with sqrt(lambdas) along the diagonal (or a general Tikhonov regularisation matrix to support more exotic penalties, e.g. fused penalties) and to augment y with zeros (or with sqrt(lambdas)*prior.means if you would like to also support shrinking towards non-zero-centered Gaussian priors); a small sketch of this is given below. Box constraints or nonnegativity constraints implemented via simple clipping in each IRLS step also work very well - they work almost as well as imposing hard constraints with a quadratic programming solver such as osqp (the latter would also readily allow one to impose linear inequality constraints). Anyway, let me know if at any stage in the future you might be interested in supporting L0 penalties for either p > n or n > p models; I could advise you on that...
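
For illustration, a small self-contained sketch of the augmentation trick and the clipping mentioned above (the function and argument names are my own, not the L0glm API; dense X for simplicity):

# weighted ridge regression solved with an ordinary least squares solver
# by stacking sqrt-penalty rows under X (sketch, illustrative names only)
ridge_via_augmentation <- function(X, y, w, lambdas,
                                   prior_means = rep(0, ncol(X)),
                                   nonnegative = TRUE) {
  Xw    <- sqrt(w) * X                          # fold the observation weights into X and y
  yw    <- sqrt(w) * y
  X_aug <- rbind(Xw, diag(sqrt(lambdas), nrow = ncol(X)))
  y_aug <- c(yw, sqrt(lambdas) * prior_means)   # zeros when shrinking towards zero
  beta  <- qr.solve(X_aug, y_aug)               # any least squares solver will do here
  if (nonnegative) beta <- pmax(beta, 0)        # nonnegativity via simple clipping
  beta
}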

An extension to multitask learning problems with multivariate outcomes is also straightforward: start by fitting each task independently, average the coefficients over the different tasks, and use this average as the starting coefficients for the fits on each task (so that features with large average coefficients across all tasks get smaller adaptive ridge penalties). Iterate this process until the features selected across all tasks converge. As a final step, one can do a relaxed fit on the selected features with a low, near-zero penalty (a rough sketch is given below). These relaxed group-L0-penalized GLMs massively outperform e.g. group LASSO (mgaussian) glmnet fits and readily extend to any GLM distribution.
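
A rough sketch of that multitask scheme, where fit_adaptive_ridge() is a hypothetical helper standing in for a single adaptive-ridge (approximate L0) GLM fit as described above:

# multitask / group-L0 scheme (sketch; fit_adaptive_ridge() is hypothetical)
fit_multitask_L0 <- function(X, Y, maxit = 20, thresh = 1e-10) {
  K <- ncol(Y)
  fit_task <- function(k, start = NULL)
    fit_adaptive_ridge(X, Y[, k], start = start)             # one single-task fit
  fits <- lapply(seq_len(K), fit_task)                       # step 1: fit each task independently
  selected_old <- integer(0)
  for (it in seq_len(maxit)) {
    beta_avg <- rowMeans(sapply(fits, `[[`, "coefficients")) # step 2: average coefficients over tasks
    fits <- lapply(seq_len(K), fit_task, start = beta_avg)   # step 3: refit each task from the average
    selected <- which(rowMeans(abs(sapply(fits, `[[`, "coefficients"))) > thresh)
    if (identical(selected, selected_old)) break             # step 4: stop when the selection is stable
    selected_old <- selected
  }
  # step 5: relaxed refit on the selected features with a near-zero penalty
  lapply(seq_len(K), function(k)
    fit_adaptive_ridge(X[, selected, drop = FALSE], Y[, k], lambda = 1e-8))
}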

I am not sufficiently proficient in Python myself to take on any development work, but I would be happy to advise on the required steps, based on my experience developing an R package to fit L0- and group-L0-penalized GLMs, with timing performance equal to or better than glmnet, abess and L0Learn, and with quality of solution generally much better (often spectacularly so, especially for the multitask learning case)...

@MarcAntoineSchmidtQC
Member

Thanks a lot for your detailed reply. This gives us a lot to think about and evaluate. I will try updating the benchmarks this week or the next.

We don't currently have the short-term resources to tackle this, but I will be on the lookout for an eventual contributor willing to take the dive into this. I have never worked with L0 penalties before but I will be talking to other people using the package to gauge their interest.


tomwenseleers commented Oct 17, 2023

Thanks for that! FYI, I was benchmarking the speed of some GLM fitting engines in R the other day (speedglm, fastglm, parglm, bigglm, glmnet's bigGlm), and for tall datasets with n >> p, parglm's parglm.fit function appeared most promising for big datasets... A Gamma GLM with n = 10000000 and p = 70 runs in 9 s (or 4.9 s if you call the underlying C++ function directly) on my laptop, which is pretty performant I think (this was using R with Intel MKL as the BLAS):

library(parglm)
n <- 10000000
p <- 70
y <- rgamma(n,1.5,1)
x <- round( matrix(rnorm(n*p),n,p),digits=3)
colnames(x) <-paste("v",1:p,sep = "")
da<- data.frame(y,x)
fo <- as.formula(paste("y~",paste(paste("v",1:p,sep=""),collapse="+")))   
X <- model.matrix(fo, data=da)

system.time(m <- parglm.fit(X, y, intercept = F, 
                             control = parglm.control(nthreads=8L, 
                                                      block_size=1000, 
                                                      method="FAST"), # uses the Armadillo Cholesky LLT solver (falling back on installed BLAS, in my case Intel MKL)
                             data=da,
                             family=Gamma(log))) # 9s

system.time(m2 <- parglm:::parallelglm(X=X, Ys=y, 
                                        family=paste0(Gamma(log)$family, "_", Gamma(log)$link), 
                                        start=rep(0,ncol(X)),
                                        weights=rep(1, nrow(X)), 
                                        offsets=rep(0,nrow(X)),
                                        tol=1E-8, nthreads=8L, 
                                        it_max=100L, trace=FALSE,
                                        method="FAST",  # Cholesky LLT for each block
                                        block_size=1000, 
                                        use_start=FALSE)) # 4.9s
