Non-smoothness of mxPenalty functions #343

Open
jhorzek opened this issue Jun 3, 2022 · 9 comments

jhorzek commented Jun 3, 2022

Dear all,

I am very excited to see that OpenMx now also supports regularization of parameters with lasso and similar penalty functions! If I understand it correctly, OpenMx uses a smooth approximation of the non-differentiable penalty functions; the documentation states: "Smoothing is controlled by smoothProportion. If smoothProportion is zero then the traditional discontinuous functions are used. Otherwise, smoothProportion of the region between epsilon and zero is used for smoothing." I wanted to learn more about this procedure and looked into the code files. I may be totally wrong here as I don't have much experience with the C++ code underlying OpenMx, but to my understanding, an R translation of the lasso functions would look as follows (see penalty.cpp, lines 46 and following):

	# Returns 0 if |par| is below (1 - smoothProportion)*epsilon, 1 if |par| is
	# above epsilon, and a linear ramp in between:
	penaltyStrength <- function(absPar, epsilon, smoothProportion){
	  active = epsilon
	  width = active * smoothProportion
	  inactive = active - width
	  if (absPar > active) return (1)
	  if (absPar < inactive) return (0)
	  return ((absPar - inactive) / width)
	}
	
	# I separate the lasso penalty value and 
	# the lasso gradient value in two functions
	LassoPenalty <- function(parameters, lambda, scale,
	                         epsilon, smoothProportion)
	{
	  
	  tmp = 0
	  for (px in 1:length(parameters)) {
	    par = abs(parameters[px] / scale[px])
	    tmp = tmp + penaltyStrength(par, epsilon, smoothProportion) * par
	  }
	  return(tmp * lambda)
	}
	
	LassoGradient <- function(parameters,
	                          gradients, 
	                          lambda, 
	                          scale,
	                          epsilon, 
	                          smoothProportion)
	{
	  tmp = 0
	  for (px in 1:length(parameters)) {
	    par = abs(parameters[px] / scale[px])
	    gradients[px] = gradients[px] + 
	      penaltyStrength(par,
	                      epsilon, 
	                      smoothProportion) *
	      sign(parameters[px])*lambda
	  }
	  return(gradients)
	}

If I am not mistaken, the current implementation does result in a penalty function which has more non-differentiable points than the lasso - four instead of one:

	parValues <- seq(-1.5e-5,1.5e-5,length.out = 1000)
	epsilon <- 1e-5
	smoothProportion <- .05
	lambda <- 1
	
	penValue <- rep(NA, length(parValues))
	
	for(i in 1:length(parValues)){
	  penValue[i] <- LassoPenalty(parameters = parValues[i], 
	                              lambda = lambda, 
	                              scale = 1, 
	                              epsilon = epsilon, 
	                              smoothProportion = smoothProportion)
	}
	plot(parValues, penValue, type = "l")

Also, in the gradient function, it seems as if the contribution of the penalty term is set to zero close to the origin:

	gradValue <-  rep(NA, length(parValues))
	
	# for simplicity, we will use a parabola as gradient of
	# the log-Likelihood:
	
	likelihoodGradients <- 1e9*(parValues-.5e-5)^2
	
	for(i in 1:length(parValues)){
	  gradValue[i] <- LassoGradient(gradients = likelihoodGradients[i],
	                                parameters = parValues[i], 
	                                lambda = lambda, 
	                                scale = 1, 
	                                epsilon = epsilon, 
	                                smoothProportion = smoothProportion)  
	  
	}
	plot(parValues, gradValue, type = "l")
	lines(parValues, likelihoodGradients, col = "blue")

Note that at a parameter value of zero, a subgradient would typically be used; the subdifferential of the absolute value at zero is the interval [-1, 1]. If the gradient contribution of the penalty is instead set to zero, all that is left is the gradient of the log-likelihood. This gradient will be non-zero in most cases, so the optimizer may have difficulty finding the actual minimum.
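
To make this concrete, here is the subgradient condition at zero in a small R sketch (the helper name is just for illustration):

	# The subgradient of lambda*|theta| at theta = 0 is the interval [-lambda, lambda].
	# Hence theta = 0 is a stationary point of fit(theta) + lambda*|theta| whenever
	# the gradient of the unpenalized fit function satisfies |fitGradient| <= lambda.
	zeroIsStationary <- function(fitGradientAtZero, lambda) {
	  abs(fitGradientAtZero) <= lambda
	}
	
	# With the smoothed penalty above, penaltyStrength(0, ...) returns 0, so the
	# optimizer only sees the fit gradient at the origin and will step away from
	# zero whenever that gradient is non-zero, even if zeroIsStationary() is TRUE.
	zeroIsStationary(fitGradientAtZero = 0.5, lambda = 1) # TRUE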

An alternative could be to approximate the lasso penalty as proposed by Lee et al. (2006; see epsL1 on p. 405). A simple implementation could look something like this:

	// NOTE: For simplicity, the "scale" of the parameters is not taken into account in the following. 
	// eps refers to the parameter epsilon used by Lee et al. (2006; see epsL1 on p. 405) 
	
	void LassoPenalty::compute(int want, FitContext *fc)
	{
	  double lambda = getHP(fc, 0);
		if (want & FF_COMPUTE_FIT) {
	    double tmp = 0;
	    for (int px = 0; px < params.size(); ++px) {
	      double par = std::sqrt(std::pow(fc->est[ params[px] ],2) + eps);
	      // eps controls the smoothness. If eps > 0, the lasso penalty is always differentiable
	      tmp += par; 
	    }
	    matrix->data[0] = tmp * lambda;
	  }
	  if (want & FF_COMPUTE_GRADIENT) {
	    for (int px = 0; px < params.size(); ++px) {
	      double par = fc->est[ params[px] ];
	      fc->gradZ[ params[px] ] += lambda * (par/std::sqrt(std::pow(par,2) + eps));
	      // eps controls the smoothness. If eps > 0, the lasso penalty is always differentiable
	    }
	  }
	}

In R, this would look as follows:

	epsL1Lasso <- function(parameters, lambda, eps){
	  return(lambda*sum(sqrt((parameters)^2 + eps)))
	}
	
	epsL1LassoGradient <- function(parameters, 
	                               gradients, lambda, eps){
	  gradients <- gradients + lambda*
	    parameters*
	    (1/sqrt((parameters)^2 + eps))
	  return(gradients)
	}
	
	for(i in 1:length(parValues)){
	  penValue[i] <- epsL1Lasso(parameters = parValues[i], 
	                            lambda = lambda, 
	                            eps = 1e-4)
	  gradValue[i] <- epsL1LassoGradient(parameters = parValues[i], 
	                                     gradients = likelihoodGradients[i],
	                                     lambda = lambda, 
	                                     eps = 1e-10)
	}
	# the penalty values are now smooth
	plot(parValues, penValue, type = "l")
	# and the gradients as well:
	plot(parValues, gradValue, type = "l")
	lines(parValues, likelihoodGradients, col = "blue")

This implementation may come with other limitations, however. In my experience, a specialized optimizer for lasso penalty functions often returns better results.
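
For illustration, such specialized optimizers (e.g. proximal gradient or coordinate descent methods) exploit that the proximal operator of the lasso penalty has a closed form, the soft-thresholding function, which can set parameters to exactly zero without any smoothing. A minimal sketch (not OpenMx code):

	# Soft-thresholding: the proximal operator of lambda*|theta|.
	softThreshold <- function(theta, lambda) {
	  sign(theta) * pmax(abs(theta) - lambda, 0)
	}
	
	# One proximal-gradient (ISTA-style) step with step size s would be
	# theta <- softThreshold(theta - s * fitGradient(theta), s * lambda),
	# which yields exact zeros for sufficiently small parameters:
	softThreshold(c(-0.3, 0.05, 0.8), lambda = 0.1) # -0.2  0.0  0.7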

Best,
Jannik

Lee, S.-I., Lee, H., Abbeel, P., & Ng, A. Y. (2006). Efficient L1 Regularized Logistic Regression. Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06), 401–408.

jpritikin (Member) commented

That's a nice improvement. Can you make your proposal the default and rename the current lasso to "oldLasso"?


jhorzek commented Jun 4, 2022

Awesome! I will give it a try, but it may take some time as I am not familiar enough with the OpenMx code base yet and don't want to break anything. I'll keep you updated.

jpritikin (Member) commented

I'm happy to answer any questions or review the change.


jhorzek commented Jun 7, 2022

I've created some preliminary versions of the alternative penalty functions in a fork (https://github.com/jhorzek/OpenMx/tree/penaltyFunctions). Do you already have scripts to test all penalty functions which I could use to test my implementation?

jpritikin (Member) commented

See git grep -l mxPenalty. I find:

	inst/models/passing/Acemix2.R
	inst/models/passing/jointFactorWls.R
	inst/models/passing/regularize.R
	vignettes/reg_mimic.Rmd
	vignettes/regularization.Rmd

jpritikin (Member) commented

This is exactly the wrong approach:

-imxPenaltyTypes <- c('lasso', 'ridge', 'elasticNet')
+imxPenaltyTypes <- c('lasso', 'ridge', 'elasticNet',
+                     'smoothLasso', 'smoothRidge', 'smoothElasticNet')

We should use your new implementation by default. So move the current implementation to oldLasso, oldRidge, etc.
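
That is, roughly (extrapolating the "etc." to elasticNet):

	imxPenaltyTypes <- c('lasso', 'ridge', 'elasticNet',
	                     'oldLasso', 'oldRidge', 'oldElasticNet')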

The naming in the C++ code can stay the same though since that's not user visible.

Otherwise, looks good.

jpritikin (Member) commented

Also, add yourself as a contributor (ctb) in DESCRIPTION.in.


jhorzek commented Jun 8, 2022

I did not know about the git grep command - thank you for the info! I wanted to leave the current implementation of the penalty functions untouched until I am fairly certain that my implementation works. I'll change all names to the ones you suggested before creating a pull request, however.


jhorzek commented May 23, 2023

I am sorry for leaving this issue open for so long. When implementing regularized SEM with non-smooth penalty functions, I found that the combination of fit indices and smoothed penalties can be problematic.
Basically, what happens is that none of the parameters will be exactly zero due to the smoothing. Afterwards, information criteria are typically used to select the best model among those fitted with different penalty values; these criteria reward sparsity of the parameter vector, that is, they count the number of non-zero parameters. Because the smoothed penalty never yields exact zeros, it seems that the current implementation in OpenMx uses a threshold below which parameters are set to zero (this is implemented here). The degrees of freedom are then computed based on this threshold (see here).
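
Schematically, this post-processing looks something like the following (the function and argument names here are just for illustration, not the actual OpenMx code):

	# Zero out parameters whose absolute estimate falls below a threshold and
	# count the remaining non-zero parameters; that count enters the degrees of
	# freedom and hence the information criteria.
	thresholdEstimates <- function(estimates, zeroThreshold = 1e-4) {
	  zeroed <- abs(estimates) < zeroThreshold
	  estimates[zeroed] <- 0
	  list(estimates = estimates, nonZero = sum(!zeroed))
	}
	
	# Because the smoothed penalty never produces exact zeros, zeroThreshold
	# decides which parameters count as "selected" and hence which model the
	# information criteria favor.
	thresholdEstimates(c(0.31, 8e-5, -2e-5))
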
I've used the same procedure in my own implementations of regularized SEM, but found that the threshold parameter can have a substantial influence on the final parameter estimates. Therefore, I decided to switch to specialized optimizers that take the non-differentiability into account. I've implemented such optimizers in the lessOptimizers repository for the lessSEM package, which I've used for an in-depth comparison of smoothed and non-smoothed penalty functions (see here).
An alternative procedure is to adapt the computation of the degrees of freedom. Such procedures are, for instance, used by Geminiani et al. (2021). Alternatively, cross-validation could be used instead of information criteria. Here, no rounding is required because we don't reward the sparsity of the parameter vector.
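
A rough skeleton of the cross-validation idea (fitFun and lossFun are user-supplied placeholders, not OpenMx functions):

	# k-fold cross-validation over a grid of penalty values. fitFun(trainData, lambda)
	# fits the regularized model; lossFun(fit, testData) returns an out-of-sample loss
	# such as the negative log-likelihood of the held-out data.
	selectLambdaByCV <- function(data, lambdas, fitFun, lossFun, nFolds = 5) {
	  folds <- sample(rep(1:nFolds, length.out = nrow(data)))
	  cvLoss <- sapply(lambdas, function(lambda) {
	    mean(sapply(1:nFolds, function(k) {
	      fit <- fitFun(data[folds != k, , drop = FALSE], lambda)
	      lossFun(fit, data[folds == k, , drop = FALSE])
	    }))
	  })
	  # pick the penalty value with the smallest average out-of-sample loss;
	  # no thresholding of the estimates is required
	  lambdas[which.min(cvLoss)]
	}
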
Unfortunately, all three procedures - implementing specialized optimizers, adapting the degrees of freedom computation, or implementing cross-validation by default - require changes to the code beyond what I can do at the moment. For instance, the optimizers in lessOptimizers use Armadillo instead of Eigen and therefore cannot be used directly in OpenMx. But they may provide a starting point for implementing specialized optimizers in OpenMx as well.
If there is anything I can help with, please let me know!
