Non-smoothness of mxPenalty functions #343

Open
jhorzek opened this issue Jun 3, 2022 · 9 comments

jhorzek commented Jun 3, 2022

Dear all,

I am very excited to see that OpenMx now also supports regularization of parameters with lasso and similar penalty functions! If I understand it correctly, OpenMx uses a smooth approximation of the non-differentiable penalty functions; the documentation states: "Smoothing is controlled by smoothProportion. If smoothProportion is zero then the traditional discontinuous functions are used. Otherwise, smoothProportion of the region between epsilon and zero is used for smoothing." I wanted to learn more about this procedure and looked into the code files. I may be totally wrong here as I don't have much experience with the C++ code underlying OpenMx, but to my understanding, an R translation of the lasso functions would look as follows (see penalty.cpp, lines 46 and following):

	# Returns 0 if |par| is below (1 - smoothProportion)*epsilon, 1 if |par| is
	# above epsilon, and a linear ramp in between:
	penaltyStrength <- function(absPar, epsilon, smoothProportion){
	  active = epsilon
	  width = active * smoothProportion
	  inactive = active - width
	  if (absPar > active) return (1)
	  if (absPar < inactive) return (0)
	  return ((absPar - inactive) / width)
	}
	
	# I separate the lasso penalty value and 
	# the lasso gradient value in two functions
	LassoPenalty <- function(parameters, lambda, scale,
	                         epsilon, smoothProportion)
	{
	  
	  tmp = 0
	  for (px in 1:length(parameters)) {
	    par = abs(parameters[px] / scale[px])
	    tmp = tmp + penaltyStrength(par, epsilon, smoothProportion) * par
	  }
	  return(tmp * lambda)
	}
	
	LassoGradient <- function(parameters,
	                          gradients, 
	                          lambda, 
	                          scale,
	                          epsilon, 
	                          smoothProportion)
	{
	  tmp = 0
	  for (px in 1:length(parameters)) {
	    par = abs(parameters[px] / scale[px])
	    gradients[px] = gradients[px] + 
	      penaltyStrength(par,
	                      epsilon, 
	                      smoothProportion) *
	      sign(parameters[px])*lambda
	  }
	  return(gradients)
	}

If I am not mistaken, the current implementation does result in a penalty function which has more non-differentiable points than the lasso - four instead of one:

	parValues <- seq(-1.5e-5,1.5e-5,length.out = 1000)
	epsilon <- 1e-5
	smoothProportion <- .05
	lambda <- 1
	
	penValue <- rep(NA, length(parValues))
	
	for(i in 1:length(parValues)){
	  penValue[i] <- LassoPenalty(parameters = parValues[i], 
	                              lambda = lambda, 
	                              scale = 1, 
	                              epsilon = epsilon, 
	                              smoothProportion = smoothProportion)
	}
	plot(parValues, penValue, type = "l")

Also, in the gradient function, it seems as if the contribution of the penalty term is set to zero close to the origin:

	gradValue <-  rep(NA, length(parValues))
	
	# for simplicity, we will use a parabola as gradient of
	# the log-Likelihood:
	
	likelihoodGradients <- 1e9*(parValues-.5e-5)^2
	
	for(i in 1:length(parValues)){
	  gradValue[i] <- LassoGradient(gradients = likelihoodGradients[i],
	                                parameters = parValues[i], 
	                                lambda = lambda, 
	                                scale = 1, 
	                                epsilon = epsilon, 
	                                smoothProportion = smoothProportion)  
	  
	}
	plot(parValues, gradValue, type = "l")
	lines(parValues, likelihoodGradients, col = "blue")

Note that at a parameter value of zero, a subgradient would typically be used; the subdifferential of the absolute value at zero is the interval [-1, 1]. If the gradient contribution of the penalty is instead set to zero, all that is left is the gradient of the log-likelihood. This gradient will be non-zero in most cases, so the optimizer may have difficulty finding the actual minimum.
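
To make this concrete, here is the subgradient condition at zero in a small R sketch (the helper name is just for illustration):

	# The subgradient of lambda*|theta| at theta = 0 is the interval [-lambda, lambda].
	# Hence theta = 0 is a stationary point of fit(theta) + lambda*|theta| whenever
	# the gradient of the unpenalized fit function satisfies |fitGradient| <= lambda.
	zeroIsStationary <- function(fitGradientAtZero, lambda) {
	  abs(fitGradientAtZero) <= lambda
	}
	
	# With the smoothed penalty above, penaltyStrength(0, ...) returns 0, so the
	# optimizer only sees the fit gradient at the origin and will step away from
	# zero whenever that gradient is non-zero, even if zeroIsStationary() is TRUE.
	zeroIsStationary(fitGradientAtZero = 0.5, lambda = 1) # TRUE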

An alternative could be to approximate the lasso penalty as proposed by Lee et al. (2006; see epsL1 on p. 405). A simple implementation could look something like this:

	// NOTE: For simplicity, the "scale" of the parameters is not taken into account in the following. 
	// eps refers to the parameter epsilon used by Lee et al. (2006; see epsL1 on p. 405) 
	
	void LassoPenalty::compute(int want, FitContext *fc)
	{
	  double lambda = getHP(fc, 0);
		if (want & FF_COMPUTE_FIT) {
	    double tmp = 0;
	    for (int px = 0; px < params.size(); ++px) {
	      double par = std::sqrt(std::pow(fc->est[ params[px] ],2) + eps);
	      // eps controls the smoothness. If eps > 0, the lasso penalty is always differentiable
	      tmp += par; 
	    }
	    matrix->data[0] = tmp * lambda;
	  }
	  if (want & FF_COMPUTE_GRADIENT) {
	    for (int px = 0; px < params.size(); ++px) {
	      double par = fc->est[ params[px] ];
	      fc->gradZ[ params[px] ] += lambda * (par/std::sqrt(std::pow(par,2) + eps));
	      // eps controls the smoothness. If eps > 0, the lasso penalty is always differentiable
	    }
	  }
	}

In R, this would look as follows:

	epsL1Lasso <- function(parameters, lambda, eps){
	  return(lambda*sum(sqrt((parameters)^2 + eps)))
	}
	
	epsL1LassoGradient <- function(parameters, 
	                               gradients, lambda, eps){
	  gradients <- gradients + lambda*
	    parameters*
	    (1/sqrt((parameters)^2 + eps))
	  return(gradients)
	}
	
	for(i in 1:length(parValues)){
	  penValue[i] <- epsL1Lasso(parameters = parValues[i], 
	                            lambda = lambda, 
	                            eps = 1e-4)
	  gradValue[i] <- epsL1LassoGradient(parameters = parValues[i], 
	                                     gradients = likelihoodGradients[i],
	                                     lambda = lambda, 
	                                     eps = 1e-10)
	}
	# the penalty values are now smooth
	plot(parValues, penValue, type = "l")
	# and the gradients as well:
	plot(parValues, gradValue, type = "l")
	lines(parValues, likelihoodGradients, col = "blue")

This implementation may come with other limitations, however. In my experience, a specialized optimizer for lasso penalty functions often returns better results.
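
For illustration, such specialized optimizers (e.g. proximal gradient or coordinate descent methods) exploit that the proximal operator of the lasso penalty has a closed form, the soft-thresholding function, which can set parameters to exactly zero without any smoothing. A minimal sketch (not OpenMx code):

	# Soft-thresholding: the proximal operator of lambda*|theta|.
	softThreshold <- function(theta, lambda) {
	  sign(theta) * pmax(abs(theta) - lambda, 0)
	}
	
	# One proximal-gradient (ISTA-style) step with step size s would be
	# theta <- softThreshold(theta - s * fitGradient(theta), s * lambda),
	# which yields exact zeros for sufficiently small parameters:
	softThreshold(c(-0.3, 0.05, 0.8), lambda = 0.1) # -0.2  0.0  0.7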

Best,
Jannik

Lee, S.-I., Lee, H., Abbeel, P., & Ng, A. Y. (2006). Efficient L1 Regularized Logistic Regression. Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06), 401–408.

jpritikin (Member) commented

That's a nice improvement. Can you make your proposal the default and rename the current lasso to "oldLasso"?


jhorzek commented Jun 4, 2022

Awesome! I will give it a try, but it may take some time as I am not familiar enough with the OpenMx code base yet and don't want to break anything. I'll keep you updated.

jpritikin (Member) commented

I'm happy to answer any questions or review the change.


jhorzek commented Jun 7, 2022

I've created some preliminary versions of the alternative penalty functions in a fork (https://github.com/jhorzek/OpenMx/tree/penaltyFunctions). Do you already have scripts to test all penalty functions which I could use to test my implementation?

jpritikin (Member) commented

See git grep -l mxPenalty. I find:

	inst/models/passing/Acemix2.R
	inst/models/passing/jointFactorWls.R
	inst/models/passing/regularize.R
	vignettes/reg_mimic.Rmd
	vignettes/regularization.Rmd

jpritikin (Member) commented

This is exactly the wrong approach:

-imxPenaltyTypes <- c('lasso', 'ridge', 'elasticNet')
+imxPenaltyTypes <- c('lasso', 'ridge', 'elasticNet',
+                     'smoothLasso', 'smoothRidge', 'smoothElasticNet')

We should use your new implementation by default. So move the current implementation to oldLasso, oldRidge, etc.
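
That is, roughly (extrapolating the "etc." to elasticNet):

	imxPenaltyTypes <- c('lasso', 'ridge', 'elasticNet',
	                     'oldLasso', 'oldRidge', 'oldElasticNet')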

The naming in the C++ code can stay the same though since that's not user visible.

Otherwise, looks good.

jpritikin (Member) commented

Also, add yourself as a contributor (ctb) in DESCRIPTION.in.


jhorzek commented Jun 8, 2022

I did not know about the git grep command - thank you for the info! I wanted to leave the current implementation of the penalty functions untouched until I am fairly certain that my implementation works. I'll change all names to the ones you suggested before creating a pull request, however.


jhorzek commented May 23, 2023

I am sorry for leaving this issue open for so long. When implementing regularized SEM with non-smooth penalty functions, I found that the combination of fit indices and smoothed penalties can be problematic.
Basically, what happens is that none of the parameters will be exactly zero due to the smoothing. Afterwards, information criteria are typically used to select the best model among those fitted with different penalty values; these criteria reward sparsity of the parameter vector, that is, they count the number of non-zero parameters. Because the smoothed penalty never yields exact zeros, it seems that the current implementation in OpenMx uses a threshold below which parameters are set to zero (this is implemented here). The degrees of freedom are then computed based on this threshold (see here).
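
Schematically, this post-processing looks something like the following (the function and argument names here are just for illustration, not the actual OpenMx code):

	# Zero out parameters whose absolute estimate falls below a threshold and
	# count the remaining non-zero parameters; that count enters the degrees of
	# freedom and hence the information criteria.
	thresholdEstimates <- function(estimates, zeroThreshold = 1e-4) {
	  zeroed <- abs(estimates) < zeroThreshold
	  estimates[zeroed] <- 0
	  list(estimates = estimates, nonZero = sum(!zeroed))
	}
	
	# Because the smoothed penalty never produces exact zeros, zeroThreshold
	# decides which parameters count as "selected" and hence which model the
	# information criteria favor.
	thresholdEstimates(c(0.31, 8e-5, -2e-5))
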
I've used the same procedure in my own implementations of regularized SEM, but found that the threshold parameter can have a substantial influence on the final parameter estimates. Therefore, I decided to switch to specialized optimizers that take the non-differentiability into account. I've implemented such optimizers in the lessOptimizers repository for the lessSEM package, which I've used for an in-depth comparison of smoothed and non-smoothed penalty functions (see here).
An alternative procedure is to adapt the computation of the degrees of freedom. Such procedures are, for instance, used by Geminiani et al. (2021). Alternatively, cross-validation could be used instead of information criteria. Here, no rounding is required because we don't reward the sparsity of the parameter vector.
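
A rough skeleton of the cross-validation idea (fitFun and lossFun are user-supplied placeholders, not OpenMx functions):

	# k-fold cross-validation over a grid of penalty values. fitFun(trainData, lambda)
	# fits the regularized model; lossFun(fit, testData) returns an out-of-sample loss
	# such as the negative log-likelihood of the held-out data.
	selectLambdaByCV <- function(data, lambdas, fitFun, lossFun, nFolds = 5) {
	  folds <- sample(rep(1:nFolds, length.out = nrow(data)))
	  cvLoss <- sapply(lambdas, function(lambda) {
	    mean(sapply(1:nFolds, function(k) {
	      fit <- fitFun(data[folds != k, , drop = FALSE], lambda)
	      lossFun(fit, data[folds == k, , drop = FALSE])
	    }))
	  })
	  # pick the penalty value with the smallest average out-of-sample loss;
	  # no thresholding of the estimates is required
	  lambdas[which.min(cvLoss)]
	}
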
Unfortunately, all three procedures - implementing specialized optimizers, adapting the degrees of freedom computation, or implementing cross-validation by default - require changes to the code beyond what I can do at the moment. For instance, the optimizers in lessOptimizers use Armadillo instead of Eigen and therefore cannot be used directly in OpenMx. But they may provide a starting point for implementing specialized optimizers in OpenMx as well.
If there is anything I can help with, please let me know!
