Numerical Optimization

In A Few Useful Things to Know about Machine Learning, Pedro Domingos put forward the relation: $\color{aqua}{LEARNING}$ = $\color{green}{REPRESENTATION}$ + $\color{blue}{EVALUATION}$ + $\color{red}{OPTIMIZATION}$.

  • Representation, as the core of this note, is the general (mathematical) model that a computer can handle.
  • Evaluation is criteria. An evaluation function (also called objective function, cost function or scoring function) is needed to distinguish good classifiers from bad ones.
  • Optimization is aimed to find the parameters that optimize the evaluation function, i.e. $$ \arg\min_{\theta\in \Theta} f(\theta)=\{\theta^{\ast}\mid f(\theta^{\ast})=\min f(\theta)\},\quad\text{or}\quad \arg\max_{\theta\in \Theta}f(\theta)=\{\theta^{\ast}\mid f(\theta^{\ast})=\max f(\theta)\}. $$

The objective function to be minimized is also called cost function.

Evaluation is always coupled with optimization; an evaluation that cannot be optimized is not a good evaluation in machine learning.

The proof of convergence or complexity is often based on the convex cases where the objective function as well as the constrained set is convex, i.e., $$t x+(1-t)y\in\Theta,\ f(t x+(1-t)y)\leq t f(x)+(1-t)f(y),\ \quad t\in [0,1], \quad \forall x, y\in\Theta.$$

And this kind of optimization is called convex optimization. By the way, the name numerical optimization means that the variables or parameters to estimate are numeric, which is conceptually quite different from performance optimization.

The first-order optimality condition is a necessary condition for unconstrained optimization problems if the objective function is differentiable: if $\nabla f$ exists and $x^{\ast}\in\arg\min_{x} f(x)$, then $\nabla f(x^{\ast})=0$.


Wotao Yin wrote a summary on First-order methods and operator splitting for optimization:

First-order methods are described and analyzed with gradients or subgradients, while second-order methods use second-order derivatives or their approximations.

During the 70s–90s the majority of the optimization community focused on second-order methods since they are more efficient for those problems that have the sizes of interest at that time. Beginning around fifteen years ago, however, the demand to solve ever larger problems started growing very quickly. Many large problems are further complicated by non-differentiable functions and constraints. Because simple first-order and classic second-order methods are ineffective or infeasible for these problems, operator splitting methods regained their popularity.

Operators are used to develop algorithms and analyze them for a wide spectrum of problems including optimization problems, variational inequalities, and differential equations. Operator splitting is a set of ideas that generate algorithms through decomposing a problem that is too difficult as a whole into two or more smaller and simpler subproblems. During the decomposition, complicated structures like non-differentiable functions, constraint sets, and distributed problem structures end up in different subproblems and thus can be handled elegantly. We believe ideas from operator splitting provide the most effective means to handle such complicated structures for computational problem sizes of modern interest.

Gradient Descent and More

Each iteration of a line search method computes a search direction $p^{k}$ and then decides how far to move along that direction. The iteration is given by

$$ x^{k+1}=x^{k}+\alpha_{k}p^{k}\tag{Line search} $$

where the positive scalar $\alpha_{k}$ is called the step length. The success of a line search method depends on effective choices of both the direction $p^{k}$ and the step length $\alpha_{k}$.

$\color{lime}{Note}$: we use the notation $x^{k}$ and $\alpha_k$ to represent the $k$th iterate of the vector variable $x$ and the $k$th step length, respectively. Most line search algorithms require $p^k$ to be a descent direction — one for which $\left< {p^k},\nabla f_k \right> < 0$ — because this property guarantees that the function $f$ can be reduced along this direction, where $\nabla f_k$ is the gradient of the objective function $f$ at the $k$th iterate $x^k$, i.e. $\nabla f_k=\nabla f(x^{k})$.


Gradient descent and its variants are to find the local solution of the unconstrained optimization problem:

$$ \min_{x} f(x) $$

where $x\in \mathbb{R}^{n}$.

It is not difficult to observe that $$f(x) \approx f(x^k)+(x - x^k)^{T}\nabla f(x^{k}) + \frac{s_k}{2}{|x-x^k|}_2^2$$ by Taylor expansion of $f$ near the point $x^{k}$ for some constant $s_k$.

Let $x^{k+1}=\arg\min_{x} f(x^k) + (x - x^k)^{T}\nabla f(x^{k}) + \frac{s_k}{2}{|x-x^k|}_2^2$; we will obtain $x^{k+1} = x^{k}-\frac{1}{s_k}{\nabla}_{x} f(x^k)$. However, the constant $s_k$ is difficult to estimate.

And the general gradient descent methods are given by

$$ x^{k+1}=x^{k}-\alpha_{k}{\nabla}_{x} f(x^k) $$

where $x^{k}$ is the $k$th iterative result, $\alpha_{k}\in\{\alpha\mid f(x^{k+1})< f(x^{k})\}$ and particularly $\alpha_{k}=\arg\min_{\alpha}f(x^{k}-\alpha\nabla_{x}f(x^{k}))$ so that $f(x^{k+1})=\min_{\alpha} f(x^k - \alpha\nabla_x f(x^k))$.

$$ x^{k+1}=\fbox{$x^{k}$}-\alpha_{k}{\nabla}_{x} f(x^k) = \fbox{$x^{k-1}-\alpha_{k-1}\nabla f(x^{k-1})$}-\alpha_{k}{\nabla}_{x} f(x^k)\\ = x^1-\sum_{n=1}^{k}\alpha_n{\nabla}_{x} f(x^n) $$

There are many ways to choose some proper step or learning rate sequence ${\alpha_k}$.


The first-order Taylor approximation of $f(x + v)$ around ${x}$ is $$f(x+v)\approx \hat{f}(x+v)=f(x)+\nabla_x f(x)^T v.$$

The second term on the righthand side, $\nabla f(x)^T v$, is the directional derivative of ${f}$ at ${x}$ in the direction ${v}$. It gives the approximate change in ${f}$ for a small step ${v}$. The step ${v}$ is a descent direction if the directional derivative is negative. Since the directional derivative $\nabla_x f(x)^T v$ is linear in ${v}$, it can be made as negative as we like by taking ${v}$ large (provided ${v}$ is a descent direction, i.e., $\nabla_x f(x)^T v< 0$). To make the question sensible we have to limit the size of ${v}$, or normalize by the length of ${v}$.

We define a normalized steepest descent direction (with respect to the norm $|\cdot |$ in $\mathbb{R}^n$) as

$$\Delta x_{nsd}=\arg\min_{v}\{\nabla f(x)^T v\mid \|v\|=1\}.$$

It is also convenient to consider a steepest descent step $\Delta x_{sd}$ that is unnormalized, by scaling the normalized steepest descent direction in a particular way:

$$\Delta x_{sd}={\|\nabla f(x)\|}_{\ast}\Delta x_{nsd}$$

where ${| \cdot |}_{\ast}$ denotes the dual norm.

(We say ‘a’ steepest descent direction because there can be multiple minimizers.)


Algorithm Steepest descent method

  • given a starting point $x \in domf$.
  • repeat
    1. Compute steepest descent direction $\Delta x_{sd}$.
    2. Line search. Choose ${t}$ via backtracking or exact line search.
    3. Update. $x := x + t \Delta x_{sd}$.
  • until stopping criterion is satisfied.
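
To make the line search template concrete, below is a minimal NumPy sketch of steepest descent for the Euclidean norm (where the steepest descent direction is simply the negative gradient) with Armijo backtracking; the quadratic test function and the backtracking constants are arbitrary choices for this illustration, not part of the algorithm statement above.

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha=0.3, beta=0.8, tol=1e-8, max_iter=500):
    """Steepest descent (Euclidean norm) with Armijo backtracking line search: a sketch."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # stopping criterion
            break
        t = 1.0
        # backtrack until sufficient decrease: f(x - t g) <= f(x) - alpha * t * ||g||^2
        while f(x - t * g) > f(x) - alpha * t * g.dot(g):
            t *= beta
        x = x - t * g                        # step along the negative gradient
    return x

# toy example: minimize the convex quadratic f(x) = 1/2 x^T A x - b^T x
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
print(gradient_descent(f, grad, x0=np.zeros(2)))   # should approach A^{-1} b
```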

If the variable ${x}$ is restricted to some bounded domain, i.e., $x\in D$, the steepest gradient descent method can be modified to the conditional gradient descent method, or Frank-Wolfe algorithm.


Algorithm Frank–Wolfe algorithm

  • given a starting point $x \in domf$.
  • repeat
    1. Find $s^k$: $s^k =\arg\min_{v}\{\nabla f(x^{k})^T v\mid v\in D \}$.
    2. Set $t=\frac{2}{k+2}$ or $t=\arg\min_{t}\{f(x^k+t(s^k - x^k))\mid t\in [0, 1]\}$
    3. Update. $x^{k+1}= x^k + t(s^k-x^k), k\leftarrow k+1$.
  • until stopping criterion is satisfied.
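
For intuition, here is a small sketch of the Frank–Wolfe iteration for the case where the feasible set $D$ is the probability simplex, so the linear subproblem is solved by picking the vertex with the smallest gradient coordinate; the toy objective is invented for the example.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, iters=200):
    """Frank-Wolfe on the probability simplex {x : x >= 0, sum(x) = 1}: a sketch."""
    x = np.asarray(x0, dtype=float)
    for k in range(iters):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0          # vertex minimizing <grad f(x^k), v> over the simplex
        t = 2.0 / (k + 2.0)            # the classical step-size rule t = 2/(k+2)
        x = x + t * (s - x)            # convex combination keeps x inside the simplex
    return x

# toy example: minimize ||x - c||^2 over the simplex
c = np.array([0.1, 0.5, 0.4, -0.2])
grad = lambda x: 2.0 * (x - c)
print(frank_wolfe_simplex(grad, np.ones(4) / 4.0))
```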


Some variants of gradient descent methods are not line search methods. For example, the heavy ball methods or momentum methods:

$$ x^{k+1}=x^{k}-\alpha_{k}\nabla_{x}f(x^k)+\rho_{k}(x^k-x^{k-1}) $$

where the momentum coefficient $\rho_k\in[0,1]$ generally and the step length $\alpha_k$ cannot be determined by line search. We can unfold the recursion defining momentum gradient descent and write: $$ x^{k+1}=x^{k}-\alpha_{k}\nabla_{x}f(x^k)+\rho_{k}\underbrace{(x^k-x^{k-1})}_{\text{denoted as }\Delta_{k}},\\ \Delta_{k+1}= -\alpha_k\nabla_{x}f(x^k)+ \rho_{k}\Delta_{k},\\ x^{k+1}=x^{k}-\alpha_{k}\nabla_{x}f(x^k)+\sum_{i=1}^{k-1}\Big(- \prod_{j=0}^{i}\rho_{k-j}\Big)\alpha_{k-i-1}\nabla f(\underbrace{\color{red}{x}^{k-i}}_{\triangle})+\rho_1\Delta_1. $$

Nesterov accelerated gradient method at the $k$th step is given by:

$$ x^{k}=y^{k}-\alpha_{k+1}\nabla_{x}f(y^k) \qquad \text{Descent} \\ y^{k+1}=x^{k}+\rho_{k}(x^{k}-x^{k-1}) \qquad \text{Momentum} $$

where the momentum coefficient $\rho_k\in[0,1]$ generally.

$$ x^{k}=y^{k}-\alpha_{k+1}\nabla_{x}f(y^k) \\ y^{k}=x^{k-1}+\rho_{k-1}(x^{k-1}-x^{k-2})\\ x^{k+1}=x^{k}-\alpha_{k}\nabla_{x}f(y^k)+\sum_{i=1}^{k-1}\Big(-\prod_{j=0}^{i}\rho_{k-j}\Big)\alpha_{k-i-1}\nabla f(\underbrace{\color{green}{y}^{k-i}}_{\triangle})+\rho_1\Delta_1. $$

They are called inertial gradient methods or accelerated gradient methods, and they come in several different forms.
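
The following is a minimal NumPy sketch of the heavy-ball and Nesterov iterations written in the forms above; the quadratic test problem, the fixed step length and the fixed momentum coefficient are assumptions made for the illustration (in practice these are tuned or set from smoothness constants).

```python
import numpy as np

def heavy_ball(grad, x0, alpha=0.05, rho=0.9, iters=300):
    """Heavy-ball / momentum gradient descent: a sketch."""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x_new = x - alpha * grad(x) + rho * (x - x_prev)   # descent step plus momentum term
        x_prev, x = x, x_new
    return x

def nesterov(grad, x0, alpha=0.05, rho=0.9, iters=300):
    """Nesterov accelerated gradient: the gradient is taken at the look-ahead point y (a sketch)."""
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        y = x + rho * (x - x_prev)           # momentum (extrapolation) step
        x_prev, x = x, y - alpha * grad(y)   # descent step at the extrapolated point
    return x

# toy example: an ill-conditioned quadratic f(x) = 1/2 x^T A x
A = np.diag([1.0, 25.0])
grad = lambda x: A @ x
print(heavy_ball(grad, np.array([5.0, 5.0])))
print(nesterov(grad, np.array([5.0, 5.0])))
```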

Inventor of Nesterov accelerated Gradient

Conjugate Gradient Methods

It comes from methods for solving symmetric positive definite linear systems $Ax=b, A=A^T\succ 0 \in\mathbb{R}^{n\times n}$. $x^{\ast}=A^{-1}b$ is the solution, and $x^{\ast}$ minimizes the (convex) function $f(x)=\frac{1}{2}x^T Ax - b^T x$; thus $\nabla f(x) = Ax - b$ is the gradient of $f$. $r = b - Ax=-\nabla f(x) = -A(x-x^{\ast})$ is called the residual at $x$.

The idea is to make the search directions $A$-orthogonal (conjugate) instead of orthogonal. Two vectors $u$ and $v$ are $A$-orthogonal, or conjugate, if $$u^TAv=0.$$

Krylov subspace $\mathcal K_k=\operatorname{span}\{b, Ab, \cdots, A^{k-1}b\}$ and we define the Krylov sequence $$ x^{k} =\arg\min_{x\in\mathcal K_k} f(x)=\arg\min_{x\in\mathcal K_k}{\|x-x^{\ast}\|}_A^2$$

the CG algorithm (among others) generates the Krylov sequence.

  • $f\left(x^{(k+1)}\right) \leq f\left(x^{(k)}\right)$ but $|r|$ can increase.
  • $x^{(n)}=x^{\star}$ (i.e., $x^{\star} \in \mathcal{K}_{n}$) even when $\mathcal{K}_{n} \neq \mathbf{R}^{n}$
  • $x^{(k)}=p_{k}(A) b,$ where $p_{k}$ is a polynomial with $\operatorname{deg} p_{k}<k$.

There is a two-term recurrence $$\fbox{$x^{k+1} = x^k + \alpha_k (x^k-x^{\ast}) + \beta_k (x^{k}-x^{k-1})$}$$

$$ \begin{array}{l} x :=0,\ r :=b,\ p := r,\ \rho_{0} :={\|r\|}^2 \\ \text {for } k=1, \ldots, N_{\max }: \\ \quad \text{quit if } \sqrt{\rho_{k-1}} \leq \epsilon {\|b\|}_{2} \\ \quad w := Ap \\ \quad \alpha :=\frac{\rho_{k-1}}{w^{T} p} \\ \quad x := x+\alpha p \\ \quad r := r-\alpha w \\ \quad \rho_k := {\|r\|}^2 \\ \quad p := r+\frac{\rho_{k}}{\rho_{k-1}} p \end{array} $$

Preconditioned conjugate gradient

with preconditioner $M \approx A^{-1}$ (hopefully) $$ \begin{array}{l} x :=x_0,\ r :=b-A x_{0},\ p :=z :=M r,\ \rho_{1} :=r^{T} z \\ \text {for } k=1, \ldots, N_{\max }: \\ \quad \text{quit if } \sqrt{\rho_{k}} \leq \epsilon {\|b\|}_{2} \ \text{ or } \ \|r\| \leq \epsilon{\|b\|}_{2} \\ \quad w := Ap \\ \quad \alpha :=\frac{\rho_{k}}{w^{T} p} \\ \quad x := x+\alpha p \\ \quad r := r-\alpha w \\ \quad z := M r \\ \quad \rho_{k+1} := z^{T} r \\ \quad p := z+\frac{\rho_{k+1}}{\rho_{k}} p \end{array} $$
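
A compact NumPy sketch of the plain (unpreconditioned) conjugate gradient recursion for an SPD system $Ax=b$ is given below; the small test matrix is an arbitrary example.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Plain conjugate gradient for SPD systems Ax = b: a sketch of the recursion above."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                 # residual r = b - Ax = -grad f(x)
    p = r.copy()                  # first search direction
    rho = r @ r
    max_iter = max_iter or n
    for _ in range(max_iter):
        if np.sqrt(rho) <= tol * np.linalg.norm(b):
            break
        w = A @ p
        alpha = rho / (w @ p)     # exact line search along p
        x = x + alpha * p
        r = r - alpha * w
        rho_new = r @ r
        p = r + (rho_new / rho) * p   # next A-conjugate direction
        rho = rho_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))      # compare with np.linalg.solve(A, b)
```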

Through the line search, gradient descent type methods reduce a multivariate optimization to a sequence of univariate optimizations.

Here is an outline of the nonlinear CG method:

  • $d^{0}=r^{0}=-f^{\prime}(x^0)$
  • Find $\alpha^{i}$ that minimizes $f(x^{i}+\alpha^{i} d^{i})$,
  • $x^{i+1}=x^{i} + \alpha_{i} d^{i}$
  • $r^{i+1}=-f^{\prime}(x^{i+1})$
  • $\beta_{i+1}=\frac{\left<r^{i+1}, r^{i+1}\right>}{\left<r^{i}, r^{i}\right>}$ or $\beta_{i+1}=\max\{\frac{\left<r^{i+1}, r^{i+1}-r^{i}\right>}{\left<r^{i}, r^{i}\right>}, 0\}$
  • $d^{i+1}=r^{i+1}+\beta_{i+1}d^{i}$

Variable Metric Methods

Newton's Methods

NEWTON’S METHOD and QUASI-NEWTON METHODS are classified as variable metric methods.

They also aim to find the solution of unconstrained optimization problems, i.e. $$\min f(x)$$ where $x\in \mathbb{R}^{n}$. Specially, if the objective function $f(x)$ is convex, then a local minimum is a global minimum.


If ${x^{\star}}$ is an extremum of the cost function $f(x)$, it is necessary that $\nabla f(x^{\star}) = 0$. So if we can find all the solutions of the equation system $\nabla f(x) = 0$, it helps us to find the solution to the optimization problem $\arg\min_{x\in\mathbb{R}^n} f(x)$.

Newton's method is one of the fixed-point methods to solve nonlinear equation system.

It is given by $$ x^{k+1}=\arg\min_{x}f(x^k)+(x-x^k)^T \nabla_x f(x^k)+\frac{1}{2\alpha_{k+1}}(x-x^k)^T H(x^k)(x-x^k) \\ =x^{k}-\alpha_{k+1}H^{-1}(x^{k})\nabla_{x}f(x^{k}) $$ where $H^{-1}(x^{k})$ is the inverse of the Hessian matrix of the function $f(x)$ at the point $x^{k}$. It is called the Newton–Raphson algorithm in statistics. Especially when the log-likelihood function $\ell(\theta)$ is well-behaved, a natural candidate for finding the MLE is the Newton–Raphson algorithm, which has a quadratic convergence rate.

$$ x^{k+1}=x^{k}-\alpha_{k+1}H^{-1}(x^{k})\nabla_{x}f(x^{k})\\ = x^{k-1}-\alpha_{k}H^{-1}(x^{k-1})\nabla_{x}f(x^{k-1})-\alpha_{k+1}H^{-1}(x^{k})\nabla_{x}f(x^{k}) \\ =x^{1}-\sum_{n=1}^{k} \alpha_{n+1}H^{-1}(x^{n})\nabla_{x}f(x^{n}) $$
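
As an illustration of the update $x^{k+1}=x^{k}-\alpha_{k+1}H^{-1}(x^{k})\nabla f(x^{k})$, here is a minimal NumPy sketch of a damped Newton iteration; the toy objective and the fixed step length are assumptions for the example, and the linear system is solved instead of forming the inverse Hessian explicitly.

```python
import numpy as np

def newton(grad, hess, x0, alpha=1.0, tol=1e-10, max_iter=50):
    """Damped Newton's method: x <- x - alpha * H(x)^{-1} grad f(x)  (a sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        step = np.linalg.solve(hess(x), g)   # solve H(x) * step = grad f(x)
        x = x - alpha * step
    return x

# toy example: f(x) = x1^4 + x2^2, minimized at the origin
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
print(newton(grad, hess, x0=np.array([1.5, -2.0])))
```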

The Fisher Scoring Algorithm

In maximum likelihood estimation, the objective function is the log-likelihood function, i.e. $$ \ell(\theta)=\sum_{i=1}^{n}\log{P(x_i|\theta)} $$ where $P(x_i|\theta)$ is the probability of the realization $X_i=x_i$ with the unknown parameter $\theta$. However, when the sample random variables $\{X_i\}_{i=1}^{n}$ are not yet observed or realized, it is natural to replace the negative Hessian matrix of the log-likelihood function (i.e. $-\frac{\partial^2\ell(\theta)}{\partial\theta\partial\theta^{T}}$, the observed information) with its expectation, the Fisher (expected) information matrix: $$ J(\theta)=\mathbb{E}\Big({\color{red}{-}}\frac{\partial^2\ell(\theta)}{\partial\theta\partial\theta^{T}}\Big) =\color{red}{-}\int\frac{\partial^2\ell(\theta)}{\partial\theta\partial\theta^{T}}f(x_1, \cdots, x_n|\theta)\mathrm{d}x_1\cdots\mathrm{d}x_n $$ where $f(x_1, \cdots, x_n|\theta)$ is the joint probability density function of $X_1, \cdots, X_n$ with unknown parameter $\theta$.

And the Fisher scoring algorithm is given by $$ \theta^{k+1}=\theta^{k}+\alpha_{k}J^{-1}(\theta^{k})\nabla_{\theta} \ell(\theta^{k}) $$ where $J^{-1}(\theta^{k})$ is the inverse of the Fisher information matrix at the point $\theta^{k}$.

See http://www.stats.ox.ac.uk/~steffen/teaching/bs2HT9/scoring.pdf or https://wiseodd.github.io/techblog/2018/03/11/fisher-information/.

Fisher scoring algorithm is regarded as an example of Natural Gradient Descent in information geometry as shown in https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/ and https://www.zhihu.com/question/266846405.

Quasi-Newton Methods

Quasi-Newton methods, like steepest descent, require only the gradient of the objective function to be supplied at each iterate. By measuring the changes in gradients, they construct a model of the objective function that is good enough to produce super-linear convergence. The improvement over steepest descent is dramatic, especially on difficult problems. Moreover, since second derivatives are not required, quasi-Newton methods are sometimes more efficient than Newton’s method.[^11]

In optimization, quasi-Newton methods (a special case of variable-metric methods) are algorithms for finding local maxima and minima of functions. Quasi-Newton methods are based on Newton's method to find the stationary point of a function, where the gradient is 0. In quasi-Newton methods the Hessian matrix does not need to be computed. The Hessian is updated by analyzing successive gradient vectors instead. Quasi-Newton methods are a generalization of the secant method to find the root of the first derivative for multidimensional problems. In multiple dimensions the secant equation is under-determined, and quasi-Newton methods differ in how they constrain the solution, typically by adding a simple low-rank update to the current estimate of the Hessian. One of the chief advantages of quasi-Newton methods over Newton's method is that the Hessian matrix (or, in the case of quasi-Newton methods, its approximation) $\mathbf{B}$ does not need to be inverted. The Hessian approximation $\mathbf{B}$ is chosen to satisfy $$ \nabla f(x^{k+1})=\nabla f(x^{k})+B(x^{k+1}-x^{k}), $$ which is called the $\fbox{secant equation}$ (the Taylor series of the gradient itself). In more than one dimension B is underdetermined. In one dimension, solving for B and applying the Newton's step with the updated value is equivalent to the secant method. The various quasi-Newton methods differ in their choice of the solution to the secant equation (in one dimension, all the variants are equivalent).

The unknown $x_{k}$ is updated applying the Newton's step calculated using the current approximate Hessian matrix $B_{k}$:

  1. ${\displaystyle \Delta x_{k}=-\alpha_{k} B_{k}^{-1}\nabla f(x_{k})}$, with $\alpha$ chosen to satisfy the Wolfe conditions;
  2. $x_{k+1}=x_{k}+\Delta x_{k}$;
  3. The gradient computed at the new point $\nabla f(x_{k+1})$, and $y_{k}=\nabla f(x_{k+1})-\nabla f(x_{k})$ is used to update the approximate Hessian $B_{k+1}$, or directly its inverse $H_{k+1} = B_{k+1}^{-1}$ using the Sherman–Morrison formula.

A key property of the BFGS and DFP updates is that if $B_{k}$ is positive-definite and ${\alpha}_{k}$ is chosen to satisfy the Wolfe conditions, then $B_{k+1}$ is also positive-definite.

For example,

| Method | $\displaystyle B_{k+1}=$ | $H_{k+1}=B_{k+1}^{-1}=$ |
| --- | --- | --- |
| DFP | $(I-\frac{y_k \Delta x_k^{\mathrm{T}}}{y_k^{\mathrm{T}} \Delta x_k}) B_k (I-\frac{ \Delta x_k y_k^{\mathrm{T}}}{y_k^{\mathrm{T}} \Delta x_k}) + \frac{y_k y_k^{\mathrm{T}}}{y_k^{\mathrm{T}}\Delta x_k}$ | $H_k +\frac{\Delta x_k \Delta x_k^T}{\Delta x_k^T y_k} -\frac{H_k y_ky_k^T H_k}{y_k^T H_k y_k}$ |
| BFGS | $B_k + \frac{y_k y_k^{\mathrm{T}}}{y_k^{\mathrm{T}}\Delta x_k} - \frac{B_k\Delta x_k(B_k\Delta x_k)^T}{\Delta x_k^T B_k \Delta x_k}$ | $(I-\frac{\Delta x_k y_k^{\mathrm{T}}}{ y_k^{\mathrm{T}} \Delta x_k}) H_k (I-\frac{y_k \Delta x_k^{\mathrm{T}}}{ y_k^{\mathrm{T}} \Delta x_k}) + \frac{\Delta x_k \Delta x_k^T}{y_k^T \Delta x_k}$ |
| SR1 | $B_{k} + \frac{(y_{k} - B_{k}\Delta x_{k} )(y_{k} - B_{k}\Delta x_{k})^{\mathrm{T}} }{(y_{k} - B_{k}\Delta x_{k})^{\mathrm{T}}\Delta x_{k}}$ | $H_{k} + \frac{(\Delta x_{k}-H_{k}y_{k}) (\Delta x_{k} -H_{k} y_{k})^{\mathrm{T}} }{(\Delta x_{k}-H_{k}y_{k})^{\mathrm {T} }y_{k}}$ |
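
Below is a hedged NumPy sketch of BFGS that uses the inverse-Hessian update $H_{k+1}$ from the table; for brevity a simple Armijo backtracking rule stands in for a full Wolfe line search, and the Rosenbrock test function is just an illustrative choice.

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    """BFGS with the inverse-Hessian (H) update from the table above: a sketch."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                       # initial inverse-Hessian approximation
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                      # quasi-Newton direction
        t = 1.0
        while f(x + t * p) > f(x) + 1e-4 * t * (g @ p):   # Armijo backtracking
            t *= 0.5
        s = t * p                        # s_k = x_{k+1} - x_k
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g                    # y_k = grad f(x_{k+1}) - grad f(x_k)
        sy = s @ y
        if sy > 1e-12:                   # curvature condition; skip the update otherwise
            I = np.eye(n)
            V = I - np.outer(s, y) / sy
            H = V @ H @ V.T + np.outer(s, s) / sy   # BFGS inverse-Hessian update
        x, g = x_new, g_new
    return x

# toy example: Rosenbrock function
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
                           200*(x[1] - x[0]**2)])
print(bfgs(f, grad, np.array([-1.2, 1.0])))   # should approach (1, 1)
```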

The Barzilai-Borwein method

Consider the gradient iteration form $$ x^{k+1}=x^{k}-\alpha_k \nabla f(x^k) $$

which can be written as $$ x^{k+1}=x^{k}- D_k \nabla f(x^k) $$ where $D_k = \alpha_k I$. In order to make the matrix $D_k$ have quasi-Newton property, we compute $\alpha_k$ such that $$ \min |s_{k-1}-D_k y_{k-1}| $$ which yields

$$ \alpha_k =\frac{\left<s_{k-1},y_{k-1}\right>}{\left<y_{k-1},y_{k-1}\right>}\tag 1 $$

where $s_{k-1}= x^{k}-x^{k-1}, y_{k-1}=\nabla f(x^k)-\nabla f(x^{k-1})$.

By symmetry, we may minimize $|D_k^{-1}s_{k-1}- y_{k-1}|$ with respect to $\alpha_k$ and get

$$ \alpha_k = \frac{\left<s_{k-1}, s_{k-1}\right>}{\left<s_{k-1}, y_{k-1}\right>}.\tag 2 $$

In short, the iteration formula of Barzilai-Borwein method is given by

$$ x^{k+1}=x^{k}-\alpha_k \nabla f(x^k) $$ where $\alpha_k$ is determined by (1) or (2).

It is easy to see that in this method no matrix computations and no line searches (except $k = 0$) are required.
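
A minimal sketch of the Barzilai-Borwein iteration with step size (1) follows; the diagonal quadratic test problem and the initial step $\alpha_0$ are arbitrary choices for the example.

```python
import numpy as np

def barzilai_borwein(grad, x0, alpha0=1e-3, iters=200):
    """Gradient descent with the Barzilai-Borwein step size (1): <s,y>/<y,y>  (a sketch)."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    x = x_prev - alpha0 * g_prev          # one ordinary gradient step to start (k = 0)
    for _ in range(iters):
        g = grad(x)
        s = x - x_prev                    # s_{k-1}
        y = g - g_prev                    # y_{k-1}
        alpha = (s @ y) / (y @ y)         # BB step (1); use (s @ s) / (s @ y) for variant (2)
        x_prev, g_prev = x, g
        x = x - alpha * g
    return x

A = np.diag([1.0, 10.0, 100.0])
grad = lambda x: A @ x                    # quadratic f(x) = 1/2 x^T A x
print(barzilai_borwein(grad, np.array([1.0, 1.0, 1.0])))
```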

L-BFGS

The BFGS quasi-Newton approximation has the benefit of not requiring us to be able to analytically compute the Hessian of a function. However, we still must maintain a history of the $s_n$ and $y_n$ vectors for each iteration.

The L-BFGS algorithm builds and refines a quadratic model of the function being optimized. The algorithm stores the last $M$ step/gradient pairs and uses them to build a positive definite Hessian approximation. This approximate Hessian matrix is used to make a quasi-Newton step. If the quasi-Newton step does not lead to sufficient decrease of the value/gradient, we make a line search along the direction of this step.

An essential feature of the algorithm is the positive definiteness of the approximate Hessian. Independently of the function's curvature (positive or negative) we will always get an SPD matrix, and the quasi-Newton direction will always be a descent direction.

Another essential property is that only the last $M$ step/gradient pairs are used, where $M$ is a moderate number smaller than the problem size $N$, often as small as 3-10. This gives us very cheap iterations, which cost just $O(N\cdot M)$ operations.
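
The core of an L-BFGS implementation is the two-loop recursion that applies the implicit inverse-Hessian approximation to a gradient in $O(N\cdot M)$ operations; here is a hedged sketch of that recursion (the scaling of the initial Hessian by $\frac{s^Ty}{y^Ty}$ is one common choice, not the only one).

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Two-loop recursion: multiply the L-BFGS inverse-Hessian approximation by g (a sketch).
    s_list, y_list hold the last M pairs s_i = x_{i+1} - x_i and y_i = g_{i+1} - g_i."""
    q = np.asarray(g, dtype=float).copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # first loop: newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append(a)
    if s_list:                                             # initial Hessian scaling gamma * I
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):  # second loop: oldest to newest
        rho = 1.0 / (y @ s)
        b = rho * (y @ q)
        q += (a - b) * s
    return q          # approximately H_k @ g; the quasi-Newton search direction is -q
```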

Gauss–Newton Method

Gauss-Newton applies to nonlinear least squares, a form of regression where the objective function is a sum of squares of nonlinear functions: $$f(x)=\frac{1}{2}\sum_{j=1}^{m}(r_j(x))^2=\frac{1}{2}{\|r(x)\|}^2$$ where the $j$-th component of the $m$-vector $r(x)$ is the residual $r_j(x)=\phi(x;t_j)-y_j$.

The Gauss-Newton Method generalizes Newton’s method for multiple dimensions with approximate Hessian matrix $\nabla^2 f_k \approx J_k^T J_k$.

The basic steps of one Gauss-Newton iteration are:

  • Make an initial guess $x^0$ for $x$ and set $k = 0$;
  • Create the residual vector $r^k$ with elements $r_i(x^k)$;
  • Create the Jacobian matrix $J_k$ of the residuals at $x^k$;
  • Solve the normal equations $J^T_k J_k p_k = -J^T_k r_k$ for the step direction $p_k$;
  • Find a step length $s$ such that $x^k + s p_k$ satisfies the Wolfe conditions;
  • Set $x^{k+1} = x^k + s p_k$ and $k\leftarrow k+1$;
  • Repeat until convergence.
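
A minimal sketch of the Gauss-Newton step, solving the normal equations $J^T_k J_k p_k = -J^T_k r_k$ at each iteration, is shown below; for brevity a full step $s=1$ replaces the Wolfe line search, and the exponential-fitting toy problem is invented for the example.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=20):
    """Gauss-Newton for nonlinear least squares (a sketch; unit step instead of a line search)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residual(x)
        J = jacobian(x)
        p = np.linalg.solve(J.T @ J, -J.T @ r)   # normal equations for the GN step
        x = x + p
    return x

# toy example: fit y = exp(a t) to data with unknown parameter x = [a]
t = np.linspace(0.0, 1.0, 20)
y = np.exp(1.3 * t)
residual = lambda x: np.exp(x[0] * t) - y
jacobian = lambda x: (t * np.exp(x[0] * t)).reshape(-1, 1)
print(gauss_newton(residual, jacobian, np.array([0.5])))   # should approach 1.3
```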

Natural Gradient Descent

The generalized natural gradient is the direction of steepest ascent in Riemannian space, which is invariant to parametrization, and is defined: $$\nabla S\propto\lim_{\epsilon\to 0} \arg\max_{d:B(P_{\theta}, P_{\theta+d})=\epsilon} S(\theta+d, y)$$

where $B(P_{\theta}, P_{\theta+d})$ is the Bregman divergence of the probabilities $P_{\theta}, P_{\theta + d}$ and $S$ is a loss function on the probability $P_{\theta + d}$ and $y$ such as log-likelihood function.

Natural gradient descent solves the optimization problem $\min_{\theta} L(\theta)$ by $$ \theta^{(t+1)}=\theta^{(t)}-\alpha_{(t)}F^{-1}(\theta^{(t)})\nabla_{\theta}L(\theta^{(t)}) $$ where $F^{-1}(\theta^{(t)})$ is the inverse of the Riemannian metric at the point $\theta^{(t)}$. And the Fisher scoring algorithm is a typical application of natural gradient descent to statistics.
Natural gradient descent for manifolds corresponding to exponential families can be implemented as a first-order method through mirror descent.

Originator of Information Geometry: Shunichi Amari

Higher-Order Derivatives Methods

Higher-order methods, such as Newton, quasi-Newton and adaptive gradient descent methods, are extensively used in many scientific and engineering domains. At least in theory, these methods possess several nice features: they exploit local curvature information to mitigate the effects of ill-conditioning, they avoid or diminish the need for hyper-parameter tuning, and they have enough concurrency to take advantage of distributed computing environments. Researchers have even developed stochastic versions of higher-order methods, that feature speed and scalability by incorporating curvature information in an economical and judicious manner. However, often higher-order methods are “undervalued.”

The key observation, which underlies all results of this paper, is that an appropriately regularized Taylor approximation of convex function is a convex multivariate polynomial. This is indeed a very natural property since this regularized approximation usually belongs to the epigraph of convex function. Thus, the auxiliary optimization problem in the high-order (or tensor) methods becomes generally solvable by many powerful methods of Convex Optimization.

The most well-known third-order optimization method may be Halley's method (the tangent hyperbolas method).

Trust Region Methods

Trust-region methods define a region around the current iterate within which they trust the model to be an adequate representation of the objective function, and then choose the step to be the approximate minimizer of the model in this region. If a step is not acceptable, they reduce the size of the region and find a new minimizer. In general, the direction of the step changes whenever the size of the trust region is altered.

$$ \min_{p\in\mathbb{R}^n} m_k(p)=f_k+ g_k^T p+\frac{1}{2}p^T B_k p, \quad s.t.\ \|p\|\leq {\Delta}_k, $$ where $f_k=f(x^k), g_k = \nabla f(x^k)$; $B_k$ is some symmetric matrix as an approximation to the Hessian $\nabla^2 f(x^k + tp)$; $\Delta_k>0$ is the trust-region radius. In most of our discussions, we define $\|\cdot\|$ to be the Euclidean norm. Thus, the trust-region approach requires us to solve a sequence of subproblems in which the objective function and constraint are both quadratic. Generally, we set $B_k = \nabla^2 f(x^k)$.

Expectation Maximization Algorithm

Expectation-Maximization algorithm, popularly known as the EM algorithm has become a standard piece in the statistician’s repertoire. It is used in incomplete-data problems or latent-variable problems such as Gaussian mixture model in maximum likelihood estimation. The basic principle behind the EM is that instead of performing a complicated optimization, one augments the observed data with latent data to perform a series of simple optimizations.

It is also really popular among Bayesian statisticians.

Let $\ell(\theta|Y_{obs})\stackrel{\triangle}=\log{L(\theta | Y_{obs})}$ denote the log-likelihood function of observed datum $Y_{obs}$. We augment the observed data $Y_{obs}$ with latent variables $Z$ so that both the complete-data log-likelihood $\ell(\theta | Y_{obs}, Z)$ and the conditional predictive distribution $f(z|Y_{obs}, \theta)$ are available.

Each iteration of the EM algorithm consists of an expectation step (E-step) and a maximization step (M-step). Specifically, let $\theta^{(t)}$ be the current best guess at the MLE $\hat\theta$. The E-step is to compute the Q function defined by $$ \begin{align} Q(\theta|\theta^{(t)}) & \stackrel{\triangle}{=} \mathbb{E}(\ell(\theta | Y_{obs}, Z) |Y_{obs}, \theta^{(t)}) \\ &= \int_{Z}\ell(\theta | Y_{obs}, Z)\times f(z | Y_{obs}, \theta^{(t)})\mathrm{d}z, \end{align} $$

and the M-step is to maximize Q with respect to $\theta$ to obtain

$$ \theta^{(t+1)}=\arg\max_{\theta} Q(\theta|\theta^{(t)}). $$

Diagram of EM algorithm
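
As a concrete instance of the E- and M-steps, here is a minimal NumPy sketch of EM for a two-component one-dimensional Gaussian mixture; the initialization scheme and the simulated data are arbitrary choices for the illustration.

```python
import numpy as np

def em_gmm_1d(x, iters=100):
    """EM for a two-component 1-D Gaussian mixture: a sketch of the E- and M-steps above."""
    pi, mu1, mu2, s1, s2 = 0.5, x.min(), x.max(), x.std(), x.std()   # crude initialization
    normal = lambda v, m, s: np.exp(-0.5 * ((v - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    for _ in range(iters):
        # E-step: posterior responsibility of component 1 given the current parameters
        p1 = pi * normal(x, mu1, s1)
        p2 = (1 - pi) * normal(x, mu2, s2)
        gamma = p1 / (p1 + p2)
        # M-step: maximize the expected complete-data log-likelihood Q
        pi = gamma.mean()
        mu1 = (gamma * x).sum() / gamma.sum()
        mu2 = ((1 - gamma) * x).sum() / (1 - gamma).sum()
        s1 = np.sqrt((gamma * (x - mu1) ** 2).sum() / gamma.sum())
        s2 = np.sqrt(((1 - gamma) * (x - mu2) ** 2).sum() / (1 - gamma).sum())
    return pi, mu1, mu2, s1, s2

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
print(em_gmm_1d(data))
```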

Generalized EM Algorithm

Each iteration of the generalized EM algorithm consists of an expectation step (E-step) and a maximization step (M-step). Specifically, let $\theta^{(t)}$ be the current best guess at the MLE $\hat\theta$. The E-step is to compute the Q function defined by $$ Q(\theta|\theta^{(t)}) = \mathbb{E}[ \ell(\theta|Y_{obs}, Z)|Y_{obs},\theta^{(t)} ] \\ = \int_{Z}\ell(\theta|Y_{obs}, Z)\times f(z|Y_{obs}, \theta^{(t)})\mathrm{d}z, $$ and the other step is to find a $\theta$ that satisfies $Q(\theta^{(t+1)}|\theta^{(t)})>Q(\theta^{(t)}|\theta^{(t)})$, i.e. $$ \theta^{(t+1)}\in \{\hat{\theta}\mid Q(\hat{\theta}|\theta^{(t)}) \geq Q(\theta^{(t)}|\theta^{(t)}) \}. $$

It does not require maximizing the conditional expectation.

See more in the book The EM Algorithm and Extensions, 2nd Edition by Geoffrey McLachlan and Thriyambakam Krishnan.

Projected Gradient Method and More

We will focus on projected gradient descent and some of its non-Euclidean generalizations in order to solve some simply constrained optimization problems.

If not specified, all these methods are aimed to solve convex optimization problem with explicit constraints, i.e. $$ \arg\min_{x\in\mathbb{S}}f(x) $$

where $f$ is convex and differentiable, and $\mathbb{S}\subset\mathbb{R}^n$ is convex. The optimality condition for this constrained optimization problem is that no feasible direction is a descent direction: if $x^{\star}\in\mathbb{S}$ is the solution to the problem, then the variational inequality holds: $$\forall x \in \mathbb{S}, \left<\nabla f(x^{\star}),x-x^{\star} \right> \geq 0.$$

This variational inequality is the optimality condition of the constrained optimization problem.

We say the gradient $\nabla f$ of the convex function $f$ is a monotone operator because $$\left<x-y, \nabla f(x)-\nabla f(y)\right>\geq 0. $$

Projected Gradient Descent

Projected gradient descent has two steps: $$ z^{k+1} = x^{k}-\alpha_k\nabla_x f(x^{k}) \qquad \text{Descent}\\ x^{k+1} = Proj_{\mathbb{S}}(z^{k+1})=\arg\min_{x\in \mathbb{S}}|x-z^{k+1}|^{2} \qquad\text{Projection} $$

or in the compact form $$ x^{k+1} = \arg\min_{x\in \mathbb{S}}{|x-(x^{k}-\alpha_k\nabla_x f(x^{k}))|}^{2} \\= \arg\min_{x}\{{|x-(x^{k}-\alpha_k\nabla_x f(x^{k}))|}^{2} +\delta_{\mathbb{S}}(x) \} $$

where $\delta_{\mathbb{S}}$ is the indicator function of the set $\mathbb{S}$: $$ h(x) =\delta_{\mathbb{S}}(x)= \begin{cases} 0, & \text{if} \quad x \in \mathbb{S};\\ \infty, & \text{otherwise}. \end{cases} $$

Each projection is itself an optimization problem, so the iterates satisfy the optimality conditions; this also restricts projection methods to the case where the projection is available or simple to compute. And it is natural to search for a better descent step or projection step.
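
Here is a minimal sketch of the two-step projected gradient iteration for a case where the projection is trivial, a box constraint $\mathbb{S}=[0,1]^n$ whose Euclidean projection is a coordinate-wise clip; the toy objective and step size are assumptions for the example.

```python
import numpy as np

def projected_gradient_descent(grad, project, x0, alpha=0.1, iters=200):
    """Projected gradient descent: a plain gradient step followed by projection onto S (a sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        z = x - alpha * grad(x)   # descent step
        x = project(z)            # Euclidean projection back onto the feasible set
    return x

# toy example: minimize ||x - c||^2 over the box S = [0, 1]^n
c = np.array([1.5, -0.3, 0.7])
grad = lambda x: 2.0 * (x - c)
project = lambda z: np.clip(z, 0.0, 1.0)   # projection onto the box is a coordinate-wise clip
print(projected_gradient_descent(grad, project, np.zeros(3)))   # expect [1.0, 0.0, 0.7]
```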

The following links are recommended if you are interested in more theoretical proof:

For a non-differentiable but convex function, such as the absolute value function $f(x)=|x|$, we therefore consider sub-gradients in place of the gradient, where a sub-gradient $\phi$ of the convex function ${f}$ at the point $\hat{x}$ is any vector satisfying, for all $x$ in the domain of $f$ (denoted as $D_f$),

$$ \left<\phi, x-\hat{x}\right>\leq f(x) -f(\hat{x}),\forall x \in D_f. $$

Mirror descent

Mirror descent can be regarded as the non-Euclidean generalization via replacing the $\ell_2$ norm or Euclidean distance in projected gradient descent by Bregman divergence.

Bregman divergence is induced by convex smooth function ${h}$:

$$ B_h(x,y) = h(x) - h(y)-\left<\nabla h(y),x-y\right> $$

where $\left<\cdot,\cdot\right>$ is the inner product, and the divergence is also denoted as $D_h$.

The function ${h}$ is usually required to be strongly convex. And if the convex function ${h}$ is not differentiable, one element of the sub-gradient ${\partial h(y)}$ may replace the gradient $\nabla h(y)$.

It is convex in $x$ and $\frac{\partial B_h(x, y)}{\partial x} = \nabla h(x) - \nabla h(y)$.

Especially, when ${h}$ is quadratic function, the Bregman divergence induced by $h$ is $$ B_h(x, y)=x^2-y^2-\left<2y,x-y\right>=x^2+y^2-2xy=(x-y)^2 $$ i.e. the square of Euclidean distance.

A wonderful introduction to Bregman divergence is Meet the Bregman Divergences by Mark Reid at http://mark.reid.name/blog/meet-the-bregman-divergences.html.

The Bregman projection onto a convex set $C\subset \mathbb{R}^n$ given by $$ y^{\prime}= \arg\min_{x\in C} B(x,y) $$ is unique.

A generalized Pythagorean theorem holds: for convex $C\subset \mathbb{R}^n$ and for all $x\in C$ and $y\in \mathbb{R}^n$ we have $$B(x,y)\geq B(x,y^{\prime}) + B(y^{\prime},y)$$ where $y^{\prime}$ is the Bregman projection of ${y}$, and equality holds when the convex set C defining the projection $y^{\prime}$ is affine.


It is given in the projection form:

$$ z^{k+1} = x^{k}-\alpha_k\nabla_x f(x^{k}) \qquad \text{Gradient descent};\\ x^{k+1} = \arg\min_{x\in\mathbb{S}}B_h(x,z^{k+1}) \qquad\text{Bregman projection}. $$

In another compact form, mirror gradient can be described in the proximal form: $$ x^{k+1} = \arg\min_{x\in\mathbb{S}} { f(x^k) + \left<g^k, x-x^k\right> + \frac{1}{\alpha_k} B_h(x,x^k)}\tag {1} $$ with $g^k=\nabla f(x^k)$.

Note that the next iterate $x^{k+1}$ depends only on the current iterate $x^{k}$, not on earlier iterates.

By the optimality conditions of equation (1), the original "mirror" form of the mirror gradient method is described as $$ \nabla h(x^{k+1}) = \nabla h(x^k) - \alpha_k \nabla f(x^k) \\ = \nabla h(x^1) - \sum_{n=1}^{k}\alpha_n \nabla f(x^n) , \quad x\in \mathbb{S}, $$ where the convex function ${h}$ induces the Bregman divergence.

One special method is called entropic mirror descent (exponentiated gradient descent), where the Bregman divergence is induced by the negative entropy $h(x)=\sum_{i=1}^{n}x_i\log x_i$ and the constraint set $\mathbb{S}\subset\mathbb{R}^n$ is the simplex, i.e. $\sum_{i=1}^{n}x_i =1, \forall x_i \geq 0$.

Entropic descent method at step ${k}$ is given as follows:

$$ x_{i}^{k+1} = \frac{x_i^{k}\exp(-\alpha {\nabla f(x^k)}_{i})}{\sum_{j=1}^{n} x_j^{k}\exp(-\alpha {\nabla f(x^k)}_{j})}, \quad i=1,2,\dots, n. $$ It is obvious that entropic descent methods use a coordinate-wise update formula. What is more, the iteration can be rewritten as $$x^{k+1}=\frac{x^{1}\exp(-\alpha\sum_{n=1}^{k} \nabla f(x^n))}{\prod_{n=1}^{k}\left<x^n, \exp(-\alpha \nabla f(x^n))\right>}\propto x^{1}\exp(-\alpha\sum_{n=1}^{k} \nabla f(x^n)).$$

Multiplicative Weights Update is closely related with entropic descent method. See more on the following link list.
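
A minimal sketch of the entropic mirror descent (exponentiated gradient) update on the simplex follows; the toy objective and the fixed step size are assumptions made for the illustration.

```python
import numpy as np

def entropic_mirror_descent(grad, x0, alpha=0.1, iters=500):
    """Entropic mirror descent (exponentiated gradient) on the probability simplex: a sketch."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        w = x * np.exp(-alpha * grad(x))   # coordinate-wise multiplicative update
        x = w / w.sum()                    # renormalize back onto the simplex
    return x

# toy example: minimize <c, x> + <x, log x> over the simplex
c = np.array([1.0, 0.2, 0.5, 2.0])
grad = lambda x: c + np.log(x) + 1.0
print(entropic_mirror_descent(grad, np.ones(4) / 4.0))   # concentrates on small entries of c
```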

Proximal Gradient Method

Recall the projected gradient method:
it converts the constrained problem $\arg\min_{x}\{f(x)\mid x\in \mathbb{S}\}$ into the unconstrained (but discontinuous) problem $\arg\min_{x}\{f(x)+\delta_{\mathbb{S}}(x)\}$ and follows the iteration: $$ x^{k+1} = \arg\min_{x}\{{|x-(x^{k}-\alpha_k\nabla_x f(x^{k}))|}^{2} +\delta_{\mathbb{S}}(x) \} $$ where the function $\delta_{\mathbb{S}}(x)$ is obviously non-differentiable, defined as follows $$ h(x) =\delta_{\mathbb{S}}(x)= \begin{cases} 0, & \text{if} \quad x \in \mathbb{S};\\ \infty, & \text{otherwise}. \end{cases} $$

If the function $\delta_{\mathbb{S}}(x)$ is replaced by a general non-differentiable but convex function $\mathbf{h}(x)$, the proximal mapping (or prox-operator) of a convex function $\mathbf{h}$ is defined as $$ prox_h(x)=\arg\min_{u}\{\mathbf{h}(u)+\frac{1}{2} {|x-u|}_2^2\} $$

Unconstrained problem with cost function split into two components: $$\text{minimize} \qquad f(x) = g(x)+\mathbf{h}(x)$$

  • ${g}$ is convex, differentiable;
  • $\mathbf{h}$ is closed, convex, possibly non-differentiable while $prox_h(x)$ is inexpensive.

Proximal gradient algorithm $$x^{k}=prox_{t_k h} (x^{k-1}-t_k\nabla g(x^{k-1}))$$

$t_k > 0$ is the step size, constant or determined by line search.

$$x^+ = prox_{th}(x-t\nabla g(x))$$ from the definition of the proximal mapping: $$ x^+ = \arg\min_{u}\Big( h(u)+\frac{1}{2t}{\|u-(x-t\nabla g(x))\|}_2^2 \Big) \\= \arg\min_{u}\Big( h(u) + g(x) + \nabla g(x)^T(u-x) +\frac{1}{2t} {\|u-x\|}_2^2 \Big) \\= \arg\min_{u}\Big( h(u) + \nabla g(x)^T(u-x) +\frac{1}{2t} {\|u-x\|}_2^2 \Big) $$

$x^{+}$ minimizes $h(u)$ plus a simple quadratic local model of $g(u)$ around ${x}$.

And projected gradient method is a special case of proximal gradient method.

And it is natural to consider a more general algorithm by replacing the squared Euclidean distance in the definition of proximal mapping with a Bregman distance: $$ \arg\min_{u}\{h(u)+\frac{1}{2} {|x-u|}_2^2\}\to \arg\min_{u}\{h(u)+ B(x,u)\}, $$ so that the primary proximal gradient methods are modified to the Bregman version, which is called the Bregman proximal gradient method.
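
As a concrete instance of the proximal gradient iteration, here is a hedged sketch of ISTA for the lasso, where $g(x)=\frac{1}{2}{|Ax-b|}_2^2$ is the smooth part and $\mathbf{h}(x)=\lambda{|x|}_1$ has the soft-thresholding operator as its inexpensive prox; the synthetic data and the choice of step size $1/L$ are assumptions for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """prox of t*||.||_1: coordinate-wise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, iters=500):
    """Proximal gradient (ISTA) for the lasso: min 1/2||Ax-b||^2 + lam*||x||_1  (a sketch)."""
    x = np.zeros(A.shape[1])
    t = 1.0 / np.linalg.norm(A, 2) ** 2        # step size 1/L, L = Lipschitz constant of grad g
    for _ in range(iters):
        g = A.T @ (A @ x - b)                  # gradient of the smooth part g(x) = 1/2||Ax-b||^2
        x = soft_threshold(x - t * g, t * lam) # prox step on the nonsmooth part h(x) = lam*||x||_1
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.0, 0.5]
b = A @ x_true + 0.01 * rng.normal(size=50)
print(np.round(ista(A, b, lam=0.5), 3))        # should be sparse and close to x_true
```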

Proximal and Projected Newton Methods

A fundamental difference between gradient descent and Newton’s method was that the latter also iteratively minimized quadratic approximations, but these used the local Hessian of the function in question.

There are many such methods (Ellipsoid method, AdaGrad, ... ) with projection carried out in the $H_k$ metric to solve $\min_{x\in X}f(x)$:

  1. get subgradient $g^{k}\in\partial f(x^k)$
  2. update (diagonal) metric $H_k$
  3. update $x^{k+1}=\operatorname{Proj}_{X}^{H_k}(x^k - H_k^{-1}g^k)$

where $\operatorname{Proj}_{X}^{H_k}(y)=\arg\min_{x\in X}{\|x-y\|}_{H_k}^2$ and ${\|x-y\|}_{H_k}^2=(x-y)^TH_k(x-y)$.

Proximal Newton method can be also applied to such optimization problems: $$ y^{k} = prox_{H_{k-1}}(x^{k-1}-H_{k-1}^{-1}\nabla g(x^{k-1}))\\ x^{k} = x^{k-1}+t_{k}(y^{k}-x^{k-1}). $$

Douglas–Rachford method

$$\min_{x}{f(x)+g(x)}$$ where $f$ and $g$ are closed convex functions. Douglas–Rachford iteration: start at any $y^0$ and repeat for $k = 0, 1, \cdots,$ $$ x^{k+1} = prox_f(y^k)\\ y^{k+1} = y^k + prox_g(2x^{k+1} − y^k) − x^{k+1} $$

  • useful when $f$ and $g$ have inexpensive prox-operators;
  • $x^k$ converges to a solution of $0 \in \partial f (x) + \partial g(x)$ (if a solution exists);
  • not symmetric in $f$ and $g$.

The iteration can be written as fixed-point iteration $$y^{k+1} = F(y^k)$$ where $F(y)=y + prox_g(2prox_f(y) − y) − prox_f(y)$.
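
Below is a minimal sketch of the Douglas–Rachford fixed-point iteration for two functions with inexpensive prox-operators; the particular pair $f(x)=\lambda{|x|}_1$ and $g(x)=\frac{1}{2}{|x-b|}_2^2$ is an invented toy example whose exact minimizer (soft-thresholding of $b$) makes the result easy to check.

```python
import numpy as np

def douglas_rachford(prox_f, prox_g, y0, iters=200):
    """Douglas-Rachford iteration for min f(x)+g(x) using only the two prox operators (a sketch)."""
    y = np.asarray(y0, dtype=float)
    for _ in range(iters):
        x = prox_f(y)
        y = y + prox_g(2 * x - y) - x
    return prox_f(y)           # x^k converges to a minimizer

# toy example: f(x) = lam*||x||_1, g(x) = 1/2 ||x - b||^2
lam, b = 0.5, np.array([2.0, 0.3, -1.0])
prox_f = lambda v: np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)   # soft-thresholding
prox_g = lambda v: (v + b) / 2.0                                    # prox of 1/2||. - b||^2
print(douglas_rachford(prox_f, prox_g, np.zeros(3)))
# the exact minimizer here is soft-thresholding of b: [1.5, 0.0, -0.5]
```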

Proximal point algorithms and its Beyond

Now, let us consider the simple convex optimization $$\min_{x}\{\theta(x)+f(x)\mid x \in\mathcal X\},\tag{2.1}$$ where $\theta(x)$ and $f(x)$ are convex but $\theta(x)$ is not necessarily smooth, and $\mathcal X$ is a closed convex set. For solving (2.1), the $k$-th iteration of the proximal point algorithm (abbreviated to PPA) begins with a given $x^k$ and offers the new iterate $x^{k+1}$ via the recursion $$x^{k+1}=\arg\min_{x}\{\theta(x)+f(x)+\frac{r}{2}{|x-x^k|}_2^2\mid x \in\mathcal X\}.\tag{2.2}$$

Since $x^{k+1}$ is the optimal solution of (2.2), it follows from the optimality condition that $$\theta(x)-\theta(x^{k+1})+(x-x^{k+1})^{T}\{ \nabla f(x^{k+1}) + r( x^{k+1} - x^k) \}\geq 0 \quad\forall x\in\mathcal X.$$

It is proved that the sequence $\{x^k\}$ generated by PPA is Fejér **monotone**, i.e., $|x^{k+1}-x^{\ast}|^2\leq |x^{k}-x^{\ast}|^2-|x^{k+1}-x^{k}|^2$.

$\color{quad}{Note}$: the materials in this section are taken from the lectures of Bingsheng He.

Prediction-Correction Methods

The optimal condition of the linearly constrained convex optimization is characterized as a mixed monotone variational inequality:

$$w^{\ast}\in\Omega, \quad \theta(u)-\theta(u^{\ast})+(w - w^{\ast})^{T} F(w^{\ast})\geq 0 \quad\forall w\in\Omega.$$


[Prediction Step.] : With given $v^k$, find a vector $\tilde{w}^k\in \Omega$ such that $$\theta(u)-\theta(\tilde u^{k})+(w - \tilde w^{k})^{T} F(\tilde w^{k})\geq (v-\tilde v^{k})^T Q (v^{k}-\tilde v^{k}),\quad \forall w\in\Omega$$ where the matrix $Q$ is not necessarily symmetric, but $Q^T+Q$ is positive definite.

[Correction Step.] : The new iterate $v^{k+1}$ is given by $$v^{k+1} =v^{k}- \alpha M(v^{k}-\tilde v^{k}).$$


Convergence Conditions: For the matrices $Q$ and $M$ above, there is a positive definite matrix $H$ such that $$HM = Q. \tag{2.20a}$$ Moreover, the matrix $$G = Q^T + Q − \alpha M^T H M \tag{2.20b}$$ is positive semi-definite.

Customized PPA

Now, let us consider the simple convex optimization $$\min_{u}\{\theta(u)\mid Au=b, u \in\mathcal U\}.\tag{2.3}$$

The related variational inequality of the saddle point of the Lagrangian function is $$w^{\ast}\in \Omega, \theta(u) - \theta(u^{\ast})+(w-w^{\ast})^T F(w^{\ast})\geq 0 \quad\forall w\in\Omega$$ where $w=(u, \lambda)^T, F(w)=(-A^T\lambda, Au-b)^T$.

For given $v^k = w^k = (u^k, \lambda^k)$, the predictor is given by $$ \begin{aligned} \tilde{u}^{k} &=\arg \min \left\{L\left(u, \lambda^{k}\right)+\frac{r}{2}{\left\|u-u^{k}\right\|}^{2} \mid u \in \mathcal{U}\right\}, \\ \tilde{\lambda}^{k} &=\arg \max \left\{L\left(\left[2 \tilde{u}^{k}-u^{k}\right], \lambda\right)-\frac{s}{2}{\left\|\lambda-\lambda^{k}\right\|}^{2}\right\}. \end{aligned} $$


$\color{aqua}{Note}$: the projection from a point $x^0$ into a subset $C\subset\mathbb{R}^n$ is defined in proximal operator as

$$ x^+ =\arg\min_{x}\{\delta_C(x)+\frac{1}{2}{|x-x^0|}_2^2\} $$

while it can also be written in the following form: $$ x^{+}=\arg\min_{x}\{\frac{1}{1_C(x)}\cdot {|x-x^0|}_2^2\} $$

where $$ 1_C(x)= \begin{cases} 1, & \text{if} \quad x \in C,\\ 0, & \text{otherwise}. \end{cases} $$ How can we generalize this form into the proximal form? And what is the difference from the original additive proximal operator?

If $x^0$ is in the set ${C}$, the projection can be rewritten with a proximal operator: $$ x^{+}=\arg\min_{x}\{\exp[\delta_C(x)]\cdot \frac{1}{2}{|x-x^0|}_2^2\}. $$

How can we generalize the function $\delta_{C}$ to a more general convex function? What are the advantages of this generalization? As with likelihood and log-likelihood, we can transfer a product of functions into a sum by taking the logarithm.

$$ x^{+} = \arg\min_{x}\{\exp[\delta_C(x)]\cdot \frac{1}{2}{|x-x^0|}_2^2\} \\ = \arg\min_{x}\log\{\exp[\delta_C(x)]\cdot \frac{1}{2}{|x-x^0|}_2^2\} \\= \arg\min_{x} \{\delta_C(x)+\log({|x-x^0|}_2^2)\}. $$

Penalty/Barrier Function Methods

In projected gradient method, it converts the constrained problem into an unsmooth problem.


| constrained problem | unconstrained problem | desired properties |
| --- | --- | --- |
| $\arg\min_{x}\{f(x)\mid x\in \mathbb{S}\}$ | $\arg\min_{x}\{f(x)+\delta_{\mathbb{S}}(x)\}$ | unsmooth, inexpensive |
| $\arg\min_{x}\{f(x)\mid x\in \mathbb{S}\}$ | $\arg\min_{x}\{f(x) + g_{\mathbb{S}}(x)\}$ | smooth, inexpensive |
| $\arg\min_{x}\{f(x)\mid g(x)\in \mathbb{S}\}$ | $\arg\min_{x}\{f(x)+\fbox{?}\}$ | smooth, inexpensive |

In constrained optimization, a field of mathematics, a barrier function is a continuous function whose value on a point increases to infinity as the point approaches the boundary of the feasible region of an optimization problem.

Recall the indicator function of the constrained set in projected methods: $$ \delta_{\mathbb{S}}(x)= \begin{cases} 0, & \text{if} \quad x \in \mathbb{S};\\ \infty, & \text{otherwise}, \end{cases} $$ there is a dramatic leap between the boundary of $\mathbb S$ and the exterior.

The basic idea of the barrier method is to approximate the indicator function by some functions ${f}$ satisfying:

  • convex and differentiable;
  • $f(x)< \infty$ if $x\in\mathbb S$;
  • $f(x)\to\infty$ as $x\to\partial\mathbb S$, i.e. as the point approaches the boundary.

For example, Barrier Method in Convex Optimization: Fall 2018 pushes the inequality constraints in the problem in a smooth way, reducing the problem to an equality-constrained problem. Consider now the following minimization problem: $$ \min_{x} f(x) \\ \text{subject to} \quad h_n(x) \leq 0, n=1,2,\cdots, N \\ Ax=b $$ where $f, h_1, h_2,\cdots, h_N$ are assumed to be convex and twice differentiable functions in $\mathbb R^p$. The log-barrier function is defined as $$\phi(x)=-\sum_{n=1}^{N}\log(-h_n(x)).$$

When we ignore equality constraints, the problem above can be written by incorporating the inequality constraints with the indicator function, which in turn can be approximated by the log-barrier function as follows $$\min_{x} f(x)+\sum_{n=1}^{N}\mathbb{I}\{h_n(x)\leq 0\}\approx \min_{x} f(x)+\frac{1}{t}\phi(x)$$

for $t > 0$ being a large number.

A penalty method replaces a constrained optimization problem by a series of unconstrained problems whose solutions ideally converge to the solution of the original constrained problem. The unconstrained problems are formed by adding a term, called a penalty function, to the objective function that consists of a penalty parameter multiplied by a measure of violation of the constraints. The measure of violation is nonzero when the constraints are violated and is zero in the region where constraints are not violated.

In the context of the equality-constrained problem $$\min_{x} f(x), \quad s.t.\ c_n(x)=0, n = 1,2,\cdots, N,$$

the quadratic penalty function $Q(x;\mu)$ for this formulation is $$Q(x;\mu)=f(x)+\frac{\mu}{2}\sum_{n=1}^{N}c_n^2(x)$$ where $\mu>0$ is the penalty parameter. By driving $\mu\to\infty$, we penalize the constraint violations with increasing severity.

Quadratic Penalty Method

  • Given $\mu_0 > 0$, a nonnegative sequence $\{\tau_k\}$ with $\tau_k\to 0$, and a starting point $x^s_0$;
  • for $k = 0, 1, 2, . . .$:
    • Find an approximate minimizer $x^k=\arg\min_{x}Q(x; \mu_k)$, starting at $x^s_k$ and terminating when ${|\nabla_x Q(x;\mu_k)|}_2\leq \tau_k$;
    • if final convergence test satisfied,
      • stop with approximate solution $x^k$;
    • end if
    • Choose new penalty parameter $\mu_{k+1} > \mu_{k}$;
    • Choose new starting point $x^s_{k+1}$
  • end (for)
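
A hedged sketch of the quadratic penalty method follows; the inner minimizer here is a crude fixed-step gradient descent on $Q(x;\mu_k)$, the step-size rule and the toy equality-constrained problem are assumptions for the example, and a practical implementation would use a better inner solver.

```python
import numpy as np

def quadratic_penalty(f_grad, c, c_jac, x0, mu0=1.0, tol=1e-6, outer=20):
    """Quadratic penalty method (a sketch): approximately minimize
    Q(x; mu) = f(x) + mu/2 * sum_n c_n(x)^2 for an increasing sequence of mu."""
    x = np.asarray(x0, dtype=float)
    mu = mu0
    for _ in range(outer):
        # inner loop: rough gradient descent on Q(.; mu), warm-started at the previous x
        for _ in range(2000):
            g = f_grad(x) + mu * c_jac(x).T @ c(x)   # grad of Q(x; mu)
            if np.linalg.norm(g) < tol:
                break
            x = x - 0.01 / mu * g                    # small step; crude but enough for a sketch
        mu *= 10.0                                   # drive mu toward infinity
    return x

# toy example: min x1 + x2  s.t.  x1^2 + x2^2 - 2 = 0   (solution near (-1, -1))
f_grad = lambda x: np.array([1.0, 1.0])
c = lambda x: np.array([x[0]**2 + x[1]**2 - 2.0])
c_jac = lambda x: np.array([[2*x[0], 2*x[1]]])
print(quadratic_penalty(f_grad, c, c_jac, np.array([-0.5, -0.5])))
```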

Path Following Methods

The main ideas of path following by predictor–corrector and piecewise-linear methods, and their application in the direction of homotopy methods and nonlinear eigenvalue problems are reviewed. Further new applications to areas such as polynomial systems of equations, linear eigenvalue problems, interior methods for linear programming, parametric programming and complex bifurcation are surveyed. Complexity issues and available software are also discussed.

Lagrange Multipliers and Duality

It is to solve the constrained optimization problem $$\arg\min_{x}f(x), \quad s.t.\quad g(x)=b.$$

The barrier or penalty function methods add terms to the objective, giving $f(x)+\Omega(g(x)-b)$, and thus convert constrained problems into unconstrained problems by introducing an artificial penalty for violating the constraint. For example, $$P(x, \lambda)= f(x) + \lambda {|g(x)-b|}_2^2$$ where the penalty function is $\Omega(x) = {|x|}_2^2$ and $\lambda\in\mathbb{R}^{+}$. We can regard it as a surrogate loss technique. Although the penalty function is convex and differentiable, it can be more difficult than directly optimizing $f(x)$ when the constraint is complicated.

Lagrange Multipliers and Generalized Lagrange Function

The penalty function methods do not take the optimal conditions into consideration although it works. If $x^{\star}$ is in the solution set of the optimization problem above, it is obvious that $Ax^{\star}=b$ and $L(x^{\star}, \lambda)=f(x^{\star})$ where $$L(x, \lambda)= f(x) + \lambda^T(Ax-b).$$

In another direction, we want to prove that $x^{\star}$ is the optima of the optimization problem if $$ L(x^{\star}, {\lambda}^{\star})=\quad \min_{x}[\max_{\lambda} L(x, \lambda)]. $$

By the definition, $L(x^{\star}, {\lambda}^{\star})\geq L(x^{\star}, \lambda)=f(x^{\star})+\lambda^T(Ax^{\star}-b)$, which implies that $\lambda^T(Ax^{\star}-b)=0$ i.e., $Ax^{\star}=b$. And $L(x^{\star}, {\lambda}^{\star}) = f(x^{\star})\leq L(x, {\lambda}^{\star})\forall x$ if $Ax=b$ thus $x^{\star}$ is the solution to the primary problem.

Its dual problem is of the following form: $$\max_{\lambda}[\min_{x} L(x, \lambda)].$$

Note that $$ \min_{x} L(x, \lambda)\leq L(x,\lambda)\leq \max_{\lambda}L(x,\lambda) $$ implies $$ \max_{\lambda}[\min_{x} L(x, \lambda)]\leq \max_{\lambda} L(x, \lambda)\leq \min_{x}[\max_{\lambda} L(x, \lambda)]. $$

And note that a necessary condition for an extremum is that the gradients equal 0: $$ \frac{\partial L(x,\lambda)}{\partial x}= \frac{\partial f(x)}{\partial x}+ A^T\lambda = 0 \\ \frac{\partial L(x,\lambda)}{\partial \lambda} = Ax-b=0 $$

Dual ascent takes advantage of these properties:

  1. $x^{k+1}=\arg\min_{x} L(x,{\lambda}^{k})$;
  2. ${\lambda}^{k+1}= {\lambda}^{k}+\alpha_k(Ax^{k+1}-b)$.

Entropy minimization algorithms: $$ x^{k+1}\in \arg\min_{x}{f(x)+\frac{1}{c_k} \sum_{i=1}^{n} x^{i}(\ln(\frac{x_i}{x^{k}_{i}})-1)}. $$

Exponential Augmented Lagrangian Method

A special case for the convex problem $$ \text{minimize}\quad f(x),\ s.t. g_1(x) \leq 0, g_2(x) \leq 0, \cdots, g_r(x) \leq 0, x\in X $$

is the exponential augmented Lagrangian method.

It consists of unconstrained minimizations:

$$ x^{k}\in \arg\min_{x\in X}\Big\{f(x)+\frac{1}{c_k} \sum_{j=1}^{r} {\mu}^{k}_{j}\exp(c_k g_j(x))\Big\} $$ followed by the multiplier iterations $$ {\mu}^{k+1}_{j} = {\mu}^{k}_{j}\exp(c_k g_j(x^k)). $$

If the constraints are more complex, KKT theorem may be necessary.

Generalized Lagrangian function

Minimize $f(x)$, subject to $g_i(x) \ge 0, i = 1,\cdots, m$. Here, $x \in\Omega$, and $\Omega$ is a subset of the Euclidean space $E$. We assume that $f(x)$ and $g_i(x)$ are twice continuously differentiable. A generalized Lagrangian function is defined as $$L(x, \sigma) = f(x) - G[g(x), \sigma]. $$

KKT condition

Given the general problem $$ \begin{aligned} \min_{x \in \mathbb{R}^{n}}\ & f(x) \\ \text { subject to }\ & h_{i}(x) \leq 0, \quad i=1, \ldots, m \\ & \ell_{j}(x)=0, \quad j=1, \ldots, r \end{aligned} $$

We define the Lagrangian: $$L(x, u, v) = f(x) +\sum_{i=1}^{m}u_ih_i(x)+\sum_{j=1}^r v_j \ell_j(x).$$

The Karush-Kuhn-Tucker conditions or KKT conditions are: $$\begin{aligned} &\bullet\quad 0 \in \partial f(x)+\sum_{i=1}^{m} u_{i} \partial h_{i}(x)+\sum_{j=1}^{r} v_{j} \partial \ell_{j}(x) &\text{(stationarity)} \\ &\bullet\quad u_{i} \cdot h_{i}(x)=0 \text { for all } i &\text{ (complementary slackness) } \\ &\bullet\quad h_{i}(x) \leq 0, \ell_{j}(x)=0 \text { for all } i, j &\text{(primal feasibility) } \\ &\bullet\quad u_{i} \geq 0 \text{ for all } i &\text{ (dual feasibility) } \end{aligned}$$

I learnt this theorem in a graduate-level functional analysis course.

Conjugate Duality

Consider the standard form convex optimization problem in the absence of data uncertainty $$\min_{x} f(x)\tag{P}$$

where $f$ is a proper lower semi-continuous convex function.

This problem can be embedded into a family of parameterized problems $$\min_{x} g(x, y)\tag{$P_y$}$$ where the function $g(x, y)$ satisfies $g(x, 0) = f(x)$.

Splitting Methods

In the following context we will discuss optimization problems with linear constraints: $$ \arg\min_{x} f(x) \\ s.t.\ Ax = b $$ where $f(x)$ is always convex.

ADMM

Alternating direction method of multipliers is called ADMM for short. It is aimed at solving the following convex optimization problem: $$ \min F(x,y) =f(x)+g(y) \\ \text{subject to }\quad Ax+By =b $$ where $f(x)$ and $g(y)$ are convex; ${A}$ and ${B}$ are matrices.

Define the augmented Lagrangian:

$$ L_{\beta}(x, y, \lambda)=f(x)+g(y) - \lambda^{T}(Ax + By -b)+ \frac{\beta}{2}{|Ax + By - b|}_{2}^{2}. $$


Augmented Lagrange Method at step $k$ is described as following:

  1. $(x^{k+1}, y^{k+1})=\arg\min_{x\in\mathbf{X}}L_{\beta}(x, y,\lambda^{\color{aqua}{k}});$
  2. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$

ADMM at step $t$ is described as following:

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x,y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}});$
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}});$
  3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$

The convergence proof of ADMM in convex optimization can be reduced to verifying the stability of a dynamical system or based on the optimal condition variational inequality like On the O(1/t) convergence rate of alternating direction method.
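
To make the three ADMM steps concrete, here is a hedged NumPy sketch for the lasso written in the two-block form above with $A=I$, $B=-I$, $b=0$, i.e. $\min \frac{1}{2}{|Dx-d|}_2^2+\lambda{|y|}_1$ s.t. $x-y=0$; the synthetic data and the penalty parameter $\beta$ are assumptions for the example.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(D, d, reg, beta=1.0, iters=200):
    """ADMM sketch for min 1/2||Dx-d||^2 + reg*||y||_1  s.t.  x - y = 0
    (the two-block template above with A = I, B = -I, b = 0)."""
    n = D.shape[1]
    x = y = lam = np.zeros(n)
    DtD, Dtd = D.T @ D, D.T @ d
    M = np.linalg.inv(DtD + beta * np.eye(n))        # factor once; reused in every x-update
    for _ in range(iters):
        x = M @ (Dtd + lam + beta * y)               # x-subproblem (smooth, quadratic)
        y = soft_threshold(x - lam / beta, reg / beta)   # y-subproblem (prox of reg*||.||_1)
        lam = lam - beta * (x - y)                   # multiplier update
    return y

rng = np.random.default_rng(0)
D = rng.normal(size=(60, 25))
x_true = np.zeros(25); x_true[:4] = [1.0, -2.0, 0.5, 3.0]
d = D @ x_true + 0.01 * rng.normal(size=60)
print(np.round(admm_lasso(D, d, reg=1.0), 3))        # sparse estimate close to x_true
```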

Linearized ADMM

Note that the $x$ subproblem in ADMM $$ \begin{aligned} &\arg\min_{x}L_{\beta}(x,y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}})\\ =&\arg\min_{x}\{f(x)+g(y^k)-{\lambda^k}^{T}(Ax+By^{k}-b)+\frac{\beta}{2}{\|Ax+By^{k}-b\|}_{2}^{2}\} \\ =&\arg\min_{x}f(x)+\frac{\beta}{2}{\|Ax+By^{k}-b-\frac{1}{\beta}\lambda^k\|}_{2}^{2} \end{aligned}\tag 1 $$

However, the solution of the subproblem (1) does not have a closed form because of the general structure of the matrix ${A}$. In this case, we linearize the quadratic term $$\frac{\beta}{2}{|Ax+By^{k}-b-\frac{1}{\beta}\lambda^k|}_{2}^{2}$$

at $x^k$ and add a proximal term $\frac{r}{2}{|x-x^k|}_2^2$ to the objective function. In other words, we solve the following ${x}$ subproblem (ignoring the constant term of the objective function): $$ \min_{x} f(x)+\beta(Ax)^T(A x^k + B y^k - b-\frac{1}{\beta}\lambda^k) + \frac{r}{2}{|x - x^k|}_2^2. $$

  1. $x^{k+1}=\arg\min_{x\in \mathbf{X}} f(x) + \beta(A x)^T (A x^k + B y^k - b -\frac{1}{\beta}\lambda^k)+ \frac{r}{2}{|x - x^k|}_2^2$,
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}( x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}} )$,
  3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}} - b).$

For given $\beta > 0$, choose ${r}$ such that the matrix $rI_{1}-\beta A^TA$ is positive semi-definite, i.e., $$rI_{1}-\beta A^TA\geq 0.$$

We can also linearize the ${y}$ subproblem:

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x, y^k, \lambda^k)$,
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} g(y) + \beta (By)^{T}(Ax^{\color{red}{k+1}} + B y^k - b -\frac{1}{\beta}\lambda^k) + \frac{r}{2}|y - y^k|^2$,
  3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$

For given $\beta > 0$, choose ${r}$ such that the matrix $rI_{2}-\beta B^T B$ is positive semi-definite, i.e., $$rI_{2}-\beta B^T B\geq 0.$$


Taking $\mu\in(0, 1)$ (usually $\mu=0.9$), the Symmetric ADMM is described as

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x, y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}})$,
  2. $\lambda^{k + \frac{1}{2}} = \lambda^{k} - \mu\beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k}}-b)$,
  3. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k+\frac{1}{2}}})$,
  4. $\lambda^{k+1} = \lambda^{\color{red}{k+\frac{1}{2}}} - \mu\beta (A x^{\color{red}{k+1}} + B y^{\color{red}{k+1}}-b)$.

$\color{aqua}{\text{Thanks to Professor He Bingsheng who taught me those.}}$[^9]

He Bingsheng


One particular variant of ADMM is also called the split Bregman method. And Bregman ADMM replaces the quadratic penalty function with a Bregman divergence: $$ L_{\beta}^{\phi}(x, y)=f(x)+g(y) - \lambda^{T}(Ax + By - b) + \frac{\beta}{2} B_{\phi}(b- Ax, By). $$

where $B_{\phi}$ is the Bregman divergence induced by the convex function $\phi$.

BADMM

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}^{\phi}(x,y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}});$
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}^{\phi}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}});$
  3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$

Relaxed, Inertial and Fast ADMM

$$\min_{x}f(x)+g(Ax)$$ The penalty parameter is $\rho > 0$ and the relaxation parameter is $\alpha\in (0, 2)$. Standard ADMM is recovered with $\alpha = 1$.

Family of relaxed A-ADMM algorithms for the above problem

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}f(x)+\frac{\rho}{2}{|Ax- y^k+ \lambda^k|}^2$,
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} g(y) + \frac{\rho}{2}{|\alpha Ax^{\color{red}{k+1} }+(1-\alpha_k)y^k-y+\lambda^k|}^2$,
  3. $\lambda^{k+1} = \lambda^{k} +\alpha Ax^{\color{red}{k+1}}+(1-\alpha_k)y^k-z^{\color{red}{k+1}}$.

Family of relaxed ADMM algorithms for the above problem The damping constant is $r \geq 3$.

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}f(x)+\frac{\rho}{2}{|Ax-\hat y^k+\lambda^k|}^2$,
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} g(y) + \frac{\rho}{2}{|\alpha Ax^{\color{red}{k+1} }+(1-\alpha_k)\hat y^k-y+\lambda^k|}^2$,
  3. $\lambda^{k+1} = \hat\lambda^{k} +\alpha Ax^{\color{red}{k+1}}+(1-\alpha_k)\hat y^k-z^{\color{red}{k+1}}$,
  4. $\gamma_{k+1}=\frac{k}{k+r}$,
  5. $\hat\lambda^{k+1}=\lambda^{k+1}+\gamma_{k+1}(\lambda^{k+1}-\lambda^{k})$,
  6. $\hat y^{k+1}=y^{k+1}+\gamma_{k+1}(y^{k+1}-y^{k})$.

Fast ADMM

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}(x,\hat y^{\color{aqua}{k}},\hat\lambda^{\color{aqua}{k}});$
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}(x^{\color{red}{k+1}}, y, \hat\lambda^{\color{aqua}{k}});$
  3. $\lambda_{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$
  4. $\alpha_{k+1}=\frac{1+\sqrt{1+4\alpha_k^2}}{2}$
  5. $\hat y^{k+1}=y^{k+1}+\frac{\alpha_{k}-1}{\alpha_{k+1}}(y^{k}-y^{k-1})$
  6. $\hat \lambda^{k+1}=\lambda^{k+1}+\frac{\alpha_{k}-1}{\alpha_{k+1}}(\lambda^{k}-\lambda^{k-1})$

Multi-Block ADMM

Firstly we consider the following optimization problem

$$ \min f_1(x_1) + f_2(x_2) + \cdots + f_n(x_n)\\ s.t.\quad A_1x_1 + A_2x_2 + \cdots + A_n x_n = b, \\ x_i\in\mathop{X_i}\in\mathbb{R}^{d_i}, i=1,2,\cdots, n. $$

We defined its augmented Lagrangian multipliers as

$$ L_{\beta}^{n}(x_1,x_2,\cdots, x_n\mid \lambda)=\sum_{i=1}^{n} f_i(x_i) -\lambda^T (\sum_{i=1}^{n} A_i x_i - b) + \frac{\beta}{2} {({|\sum_{i=1}^{n} A_i x_i - b|})}_{2}^{2}. $$

Particularly, we firstly consider the case when $n=3$:

$$ L_{\beta}^{3}(x, y, z\mid \lambda)=f_1(x) + f_2(y) + f_3(z)-\lambda^T (A_1 x + A_2 y + A_3 z - b) \\+\frac{\beta}{2}{|A_1 x + A_2 y + A_3 z - b|}_2^2. $$

It is natural and computationally beneficial to extend the original ADMM directly to solve the general n-block problem. A counter-example shows that this method diverges.


Professor Bingsheng He, who taught me this section in his class, and his coauthors proposed some schemes for this problem based on his unified framework for convex optimization and monotone variational inequalities.

Parallel splitting augmented Lagrangian method (abbreviated to PSALM) is described as follows:

  1. $x^{k+1}=\arg\min_{x}\{L_{\beta}^3(x,y^k,z^k,\lambda^k)\mid x\in\mathbb{X}\}$;
  2. $y^{k+1}=\arg\min_{y}\{L_{\beta}^3(x^{\color{red}{k+1}},y,z^k,\lambda^k)\mid y\in\mathbb{Y}\}$;
  3. $z^{k+1}=\arg\min_{z}\{L_{\beta}^3(x^{\color{red}{k+1}},y^{\color{yellow}{k}},z,\lambda^k)\mid z\in\mathbb{Z}\}$;
  4. $\lambda^{k+1} = {\lambda}^{k}-\beta(A_1x^{k+1}+A_2y^{k+1}+A_3z^{k+1}-b)$.

We can add one more correction step $$ v^{k+1} := v^{k}-\alpha(v^k - v^{k+1}),\alpha\in (0, 2 -\sqrt{2}) $$

where $v^{k+1}=(y^{k+1},z^{k+1},\lambda^{k+1})$.

Another approach is to add regularization terms:

  1. $x^{k+1}=\arg\min_{x}\{L_{\beta}^3(x, y^k, z^k, \lambda^k)\mid x\in\mathbb{X}\}$,
  2. $y^{k+1}=\arg\min_{y}\{L_{\beta}^3(x^{\color{red}{k+1}}, y, z^k,\lambda^k)+\color{red}{\frac{\tau}{2}\beta{|A_2(y-y^k)|}^{2}}\mid y\in\mathbb{Y}\}$,
  3. $z^{k+1}=\arg\min_{z}\{L_{\beta}^3(x^{\color{red}{k+1}}, y^{\color{yellow}{k}}, z, \lambda^k)+\color{blue}{\frac{\tau}{2}\beta{|A_3(z - z^k)|}^{2}}\mid z\in\mathbb{Z}\}$,
  4. $\lambda^{k+1} = {\lambda}^{k} - \beta(A_1 x^{k+1} + A_2 y^{k+1} + A_3 z^{k+1}-b)$,

where $\tau>1$.


Davis-Yin three operator splitting

If $f_1$ is strongly convex, then applying Davis-Yin splitting (to the dual problem) gives:

  1. $x^{k+1}=\arg\min_{x}{L^3_{\beta}(\color{green}{x},y^k,z^k,\lambda^k)\mid x\in\mathbb{X}}$;
  2. $y^{k+1}=\arg\min_{y}{L_{\beta}^3(x^{k+1},\color{green}{y},z^k,\lambda^k)\mid y\in\mathbb{Y}}$;
  3. $z^{k+1}=\arg\min_{z}{L_{\beta}^3(x^{k+1},y^{k+1},\color{green}{z},\lambda^k)\mid z\in\mathbb{Z}}$;
  4. $\lambda^{k+1} = {\lambda}^{k}-\beta(A_1x^{k+1}+A_2y^{k+1}+A_3z^{k+1}-b)$.

where the notation $L^3(x, y, z, \lambda)$, defined by $$L^3(x, y, z, \lambda) = f_1(x) + f_2(y) + f_3(z)-{\lambda}^T(A_1 x + A_2 y + A_3 z - b),$$ is the Lagrangian rather than the augmented Lagrangian.


Randomly Permuted ADMM given initial values at round $k$ is described as follows:

  1. Primal update
    • Pick a permutation $\sigma$ of $\{1,\dots, n\}$ uniformly at random;
    • For $i = 1,2,\cdots, n$, compute ${x}^{k+1}_{\sigma(i)}$ by $$ x^{k+1}_{\sigma(i)}=\arg\min_{x_{\sigma(i)}} L(x^{k+1}_{\sigma(1)},\cdots, x^{k+1}_{\sigma(i-1)}, x_{\sigma(i)}, x^{k}_{\sigma(i+1)},\cdots\mid \lambda^{k}). $$
  2. Dual update. Update the dual variable by $${\lambda}^{k+1}={\lambda}^{k}-\mu(\sum_{i=1}^{n}A_i x_i^{k+1} -b).$$

dlADMM

$$ \begin{array}{l}{\text{Problem 1.}} \\ {\min_{W_{l}, b_{l}, z_{l}, a_{l}} R\left(z_{L} ; y\right)+\sum_{l=1}^{L} \Omega_{l}\left(W_{l}\right)} \\ {\text { s.t. } z_{l}=W_{l} a_{l-1}+b_{l}\ (l=1, \cdots, L),\quad a_{l}=f_{l}\left(z_{l}\right)\ (l=1, \cdots, L-1)}\end{array} $$ where $a_0\in\mathbb R^{n_0}$ is the input of the deep neural network, $n_0$ is the number of feature dimensions, and $y$ is a predefined label vector. $R\left(z_{L} ; y\right)$ is a risk function for the L-th layer, which is convex, continuous and proper, and $\Omega_{l}\left(W_{l}\right)$ is a regularization term for the l-th layer, which is also convex, continuous, and proper.

Rather than solving Problem 1 directly, we can relax Problem 1 by adding an $\ell_2$ penalty to address Problem 2 as follows:

$$ \begin{array}{l}{\text{Problem 2.}} \\ {\begin{aligned} \min_{W_{l}, b_{l}, z_{l}, a_{l}} F(\boldsymbol{W}, \boldsymbol{b}, \boldsymbol{z}, \boldsymbol{a})=& R\left(z_{L} ; y\right)+\sum_{l=1}^{L} \Omega_{l}\left(W_{l}\right) \\ &+\underbrace{(\nu / 2) \sum_{l=1}^{L-1}\left({|z_{l}-W_{l} a_{l-1}-b_{l}|}_{2}^{2}+{|a_{l}-f_{l}\left(z_{l}\right)|}_{2}^{2}\right)}_{\text{$\ell_2$ penalty}} \\ &\text { s.t. } z_{L}=W_{L} a_{L-1}+b_{L} \end{aligned}}\end{array} $$

where $\mathbf{W}=\{W_{l}\}_{l=1}^{L}, \mathbf{b}=\{b_{l}\}_{l=1}^{L}, \mathbf{z}=\{z_{l}\}_{l=1}^{L}, \mathbf{a}=\{a_{l}\}_{l=1}^{L-1}$ and $\nu >0$ is a tuning parameter.

Compared with Problem 1, Problem 2 has only a linear constraint $z_L = W_L a_{L-1} + b_L$ and hence is easier to solve.


Monotone Operator Splitting Methods for Optimization

Monotone operator splitting methods, which originated in the late 1970’s in the context of partial differential equations, have started to be highly effective for modeling and solving a wide range of data analysis and processing problems, in particular high-dimensional statistical data analysis.

Operator splitting is to decompose one complicated operator (procedure) into some simple operators (procedures). For example, ADMM splits the min–max operator of the augmented Lagrangian $$ \arg\min_{x,y}\max_{\lambda} L_{\beta}(x,y\mid \lambda) $$ into three operators: $$ \arg\min_{x}L_{\beta}(x,y\mid \lambda) \circ \arg\min_{y}L_{\beta}(x,y\mid \lambda) \circ \arg\max_{\lambda} L_{\beta}(x,y\mid \lambda). $$

They are really some block relaxation techniques.

PDFP and PDHG

Primal-dual Fixed Point Algorithm

We demonstrate how different algorithms can be obtained by splitting the problems in different ways through the classic example of the sparsity regularized least squares model with constraints. In particular, a class of linearly constrained problems, which are of great interest in the context of multi-block ADMM, can also be solved by PDFP with a guarantee of convergence. Finally, some experiments are provided to illustrate the performance of several schemes derived from the PDFP algorithm.

Primal-Dual Hybrid Gradient

The Primal-Dual Hybrid Gradient (PDHG) method, also known as the Chambolle-Pock method, is a powerful splitting method that can solve a wide range of constrained and non-differentiable optimization problems. Unlike the popular ADMM method, the PDHG approach usually does not require expensive minimization sub-steps.

PDHG solves general saddle-point problems of the form $$\min_{x\in X}\max_{y\in Y} f(x)+\left<Ax, y\right>-g(y)$$ where $f$ and $g$ are convex functions, and $A$ is a linear operator.


  • $\hat x^{k+1}=x^k -\tau_k A^Ty^k$;
  • $x^{k+1}=\arg\min_{x\in X}f(x)+\frac{1}{2\tau_k}{|x-\hat x^{k+1}|}^2$;
  • $\tilde x^{k+1}=x^{k+1}+(x^{k+1}-x^{k})$;
  • $\hat y^{k+1}=y^k +\sigma_k A\tilde x^{k+1}$;
  • $y^{k+1}=\arg\min_{y\in Y} g(y)+\frac{1}{2\sigma_k}{|y-\hat y^{k+1}|}^2$.
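
To make the scheme concrete, here is a minimal Python sketch (not any particular author's implementation) of the PDHG iteration above applied to $\min_x \frac{1}{2}{|x-b|}^2+\lambda{|Ax|}_1$, in which $g$ becomes the indicator of the box $[-\lambda,\lambda]^m$ so both proximal steps have closed forms; the toy data and step sizes are assumptions for illustration.

```python
import numpy as np

def pdhg(A, b, lam=0.5, n_iter=500):
    """PDHG sketch for min_x 0.5*||x - b||^2 + lam*||Ax||_1, written as the
    saddle problem min_x max_y 0.5*||x - b||^2 + <Ax, y> - g(y),
    with g the indicator of the box {||y||_inf <= lam}."""
    m, n = A.shape
    L = np.linalg.norm(A, 2)        # operator norm of A
    tau = sigma = 0.9 / L           # step sizes satisfying tau*sigma*||A||^2 < 1
    x, y = np.zeros(n), np.zeros(m)
    for _ in range(n_iter):
        x_old = x
        # primal step: prox of f(x) = 0.5*||x - b||^2 at the point x - tau*A^T y
        x = (x - tau * (A.T @ y) + tau * b) / (1.0 + tau)
        x_bar = 2 * x - x_old       # extrapolation step
        # dual step: prox of g, i.e. projection onto the box [-lam, lam]^m
        y = np.clip(y + sigma * (A @ x_bar), -lam, lam)
    return x

# Toy example (made-up data).
A = np.array([[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]])
b = np.array([1.0, 5.0, 2.0])
print(pdhg(A, b))
```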

Linear Programming

A linear program (LP) is an optimization problem in which the objective function is linear in the unknowns and the constraints consist of linear equalities and linear inequalities.

The standard problem is $$\min_{x}\mathbf{c^Tx} \\ s.t.\quad \mathbf{Ax=b, x\geq 0}.$$

Here $\mathbf x$ is an n-dimensional column vector, $\mathbf{c^T}$ is an n-dimensional row vector, $\mathbf{A}$ is an $m \times n$ matrix, and $\mathbf b$ is an m-dimensional column vector. The vector inequality $\mathbf{x\geq 0}$ means that each component of $\mathbf x$ is nonnegative.
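A minimal sketch of solving such a standard-form LP with SciPy's `linprog` (the data $\mathbf{c}$, $\mathbf{A}$, $\mathbf{b}$ below are made up for illustration; the solver's default variable bounds already encode $\mathbf{x\geq 0}$):

```python
import numpy as np
from scipy.optimize import linprog

# Standard-form LP: min c^T x  s.t.  A x = b, x >= 0.
c = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])

# bounds default to (0, None) for every variable, i.e. x >= 0.
res = linprog(c, A_eq=A, b_eq=b, method="highs")
print(res.x, res.fun)   # optimal point and optimal value
```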

Fundamental Theorem of Linear Programming. Given a linear program in standard form where $\mathbf A$ is an $m × n$ matrix of rank $m$,

  1. if there is a feasible solution, there is a basic feasible solution;
  2. if there is an optimal feasible solution, there is an optimal basic feasible solution.

Linear programming is a constrained convex optimization problem. It is the simplest case of constrained optimization in theory. However, it is useful in many cases.

If there are no constraints, a nonzero linear objective function is unbounded.

Fixed Point Iteration Methods

The fixed point algorithm is initially to find approximate solutions of the equation

$$f(x)=0\tag 1$$ where $f:\mathbb{R}\to\mathbb{R}$.

In this method, we first rewrite equation (1) in the following form $$x=g(x)\tag 2$$

in such a way that any solution of the equation (2), which is a fixed point of the function ${g}$, is a solution of equation (1). For example, we can set $g(x)=f(x)+x, g(x)=x-f(x)$. Then consider the following algorithm.

  1. Give the initial point $x^{0}$;
  2. Compute the recursive procedure $x^{n+1}=g(x^n), n=0,1,2,\ldots$

So that finally we obtain a sequence $\{x^0, x^1, \cdots, x^{n},\cdots\}$. There are many methods to test whether this sequence is convergent or not, as learnt in calculus.
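
A minimal sketch of this recursive procedure in Python, assuming the contraction $g(x)=\cos x$ as a toy example:

```python
import math

def fixed_point(g, x0, tol=1e-10, max_iter=1000):
    """Iterate x_{n+1} = g(x_n) until successive iterates are close."""
    x = x0
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: solve x = cos(x); the fixed point is approximately 0.739085.
print(fixed_point(math.cos, 1.0))
```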

To introduce the acceleration schemes, we define some order of convergence.

The order of convergence is defined as the constant $p$ such that $\lim_{n\to \infty}\frac{| x^{n+1}-x^{\ast}|}{| x^{n}-x^{\ast}|^p}=C$ for some finite $C>0$, provided that $\lim_{n\to\infty}x^{n}=x^{\ast}$.

| name | order of convergence |
| --- | --- |
| sublinear | $p<1$ |
| linear | $p = 1$ and $C < 1$ |
| superlinear | $p >1$ |
| quadratic | $p = 2$ |

Solving nonlinear equations

Solving a nonlinear equation $$f(x)=0$$ means finding points $x^{\ast}$ such that $f(x^{\ast})=0$, where $f:\mathbb R\mapsto \mathbb R$ is nonlinear.

There are many methods for determining the value of $x^{\ast}$ by successive approximation. With any such method we begin by choosing one or more values $x_0, x_1, \cdots, x_r$, more or less arbitrarily, and then successively obtain new values $x_n$, as certain functions of the previously obtained $x_0, x_1, x_2,\cdots, x_{n-1}$ and possibly those of the derivatives $f^{\prime}(x_0), \cdots, f^{\prime}(x_{n-1})$ or higher derivative information.

Bisection method

Find a midpoint of interval $(a^k, b^k )$ and designate it $x^{k+1}=\frac{a^k+b^k}{2}$.

$$ \left(a_{k+1}, b_{k+1}\right) =\left\{\begin{array}{ll}{\left(a_{k}, x_{k+1}\right),} & {\text { if } \quad f\left(a_{k}\right) f\left(x_{k+1}\right)<0} \\ {\left(x_{k+1}, b_{k}\right),} & {\text { if } \quad f\left(a_{k}\right) f\left(x_{k+1}\right)>0}\end{array}\right. $$
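
A minimal sketch of the bisection update above, assuming $f(a)f(b)<0$ on the initial interval (the test function is an assumption for illustration):

```python
def bisection(f, a, b, tol=1e-10, max_iter=200):
    """Bisection: halve the bracketing interval, keeping the sign change."""
    fa = f(a)
    for _ in range(max_iter):
        x = 0.5 * (a + b)
        fx = f(x)
        if abs(fx) < tol or 0.5 * (b - a) < tol:
            return x
        if fa * fx < 0:
            b = x              # root lies in (a, x)
        else:
            a, fa = x, fx      # root lies in (x, b)
    return 0.5 * (a + b)

# Example: root of f(x) = x^2 - 2 on [0, 2] is sqrt(2).
print(bisection(lambda x: x * x - 2.0, 0.0, 2.0))
```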

Regula falsi (false position) method

$$x^{k+1}=x^k-\alpha \frac{b^k -a^{k}}{f(b^k) - f(a^{k})}f(x^k)$$

and

$$ \left(a_{k+1}, b_{k+1}\right) =\left\{\begin{array}{ll}{\left(a_{k}, x_{k+1}\right),} & {\text { if } \quad f\left(a_{k}\right) f\left(x_{k+1}\right)<0} \\ {\left(x_{k+1}, b_{k}\right),} & {\text { if } \quad f\left(a_{k}\right) f\left(x_{k+1}\right)>0}\end{array}\right. $$

Secant method

The $(k+1)$-th approximation of the root is obtained by $$x^{k+1}=x^k-\alpha \frac{x^k -x^{k-1}}{f(x^k) - f(x^{k-1})}f(x^k)$$
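
A minimal sketch of the secant iteration above (with $\alpha=1$); the test function is an assumption for illustration:

```python
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Secant iteration: replace the derivative by a difference quotient."""
    for _ in range(max_iter):
        f0, f1 = f(x0), f(x1)
        if abs(f1) < tol:
            return x1
        x0, x1 = x1, x1 - (x1 - x0) / (f1 - f0) * f1
    return x1

# Example: root of x^3 - x - 2 near 1.52 (two starting points required).
print(secant(lambda x: x**3 - x - 2.0, 1.0, 2.0))
```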

Newton’s method

$$x^{k+1}=x^k-\alpha {(f^{\prime}(x^k))}^{-1}f(x^k)$$
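
A minimal sketch of the (damped) Newton iteration above; the example function and its derivative are assumptions for illustration:

```python
def newton(f, fprime, x0, alpha=1.0, tol=1e-12, max_iter=50):
    """Damped Newton iteration x^{k+1} = x^k - alpha * f(x^k) / f'(x^k)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        x = x - alpha * fx / fprime(x)
    return x

# Example: sqrt(2) as the positive root of x^2 - 2.
print(newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0))
```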

Steffensen’s Method

Steffensen’s method is a modified Newton’s method that replaces the derivative $f^{\prime}(x^k)$ with a difference quotient:

$$x^{k+1}=x^k-\alpha {\left(\frac{f(x^k+f(x^k))-f(x^k)}{f(x^k)}\right)}^{-1}f(x^k)$$

Muller's Method

Muller’s method is a generalization of the secant method, in the sense that it does not require the derivative of the function.

It is an iterative method that requires three starting points $(p_0, f (p_0)), (p_1, f (p_1)),$ and $(p_2, f (p_2))$. A parabola is constructed that passes through the three points; then the quadratic formula is used to find a root of the quadratic for the next approximation.

Chebyshev Method

Halley's Methods

Cauchy's Methods

Aitken’s $\Delta^2$ method

  • Let $\{p_{n}\}$ be generated by a method which has linear convergence,
  • Having $p_{0}, p_{1}$ and $p_{2}$, compute $\hat{p}_{0}=p_{0}-\frac{\left(p_{1}-p_{0}\right)^{2}}{p_{2}-2 p_{1}+p_{0}}$; then for $n=1,2, \ldots$
    1. compute $p_{n+2}$;
    2. compute $\hat{p}_{n}=p_{n}-\frac{\left(p_{n+1}-p_{n}\right)^{2}}{p_{n+2}-2 p_{n+1}+p_{n}}$; and
    3. the algorithm terminates with $p \approx \hat{p}_{n}$ if $\left|\hat{p}_{n}-\hat{p}_{n-1}\right|<\epsilon$.
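
A minimal sketch of Aitken's $\Delta^2$ scheme above, applied to the linearly convergent fixed point iterates of $x\mapsto\cos x$ (the example sequence is an assumption for illustration):

```python
import math

def aitken(seq, n_terms=10):
    """Aitken's Delta^2 acceleration of a linearly convergent sequence p_n,
    given as a generator `seq` of the original iterates."""
    p = [next(seq) for _ in range(3)]
    accelerated = []
    for _ in range(n_terms):
        p0, p1, p2 = p[-3], p[-2], p[-1]
        accelerated.append(p0 - (p1 - p0) ** 2 / (p2 - 2 * p1 + p0))
        p.append(next(seq))        # fetch p_{n+2} for the next step
    return accelerated

def cos_iterates(x=1.0):
    """Fixed point iterates of x_{n+1} = cos(x_n)."""
    while True:
        yield x
        x = math.cos(x)

print(aitken(cos_iterates())[-1])  # much closer to 0.739085... than the raw iterates
```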

Homotopy Continuation Methods

Homotopy Methods transform a hard problem into a simpler one with easily calculated zeros and then gradually deform this simpler problem into the original one, computing the zeros of the intervening problems and eventually ending with a zero of the original problem.

The homotopy method (continuation method, successive loading method) can be used to generate a good starting value.

Suppose one wishes to obtain a solution to a system of $N$ nonlinear equations in $N$ variables, say $$F(x)=0$$ where $F : \mathbb R^N \to \mathbb R^N$ is a mapping which, for purposes of beginning our discussion we will assume is smooth.

If such a priori knowledge of a good starting point is not available, the iteration will often fail, because poor starting values are likely to be chosen.

We construct a parameter-dependent function $$H(x, s) = sF(x) + (1 - s)F_0(x), s\in [0,1]$$ and note that $H(x, 0) = 0$ is the problem with known solution and $H(x, 1) = 0$ is the original problem $F(x) = 0$. As the solution of $H(x, s) = 0$ depends on $s$ we denote it by $x^{\ast}(s)$. We now discretize the interval into $0 = s_0 < s_1 < \cdots < s_n = 1$ and solve a sequence of nonlinear systems with Newton’s method $$H(x, s_i) = 0$$
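
A minimal sketch of this continuation strategy, assuming $F_0(x)=x-x_0$ (so the start problem has the known solution $x_0$) and using SciPy's `fsolve` as a stand-in for the inner Newton-type solver; the example system is made up for illustration:

```python
import numpy as np
from scipy.optimize import fsolve

def homotopy_solve(F, x0, n_steps=10):
    """Continuation sketch: H(x, s) = s*F(x) + (1 - s)*(x - x0).
    Track the solution as s increases from 0 to 1, warm-starting each solve."""
    x = np.asarray(x0, dtype=float)
    for s in np.linspace(0.0, 1.0, n_steps + 1)[1:]:
        H = lambda z, s=s: s * np.asarray(F(z)) + (1.0 - s) * (z - x0)
        x = fsolve(H, x)          # warm-started by the previous solution
    return x

# Example: a small nonlinear system with roots at (1, 1) and (-1, -1).
F = lambda z: [z[0] ** 2 + z[1] ** 2 - 2.0, z[0] - z[1]]
print(homotopy_solve(F, x0=np.array([5.0, -3.0])))
```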


In high dimensional space, it is a little different. Fixed point iteration as well as the fixed point itself arises in many cases such as [https://arxiv.org/pdf/1511.06393.pdf].

The contraction mapping ${F}:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is defined by $$|F(x)-F(y)|\leq \alpha|x-y|, \forall x,y \in\mathbb{R}^{d},\alpha\in[0,1).$$ Thus $\lim_{|x-y|\to 0}\frac{|F(x)-F(y)|}{|x-y|}\leq \alpha\in[0,1)$.

Now we rewrite the necessary condition of unconstrainted optimization problems $\nabla f(x) = 0$ to the fixed point equation:

$$ \begin{align} \nabla f(x) = 0 \Rightarrow & x - \alpha\nabla f(x) = x \\ \nabla f(x) = 0 \Rightarrow g(x) - \alpha\nabla f(x) = g(x) \Rightarrow & x - g^{-1}(\alpha\nabla f(x)) = x\\ \nabla f(x) = 0\Rightarrow H(x)x- \alpha\nabla f(x) = H(x)x \Rightarrow & x - \alpha H(x)^{-1} \nabla f(x) = x \\ \nabla f(x) = 0 \Rightarrow M(x)\nabla f(x) = 0 \Rightarrow & x -\alpha M(x)\nabla f(x) = x \end{align} $$

where $H(x)$ is the Hessian matrix.

These correspond to gradient descent, mirror gradient methods, Newton's methods and quasi-Newton's methods, respectively.

And the projected (sub)gradient methods are in the fixed-point iterative form: $$ x = Proj_{\mathbb{S}}(x-\alpha \nabla f(x)) $$

as are the mirror gradient and proximal gradient methods, with the projection operator replaced by the corresponding mirror or proximal operators.

Expectation maximization can also be viewed as a fixed point iteration, as can the iteration of a Markov chain.

http://www.drkhamsi.com/fpt/books.html

The following figures are from Formulations to overcome the divergence of iterative method of fixed-point in nonlinear equations solution.

Fixed Point Iterations

Gauss-Seidel Method and Component Solution Methods

The mapping $\hat T_i :X\mapsto X$, corresponding to an update of the ith block-component only, is given by $$\hat T_i(x)=\hat T_i(x_1, \dots, x_m)=(x_1,\dots, x_{i-1}, T_i(x), x_{i+1}, \dots, x_m).$$

Updating all block-components of $x$, one at a time in increasing order, is equivalent to applying the mapping $S: X\mapsto X$, defined by $$S=\hat{T}_m\circ \hat{T}_{m-1}\circ\cdots \circ \hat{T}_1,$$ where $\circ$ denotes composition.
An equivalent definition of $S$ is given by the equation $$S_i(x)=T_{i}(S_1(x), \dots, S_{i-1}(x), x_i, \dots, x_m)$$ where $S_i:X\mapsto X_i$ is the $i$th block-component of $S$. The mapping $S$ will be called the Gauss-Seidel mapping based on the mapping $T$, and the iteration $x(t+1)=S(x(t))$ will be called the Gauss-Seidel algorithm based on the mapping $T$.

The system $x=T(x)$ can be decomposed into $m$ smaller systems of equations of the form $$x_i=T_i(x_1,\dots, x_m), \quad i=1,\dots,m,$$ which have to be solved simultaneously. We will consider an algorithm that solves at each iteration the $i$th equation in the system for $x_i$, while keeping the other components fixed.

Given a vector $x(t)\in X$, the $i$th block-component $x_i(t+1)$ of the next vector is chosen to be a solution of the $i$th equation in the system, that is, $$x_i(t+1)\in\{y_i\in X_i\mid y_i=T_i(x_1,\dots,x_{i-1}, y_i, x_{i+1},\dots, x_m)\}.$$

ISTA and FISTA

The $\ell_1$ regularization is to solve the ill-conditioned equations such as $$\min_{x}{|Ax-b|}_2^2+\lambda{|x|}_1.$$

It is less sensitive to outliers and obtain much more sparse solutions (as opposed to $\ell_2$ regularization). Its application includes and is not restricted in LASSO, compressed sensing and sparse approximation of signals.

It is clear that the absolute value function is not smooth or differentiable everywhere, even though the objective function ${|Ax-b|}_2^2+\lambda{|x|}_1$ is convex. Hence it is not ideal to solve this problem by plain gradient-based methods.

The Iterative Shrinkage-Thresholding Algorithm (ISTA) for $\ell_1$ regularization is $$x^{k+1}=\mathbf{T}_{\lambda t}(x^{k}-tA^{T}(Ax^k-b))$$ where $t> 0$ is a step size and $\mathbf{T}_{\alpha}$ is the shrinkage operator defined by $${\mathbf{T}_{\alpha}(x)}_{i}={(|x_i|-\alpha)}_{+}\operatorname{sgn}(x_{i})$$ where $x_i$ is the $i$-th component of $x\in\mathbb{R}^{n}$.
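
A minimal sketch of the ISTA update above for the (rescaled) objective $\frac{1}{2}{|Ax-b|}_2^2+\lambda{|x|}_1$; the constant step size $t=1/{\|A\|}_2^2$ is an assumption:

```python
import numpy as np

def soft_threshold(x, alpha):
    """Shrinkage operator: T_alpha(x)_i = (|x_i| - alpha)_+ * sgn(x_i)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def ista(A, b, lam=0.1, n_iter=500):
    """ISTA for min_x 0.5*||Ax - b||_2^2 + lam*||x||_1 with a constant step size."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2      # t <= 1/L with L the gradient Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - t * A.T @ (A @ x - b), lam * t)
    return x
```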

FISTA with constant stepsize

  • $x^{k}= p_{L}(y^k)$ computed as ISTA;
  • $t_{k+1}=\frac{1+\sqrt{1+4t_k^2}}{2}$;
  • $y^{k+1}=x^k+\frac{t_k -1}{t_{k+1}}(x^k-x^{k-1})$.

This leads to the operator splitting methods analyzed by Wotao Yin and others.

Generic Acceleration Framework

  • Given
    • existing optimization procedure $M(f, x)$
    • previous iterates $x^{1},x^{2}, \cdots ,x^{k}$ and
    • new proposed guess $x^P = M(f, x^{k})$.
  • Find step $x^{k+1}$ using information at $x^{1},x^{2}, \cdots ,x^{k}, x^P$.

Relaxation and inertia

We will focus here on

  • unit memory,
  • accelerations using past operation outputs OR iterates.

Given a fixed point iteration $x^{k+1}={T}(x^k)$, there are two simple acceleration schemes.

$$\begin{aligned} x^{k+1}&=T_{1}\left(x^{k}+\nu^{k}\left(x^{k}-x^{k-1}\right)\right) \\ x^{k+2}&=T_{2}\left(x^{k+1}+\nu^{k+1}\left(x^{k+1}-x^{k}\right)\right) \end{aligned}$$

| Relaxation | Inertia | Alternated Inertia |
| --- | --- | --- |
| $T_1=T, T_2=I$ | $T_1=T, T_2=T$ | $T_1=T, T_2=I$ |
| $\nu^{k}=0, \nu^{k+1}=\eta^{k / 2}-1$ | $\nu^{k}=\gamma^{k}, \nu^{k+1}=\gamma^{k+1}$ | $\nu^{k}=0, \nu^{k+1}=\gamma^{k+1}$ |
| $x^{k+1}=T\left(x^{k}\right)$ | $x^{k+1}=T(x^{k}+\gamma^{k}\left(x^{k}-x^{k-1}\right))$ | $x^{k+1}=T\left(x^{k}\right)$ |
| $x^{k+2}=x^{k+1}+\nu^{k+1}\left(x^{k+1}-x^{k}\right)$ | $x^{k+2}=T(x^{k+1}+\gamma^{k+1}\left(x^{k+1}-x^{k}\right))$ | $x^{k+2}=T(x^{k+1}+\gamma^{k+1}\left(x^{k+1}-x^{k}\right))$ |

Anderson Acceleration

Let $H$ be a Hilbert space equipped with a symmetric inner product $\left<\cdot, \cdot\right>: H \times H \to R$. Let $T : H \to H$ be a nonexpansive mapping and consider for fixed $x_0 \in H$ the Halpern iteration (named after Benjamin Halpern, who introduced it): $$x^{k+1}= (1-{\alpha}_k)x^0+ {\alpha}_k T(x^k), {\alpha}_k \in (0,1)$$

with ${\alpha}_k = \frac{k+1}{k+2}$ for approximating a fixed point of $T$.

It is proved that $\frac{1}{2}{| x^k -T(x^k)|}\leq\frac{|x^0-x^{\ast} |}{k+1}$.

Krasnosel'skii-Mann(KM, or averaged) iterations update $x^k$ in iteration ${k}$ to

$$ x^{k+1}=(1-\alpha_k)x^k+ {\alpha}_k T(x^k), {\alpha}_k \in (0,1) $$

for the fixed point problem (2). Specially, $T(x^k)= x^k-\alpha_k \nabla f(x^k)$ for convex optimization problems. If the operator $T$ is non-expansive, then the sequence $\{x^{k}\mid k=0,1,2,\dots\}$ is convergent by the Krasnosel'skii-Mann theorem.

In 1953, Mann defined an iterative method: $$ x^{k+1}=M(x^k, \alpha_k , T)=(1-\alpha_k)x^k+ {\alpha}_k T(x^k),\quad \alpha_k \in [0,1),\ \text{satisfying}\ \sum_{k=1}^{\infty}{\alpha}_k = \infty. $$

The sequence ${x^{k}}$ defined by

$$ x^{k+1}= (1-\alpha_k)x^k+ {\alpha}_k T(x^k), \quad 0 < a\leq {\alpha}_k \leq b < 1 $$ (additionally $\sum_{k=1}^{\infty}{\alpha}_k = \infty$) is called a modified Mann iteration. Take two sequences $\{\alpha_k\}, \{\beta_k\}$ in $[0, 1]$ satisfying $$ \sum_{k=1}^{\infty}{\alpha}_k {\beta}_k = \infty, \quad \lim_{k\to\infty}\beta_k =0, \quad 0 \leq {\alpha}_k \leq \beta_k \leq 1. $$

Then the sequence ${x^{k}}$ defined by

$$ x^{k+1}= (1-\alpha_k)x^k+ {\alpha}_k T(y^k), \\ y^{k} = (1-\beta_k)x^k + {\beta}_k T(x^k), $$

is called the Ishikawa iteration.

Hojjat Afshari and Hassen Aydi propose another Mann type iteration: $$ x^{k+1} = {\alpha}_k x^k + {\beta}_k T(x^k) + {\gamma}_k T^{2}(x^k) $$

where ${\alpha}_k + {\beta}_k + {\gamma}_k = 1$, ${\alpha}_k, {\beta}_k, {\gamma}_k \in [0,1)$ for all $k\geq 1$, $\sum_{k=1}^{\infty}(1-\alpha_k)=\infty$, and $\sum_{k=1}^{\infty}\gamma_k < \infty$.

If the Mann type iteration ${x^k}$ converges strongly to a point $p$, then $p$ is a fixed point of $T$.

There is an acceleration framework of fixed point iterations for the problem (2) called Anderson Acceleration or regularized nonlinear acceleration

  1. $F_k = (h_{k-m_k}, \dots, h_k)$ where $h_i=g(x_i)-x_i$;
  2. Determine $\alpha^{k}=(\alpha_0^{k},\dots, \alpha_{m_k}^{k})^{T}$ that solves $\min_{\alpha^{k}}{|F_k\alpha^k|}_2^2$ s.t. $\sum_{i=0}^{m_k}\alpha_i^{k}=1$;
  3. Set $x_{k+1}=\sum_{i=0}^{m_k} \alpha_i^{k} g(x_{k - m_k + i})$.
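
A minimal sketch of Anderson acceleration in the equivalent unconstrained (Walker–Ni) form of the least-squares step 2 above; the cosine contraction used as a test problem is an assumption for illustration:

```python
import numpy as np

def anderson(g, x0, m=5, n_iter=50):
    """Anderson acceleration of x = g(x): the constrained problem over alpha is
    solved in its equivalent unconstrained least-squares form over gamma."""
    X, G = [np.asarray(x0, dtype=float)], [g(np.asarray(x0, dtype=float))]
    x = G[-1]                                  # x_1 = g(x_0)
    for k in range(1, n_iter):
        X.append(x)
        G.append(g(x))
        F = [Gi - Xi for Gi, Xi in zip(G, X)]  # residuals h_i = g(x_i) - x_i
        mk = min(m, k)
        dF = np.column_stack([F[-j] - F[-j - 1] for j in range(mk, 0, -1)])
        dG = np.column_stack([G[-j] - G[-j - 1] for j in range(mk, 0, -1)])
        gamma, *_ = np.linalg.lstsq(dF, F[-1], rcond=None)
        x = G[-1] - dG @ gamma                 # accelerated iterate x_{k+1}
    return x

# Example: accelerate the contraction x -> cos(x), treated as a 1-vector.
print(anderson(lambda v: np.cos(v), np.array([1.0]), m=3, n_iter=15))
```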

It may be interesting to introduce some Bregman divergence $D_{f}(\cdot, \cdot)$ instead of the squared $\ell_2$ norm when choosing $\alpha^{k}$, so that $$\alpha^{k}=\arg\min_{\alpha^k}\{D_{f}(F_k \alpha^k)\mid \sum_{i=0}^{m_k}\alpha_i^{k}=1\}.$$ Thus we would use mirror gradient methods to solve this subproblem.

It is proved that Anderson acceleration converges if the fixed point mapping is contractive.

The Baillon-Haddad Theorem provides an important link between convex optimization and fixed-point iteration, which proves that if the gradient of a convex and continuously differentiable function is non-expansive, then it is actually firmly non-expansive.

Anderson Acceleration of the Alternating Projections Method for Computing the Nearest Correlation Matrix

A correlation matrix is symmetric, has unit diagonal, and is positive semidefinite. Frequently, asynchronous or missing observations lead to the obtained matrix being indefinite.

A standard way to correct an invalid correlation matrix, by which we mean a real, symmetric indefinite matrix with unit diagonal, is to replace it by the nearest correlation matrix in the Frobenius norm, that is, by the solution of the problem $$\min\{ {|A - X|}_F : X \text{ is a correlation matrix} \},$$ where ${|A|}_F^2=\sum_{ij}a_{ij}^2$.

DAAREM

Anderson Accelerated Douglas-Rachford Splitting

If the gradient equals zero, i.e., $\nabla f(x)=0$, it is possible to land at saddle points. This is why the fixed point iteration matters. It seems that ADMM (or, more generally, the Lagrangian multiplier method) is not a fixed point iteration. A fixed point iteration is of the form $x^{k+1}=f(x^k)$ where $f$ is usually explicitly expressed.

However, this form readily generalizes to arbitrary mappings or operators, as in functional analysis. In this sense, ADMM really is a fixed point iteration: $(x^{k+1}, y^{k+1}, \lambda^{k+1})=ADMM(x^k, y^k, \lambda^k)$.

Approximate Minimal Polynomial Extrapolation

From linear to nonlinear iterative methods is not always direct and correct.

The following optimization problem $$\min f(x)=\frac{1}{2}{|Ax-b|}_2^2$$ is equivalent to solving the linear system $Ax=b$ when the system is consistent.

Given $A\in\mathbb{R}^{n\times n}$ such that 1 is not an eigenvalue of $A$ and $v\in\mathbb{R}^n$, the minimal polynomial of $A$ with respect to the vector $v$ is the lowest degree polynomial $p(x)$ such that $p(A)v = 0, p(1) = 1$.

As shown before, the acceleration schemes are based on a linear combination of the last iterates. A natural question is why the combination is linear; why not some other extrapolation of the last updated values?

Nemirovski’s Acceleration

Let $f$ be a 1-smooth function. Denote $x^{+} = x - \nabla f(x)$. The algorithm simply returns the optimal combination of the conjugate point and the gradient descent point, that is:

$$x_{t+1}=\arg\min_{x\in P_t} f(x)$$ where $P_t=\operatorname{span}\{x_t^{+},\sum_{s=1}^t \lambda_s\nabla f(x^s)\}$.

It resembles a trust region method in which one search region is given.

Damped Inertial Gradient System

Damped Inertial Gradient System (DIGS)

Alternating Anderson-Richardson method

Regularized Nonlinear Acceleration

We describe a convergence acceleration technique for generic optimization problems. Our scheme computes estimates of the optimum from a nonlinear average of the iterates produced by any optimization method. The weights in this average are computed via a simple linear system, whose solution can be updated online. This acceleration scheme runs in parallel to the base algorithm, providing improved estimates of the solution on the fly, while the original optimization method is running. Numerical experiments are detailed on classical classification problems.

$$ \begin{array}{l}{\text { Input: Sequence }\left\{x_{0}, x_{1}, \ldots, x_{k+1}\right\}, \text { parameter } \lambda>0} \\ {\text { 1: Form } U=\left[x_{1}-x_{0}, \ldots, x_{k+1}-x_{k}\right]} \\ {\text { 2: Solve the linear system }\left(U^{T} U+\lambda I\right) z=\mathbf{1}} \\ {\text { 3: Set } c=z /\left(z^{T} \mathbf{1}\right)} \\ {\text { Output: Return } \sum_{i=0}^{k} c_{i} x_{i}, \text { approximating the optimum } x^{\ast}} \end{array} $$
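
A minimal sketch of this RNA procedure; the gradient-descent iterates it extrapolates come from a made-up quadratic, and the regularization parameter is an assumption:

```python
import numpy as np

def rna(xs, lam=1e-8):
    """Regularized nonlinear acceleration of a list of iterates xs = [x_0, ..., x_{k+1}]."""
    X = np.column_stack(xs)
    U = X[:, 1:] - X[:, :-1]                  # U = [x_1 - x_0, ..., x_{k+1} - x_k]
    k1 = U.shape[1]
    z = np.linalg.solve(U.T @ U + lam * np.eye(k1), np.ones(k1))
    c = z / z.sum()                           # normalize so the weights sum to one
    return X[:, :k1] @ c                      # weighted combination of x_0, ..., x_k

# Example: accelerate plain gradient descent on a toy quadratic 0.5*x^T A x - b^T x.
A = np.diag([1.0, 10.0]); b = np.array([1.0, 1.0])
x = np.zeros(2); xs = [x.copy()]
for _ in range(10):
    x = x - 0.05 * (A @ x - b)
    xs.append(x.copy())
print(rna(xs), np.linalg.solve(A, b))         # extrapolation vs exact optimum
```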

Objective Acceleration

O-ACCEL (objective acceleration), is novel in that it minimizes an approximation to the objective function on subspaces of $\mathbb{R}^n$. We prove that O-ACCEL reduces to the full orthogonalization method for linear systems when the objective is quadratic, which differentiates our proposed approach from existing acceleration methods. Comparisons with the limited-memory Broyden–Fletcher–Goldfarb–Shanno and nonlinear conjugate gradient methods indicate the competitiveness of O-ACCEL.

In my opinion, however, it is best to start from the necessary conditions for optima in non-convex optimization. Another question is how to generalize the fixed point iteration to stochastic gradient methods.

Direct Nonlinear Acceleration

Optimization acceleration techniques such as momentum play a key role in state-of-the-art machine learning algorithms. Recently, generic vector sequence extrapolation techniques, such as regularized nonlinear acceleration (RNA) of Scieur et al. (Scieur et al., 2016), were proposed and shown to accelerate fixed point iterations. In contrast to RNA which computes extrapolation coefficients by (approximately) setting the gradient of the objective function to zero at the extrapolated point, we propose a more direct approach, which we call direct nonlinear acceleration (DNA). In DNA, we aim to minimize (an approximation of) the function value at the extrapolated point instead. We adopt a regularized approach with regularizers designed to prevent the model from entering a region in which the functional approximation is less precise. While the computational cost of DNA is comparable to that of RNA, our direct approach significantly outperforms RNA on both synthetic and real-world datasets. While the focus of this paper is on convex problems, we obtain very encouraging results in accelerating the training of neural networks.

Proportional–Integral–Derivative Optimizer

The principle of feedback is simple: an input, $x^n$, is given, processed through some function, $f$, and then the output, $y^n$, becomes the next input, $x^{n+1}$, repeatedly. When the output is allowed to equal the next input, an identity exists so that $x^{n+1}=y^n$. Cobweb diagrams exploit this relationship, map the iterations, and reveal the behaviors of fixed points.

A PID controller continuously calculates an error $e(t)$, which is the difference between the desired optimal output and a measured system output, and applies a correction $u(t)$ to the system based on the proportional $(P)$, integral $(I)$, and derivative $(D)$ terms of $e(t)$. Mathematically, there is: $$u(t)= K_p e(t) + K_i\int_{0}^{t}e(x)\mathrm d x + K_d\frac{\mathrm d}{\mathrm dt}e(t) $$

where $u$ is the control signal and $e$ is the control error. The control signal is thus a sum of three terms:

  1. the P-term (which is proportional to the error);
  2. the I-term (which is proportional to the integral of the error);
  3. and the D-term (which is proportional to the derivative of the error).

The controller can also be parameterized as $$u(t)= K_p \left(e(t) + \frac{1}{T_i}\int_{0}^{t}e(x)\mathrm d x + T_d\frac{\mathrm d}{\mathrm dt}e(t)\right),\tag{PID}$$

where $T_i$ is called integral time and $T_d$ derivative time.

The proportional part acts on the present value of the error, the integral represents an average of past errors, and the derivative can be interpreted as a prediction of future errors based on linear extrapolation.

| Methods | Recursion | Integration |
| --- | --- | --- |
| Gradient Descent | $x^{t+1} = x^t -\alpha_t g(x^t)$ | ? |
| Momentum Methods | $x^{t+1} = x^t -\alpha_t g(x^t) + \rho_t(x^t - x^{t-1})$ | ? |
| Nesterov's Gradient Methods | $x^{t+1} =y^t -\alpha_t g(y^t), y^t = x^t + \rho_t(x^t -x^{t -1})$ | ? |
| Newton's Methods | $x^{t+1} = x^t - \alpha_t H_t^{-1}g(x^t)$ | ? |
| Mirror Gradient Methods | $\nabla h(x^{t+1})-\nabla h(x^t) = - \alpha_t \nabla f(x^t), x\in \mathbb{S}$ | ? |

By viewing the gradient $g(x^t)$ as error $e(t)$, and comparing it to PID controller, one can see that gradient descent only uses the present gradient to update the weights.

We rewrite the formula $x^{t+1} = x^t -\alpha_t g(x^t) + \rho_t(x^t - x^{t-1})$ as $$x^{t+1} = x^t -\alpha_t g(x^t) + \rho_t\underbrace{(x^t - x^{t-1})}_{-\alpha_{t-1}g(x^{t-1})+\rho_{t-1}(x^{t-1}-x^{t-2})}\\ = x^t -\alpha_t g(x^t) - \sum_{i=1}^{t-1}\left[\prod_{j=0}^{i-1}\rho_{t-j}\right]\alpha_{t-i}g(x^{t-i})+ \rho_1(x^1-x^0) .$$

One can see that the update of parameters relies on both the present gradient and the integral of past gradients. The only difference is that there is a decay $\prod_{j=0}^{i-1}\rho_{t-j}$ term in the I term.


PID optimizer

The proposed PID optimizer updates parameter $x$ at iteration $(t +1)$ by:

  • $V^{t+1}=\alpha V^t -r g(x^t)$
  • $D^{t+1}=\alpha D^t +(1-\alpha)(g(x^t)-g(x^{t-1}))$
  • $x^{t+1}=x^t+V^{t+1}+K_d D^{t+1}$

In a compact form, it is defined as follows: $$x^{t+1}=x^t+\alpha V^t -r g(x^t)+K_d [\alpha D^t +(1-\alpha)(g(x^t)-g(x^{t-1}))]\\ = \alpha (V^t + K_d D^t) +x^t -r g(x^t)+ K_d(1-\alpha)[g(x^t)-g(x^{t-1})] $$

which looks like an ensemble of inertia and relaxation techniques. (The authors claim to "connect classical control theory with deep network optimization, and improve up to 50% the efficiency over SGD-Momentum".)

For now it is still an empirical method without a convergence proof.
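
As an empirical illustration only, here is a minimal sketch of the PID optimizer updates above on a toy quadratic; all parameter values ($r$, $\alpha$, $K_d$) and the test problem are assumptions:

```python
import numpy as np

def pid_optimizer(grad, x0, r=0.02, alpha=0.9, Kd=0.2, n_iter=1000):
    """Sketch of the PID optimizer updates: V is the momentum (I-like) term,
    D a filtered difference of gradients (D term)."""
    x = np.asarray(x0, dtype=float)
    V = np.zeros_like(x)
    D = np.zeros_like(x)
    g_prev = grad(x)
    for _ in range(n_iter):
        g = grad(x)
        V = alpha * V - r * g                        # V^{t+1} = alpha*V^t - r*g(x^t)
        D = alpha * D + (1 - alpha) * (g - g_prev)   # D^{t+1} = alpha*D^t + (1-alpha)*(g^t - g^{t-1})
        x = x + V + Kd * D                           # x^{t+1} = x^t + V^{t+1} + Kd*D^{t+1}
        g_prev = g
    return x

# Toy quadratic 0.5*x^T A x - b^T x; compare with the exact minimizer.
A = np.diag([1.0, 2.0]); b = np.array([1.0, -2.0])
print(pid_optimizer(lambda x: A @ x - b, np.zeros(2)), np.linalg.solve(A, b))
```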

As in Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent, it is expected to be convergent in convex cases with suitably tuned parameters.

And it is simple to generalize to Newton's methods, where the gradients are replaced by rescaled gradients. In other words, the Newton type PID optimizer updates parameter $x$ at iteration $(t +1)$ by:

  • $V^{t+1}=\alpha V^t -r H^{-1}(x^t)g(x^t)$
  • $D^{t+1}=\alpha D^t +(1-\alpha)(H^{-1}(x^t)g(x^t)-H^{-1}(x^{t-1})g(x^{t-1}))$
  • $x^{t+1}=x^t+V^{t+1}+K_d D^{t+1}$.

The problem is that we have no theoretical proof, but it suggests how to ensemble different optimization methods or schemes to accelerate the convergence procedure.

To overcome the oscillation problem in the classical momentum-based optimizer, recent work associates it with the proportional-integral (PI) controller, and artificially adds a D term, producing a PID controller. It suppresses oscillation at the cost of introducing an extra hyper-parameter.

By further analyzing the underlying constrained optimization problem, we have found that the two camps of distributed optimization can actually be related through the framework of proportional-integral control. It turns out that consensus methods with constant step-sizes are akin to proportional control and dual-decomposition is akin to integral control. In much the same way that proportional and integral control can be combined to create a control method with damped response and zero steady state error, the two methods of distributed optimization can be combined to produce a damped response with zero error.

| Sample Recurrence Relation | Idea of Successive Approximations |
| --- | --- |
| $x^{k+1}=M(x^k)$ | $x^k=\underbrace{M(M(\cdots M(x^0)))}_{\text{k times}}$ |

Ensemble Methods of Optimization Algorithms

In machine learning, boosting algorithms can boost weak base learners to higher accuracy learners within the probably approximately correct (PAC) learning framework.

The linear coupling technique shows that it is possible to combine some slow optimization methods into a faster method. The interaction of optimization methods and boosting algorithms may bring benefits to both fields.

A unified framework is expected to ensemble different optimization methods to get a faster one. As shown in the section above, there are many acceleration schemes to combine the points generated by the same algorithm. Ensemble methods of optimization algorithms are supposed to combine the points generated by different optimization methods. Another theoretical consideration is to find some fractional order optimization methods. There are fractional differential equations and fractional derivatives, but no fractional order optimization methods. I mean that fractional order optimization methods are some special combination of integer order optimization methods, not only ones using fractional derivatives.

Note that there is one step of Newton's correction in Halley's method. It is a composite of optimization methods, beyond our scope of combination of optimization methods. For example, can we combine Halley's method and Newton's method to obtain a faster optimization method like linear coupling? Is there a non-linear coupling of different methods?

Before answering the above questions, it is best to clarify how the convergence speed relates to the order of derivative information used.

There is a strong connection between convergence speed and order of derivative information in Higher-Order Gradient Method.

Alternated inertia ensembles two schemes in fixed point iteration: relaxation and inertia.

An intuitive way is to combine different optimization methods of different orders via Anderson acceleration.

Dynamical Systems

We will focus on the optimization methods in the form of fixed point iteration and dynamical systems. It is to minimize the following function $$ f(x), x\in\mathbb{R}^p, \quad\nabla f(x) = g(x), \quad\nabla^2 f(x)= H(x). $$

| Iteration | ODE | Name |
| --- | --- | --- |
| $x^{k+1}=x^{k}-\alpha_k g(x^k)$ | $\dot{x}(t)=- g(x(t))$ | Gradient descent |
| $x^{k+1}=x^{k}-\alpha_kH_k^{-1} g(x^k)$ | $\dot{x}(t) =- H^{-1}(x)g(x(t))$ | Newton's method |

Like Newton interpolation, more points can compute higher order derivatives. The dynamics of accelerated gradient methods are expected to correspond to higher order differential equations.

Asymptotic Vanishing Damping

Some acceleration methods are discretizations of the corresponding dynamics with Asymptotic Vanishing Damping, so called by Hedy Attouch:

$$ \quad \quad \ddot{x}(t) + \frac{\alpha}{t} \dot{x}(t) + \nabla \Phi (x(t)) =0.\tag{AVD} $$

where $\Phi(x(t))$ is dependent on the objective function; $\alpha >0$ is constant in $\mathbb{R}$.

The fast minimization properties of the trajectories of the second-order evolution equation is also studied by Hedy's group in 2016: $$ \ddot{x}(t) + \frac{\alpha}{t} \dot{x}(t) + \nabla^2 \Phi (x(t))\dot{x}(t) + \nabla \Phi (x(t)) =0\tag{HDD} $$

When it comes to the numerical solution of differential equations, the task is to find a trajectory $x(t)$ such that the equations hold; in optimization, the optimum is our goal, so the focus is the limit $$\lim_{t\to t_0} x(t)=x^{\star},$$ if possible, where $x^{\star}$ optimizes the cost/objective function $f(x)$, typically with $t_0=\infty$.

Symplectic Optimization

Symplectic Optimization. It is difficult to generalize these methods to stochastic cases.

There is a wonderful summary DYNAMICAL, SYMPLECTIC AND STOCHASTIC PERSPECTIVES ON GRADIENT-BASED OPTIMIZATION given by Micheal I Jordan at ICM 2018.

Some new connections between dynamical systems and optimization have been found.


Weijie J. Su (joint with Bin Shi, Simon Du, and Michael Jordan) introduced a set of high-resolution differential equations to model, analyze, interpret, and design accelerated optimization methods.

$$ \ddot{x}(t) + 2\sqrt{\mu}\dot{x}(t) + \sqrt{s}\nabla^2 f(x) \dot{x}(t) + (1+\sqrt{\mu s})\nabla f(x) = 0, \tag{Su} $$


ADMM and Dynamics


The last but not least important question is how to rewrite a fixed point iteration as the discretization of some differential equation. What is more, it is the interaction and connection between numerical solutions of differential equations and optimization methods in the form of fixed point iterations that matters.


GRADIENTS AND FLOWS: CONTINUOUS OPTIMIZATION APPROACHES TO THE MAXIMUM FLOW PROBLEM

Optimization for differential equations

Stochastic Approximation

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.

The Robbins–Monro algorithm, introduced in 1951 by Herbert Robbins and Sutton Monro, presented a methodology for solving a root finding problem, where the function is represented as an expected value.

Assume that we have a function ${\textstyle M(\theta )}:\mathbb{R}\mapsto\mathbb{R}$, and a constant ${\textstyle \alpha \in\mathbb R}$, such that the equation ${\textstyle M(\theta )=\alpha }$ has a unique root at ${\textstyle \theta ^{\ast}}$.

It is assumed that while we cannot directly observe the function ${\textstyle M(\theta )}$, we can instead obtain measurements of the random variable ${\textstyle N(\theta )}$ where ${\textstyle \mathbb{E} [N(\theta )]=M(\theta )}$. The structure of the algorithm is to then generate iterates of the form: $${\theta}_{n+1}= {\theta}_{n} - a_{n}(N({\theta}_{n})-\alpha )\tag{R-M}$$ where $a_{1},a_{2},\dots$ is a sequence of positive step sizes.

This process can be considered a fixed point iteration with random noise. Different from deterministic methods, the sequences generated are also random.

Robbins and Monro proved (Theorem 2) that $\theta_n$ converges in $L^{2}$ (and hence also in probability) to $\theta^{\ast}$, and Blum later proved that the convergence actually holds with probability one, provided that:

  • ${\textstyle N(\theta )}$ is uniformly bounded,
  • ${\textstyle M(\theta )}$ is nondecreasing,
  • ${\textstyle M'(\theta ^{\ast})}$ exists and is positive, and
  • the sequence ${\textstyle a_{n}}$ satisfies $\sum_{n=0}^{\infty} a_{n}=\infty$ and $\sum_{n=0}^{\infty} a_{n}^{2} < \infty$.

A particular sequence of steps which satisfies these conditions, suggested by Robbins–Monro, has the form ${\textstyle a_{n}=a/n}$ for ${\textstyle a>0}$. Other series are possible, but in order to average out the noise in ${\textstyle N(\theta )}$, the above conditions must be met.
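
A minimal sketch of the Robbins–Monro recursion (R-M) with $a_n=a/n$; the noisy oracle below is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(0)

def robbins_monro(noisy_f, theta0, alpha=0.0, a=1.0, n_iter=5000):
    """Robbins-Monro iteration theta_{n+1} = theta_n - a_n*(N(theta_n) - alpha),
    with step sizes a_n = a/n."""
    theta = theta0
    for n in range(1, n_iter + 1):
        theta = theta - (a / n) * (noisy_f(theta) - alpha)
    return theta

# Example: find the root of M(theta) = theta - 2 from noisy observations
# N(theta) = M(theta) + Gaussian noise; the true root is theta* = 2.
print(robbins_monro(lambda t: (t - 2.0) + 0.5 * rng.standard_normal(), theta0=0.0))
```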

The basic ideas of the Robbins–Monro scheme can be readily modified to provide successive approximations for the minimum (or maximum) of a unimodal regression function, as was shown by Kiefer and Wolfowitz (1952) who introduced a recursive scheme of the form

$$\theta_{n+1}= {\theta}_{n} - a_{n}\Delta(x_{n})\tag{K-W}$$

to find the minimum $\theta$ of $M$ (or, equivalently, the solution of $dM/dx = 0$).

During the nth stage of the Kiefer–Wolfowitz scheme, observations $y_n^{(1)}$ and $y_n^{(2)}$ are taken at the design levels $x_n^{(1)} = x_n + c_n$ and $x_n^{(2)} = x_n - c_n$, respectively. In (K-W), $\Delta(x_n)=\frac{y_n^{(1)}-y_n^{(2)}}{2c_n}$, and $a_n$ and $c_n$ are positive constants such that $c_n \to 0$, $\sum (a_n/c_n)^2<\infty$ and $\sum a_n=\infty$.
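
A minimal sketch of the Kiefer–Wolfowitz recursion (K-W); the gain sequences $a_n=a/n$, $c_n=c/n^{1/3}$ and the noisy objective below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def kiefer_wolfowitz(noisy_M, x0, a=1.0, c=1.0, n_iter=2000):
    """Kiefer-Wolfowitz iteration x_{n+1} = x_n - a_n * Delta(x_n) with
    Delta(x_n) = (y_n^(1) - y_n^(2)) / (2*c_n), a_n = a/n, c_n = c/n^(1/3)."""
    x = x0
    for n in range(1, n_iter + 1):
        a_n, c_n = a / n, c / n ** (1 / 3)
        delta = (noisy_M(x + c_n) - noisy_M(x - c_n)) / (2 * c_n)
        x = x - a_n * delta
    return x

# Example: minimize M(x) = (x - 3)^2 observed with Gaussian noise; minimum at x* = 3.
print(kiefer_wolfowitz(lambda x: (x - 3.0) ** 2 + 0.1 * rng.standard_normal(), x0=0.0))
```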

Stochastic Gradient Descent

Stochastic gradient descent belongs to stochastic optimization, which is considered a generalization of gradient descent.

Stochastic gradient descent replaces the true gradient in gradient descent with a stochastic or estimated gradient. The step uses a stochastic gradient but may not be a descent step, so the name stochastic gradient methods may be more appropriate for such methods. The idea dates back to stochastic approximation in statistics.

It is aimed at the finite sum optimization problem, i.e. $$ \arg\min_{\theta}\frac{1}{n}\sum_{i=1}^{n}f(\theta|x_i) $$ where $n<\infty$, the functions $\{f(\theta|x_i)\}_{i=1}^{n}$ are in the same function family, $\{x_i\}_{i=1}^{n}\subset \mathbb{R}^{d}$ are constants, and $\theta\in\mathbb{R}^{p}$ is the variable vector.

The difficulty is that $p$, the dimension of $\theta$, is tremendous; in other words, the model is heavily parameterized. And the number $n$ is generally far larger than $p$, i.e. $n \gg p\gg d$. What is worse, the functions $\{f(\theta|x_i)\}_{i=1}^{n}$ are not convex in most cases.


The stochastic gradient method is defined as $$ \theta^{k+1}=\theta^{k}-\alpha_{k}\frac{1}{m}\sum_{j=1}^{m}\nabla f(\theta^{k}| x_{j}^{\prime}) $$ where each $x_{j}^{\prime}$ is drawn at random from $\{x_i\}_{i=1}^{n}$ and $m\ll n$ is the mini-batch size.

It is the fact that $m\ll n$ that makes it cheap to estimate the gradient of the finite sum objective function; its side effect is that the objective function does not always decrease. Thus it is also called mini-batch gradient descent.
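
A minimal sketch of the mini-batch stochastic gradient update above on a synthetic least-squares problem (batch size, learning rate and data are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad_i, theta0, n, batch_size=32, lr=0.1, n_iter=1000):
    """Mini-batch SGD: average per-sample gradients over a random batch each step."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        batch = rng.integers(0, n, size=batch_size)
        g = np.mean([grad_i(theta, j) for j in batch], axis=0)
        theta = theta - lr * g
    return theta

# Example: least squares (1/n) * sum_i (x_i^T theta - y_i)^2 on synthetic data.
X = rng.standard_normal((500, 3)); true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.standard_normal(500)
grad_i = lambda theta, j: 2 * (X[j] @ theta - y[j]) * X[j]
print(sgd(grad_i, np.zeros(3), n=500))
```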

There are fluctuations in the total objective function as gradient steps with respect to mini-batches are taken.



A heuristic proposal for avoiding the choice and for modifying the learning rate while the learning task runs is the bold driver (BD) method[^14]. The learning rate increases exponentially if successive steps reduce the objective function $f$, and decreases rapidly if an “accident” is encountered (if the objective function $f$ increases), until a suitable value is found. After starting with a small learning rate, its modifications are described by the following equation: $$ \alpha_{k+1}= \begin{cases} \rho \alpha_{k}, & {f(\theta^{k+1})< f(\theta^{k})}; \\ \eta^n \alpha_{k}, & {f(\theta^{k+1})> f(\theta^{k})} \text{ using ${\alpha}_k$}, \end{cases} $$

where $\rho$ is close to 1 such as $\rho=1.1$ in order to avoid frequent “accidents” because the objective function computation is wasted in these cases, $\eta$ is chosen to provide a rapid reduction ($\eta = 0.5$), and $n$ is the minimum integer such that the reduced rate $\eta^n$ succeeds in diminishing the objective function.[^13]

Besides the learning rate or step length, there is yet another hyperparameter: the batch size $m$ in each iteration.


The fact that the sample size is far larger than the dimension of the parameter, $n\gg p$, makes it expensive to compute the total objective function $f(\theta)=\sum_{i=1}^{n}f(\theta|{x}_i)$. Thus it is not practical to determine the learning rate $\alpha_k$ by line search, and most stochastic gradient methods aim to find a proper step length $\alpha_{k}$ that makes the iteration converge, at least in convex optimization. The variants of gradient descent such as momentum methods or mirror gradient methods have their stochastic counterparts.

  • It is simplest to set the step length a constant, such as ${\alpha}_k=3\times 10^{-3}, \forall k$.
  • There are decay schemes, i.e. the step length ${\alpha}_k$ diminishes such as ${\alpha}_k=\frac{\alpha}{k}$, where $\alpha$ is constant.
  • And another strategy is to tune the step length adaptively such as AdaGrad, ADAM.

$\color{lime}{PS}$: the step length $\alpha_k$ is called the learning rate in machine learning. Additionally, stochastic gradient descent is also named incremental gradient methods in some cases.

The Differences of Gradient Descent and Stochastic Gradient Descent

We can look at some examples to see the advantages of incremental methods, such as the estimation of a mean. Given $x_1, x_2, \dots, x_n$ the mean is estimated as $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$. If we now observe more data $y_1, y_2, \dots, y_m$ from the population, the mean can be estimated by $\frac{n\bar{x}}{m+n} + \frac{\sum_{j=1}^{m} y_j }{m+n} =\frac{n\bar{x}+\sum_{j=1}^{m} y_j}{m+n}$. It is not necessary to revisit $\{x_1, \dots, x_n\}$.

Another example is the Newton interpolation formula in numerical analysis. The task is to fit the function via polynomials given some points on the function, $(x_i , f(x_i)), i = 1,2, \dots, n$. The Newton form of the interpolating polynomial is given by

$$ P_n(x) = a_0 + a_1 (x-x_1) + a_2 (x-x_1)(x-x_2) + \cdots \\ + a_n(x-x_1)(x-x_2)(x-x_3)\cdots (x-x_n). $$

This form is incremental: if another point $(x_{n+1}, f(x_{n+1}))$ is observed, we can fit the function $f$ more precisely just by adding another term $a_{n+1}(x-x_1)(x-x_2)\cdots (x-x_n)(x-x_{n+1})$, where the coefficients $a_0, a_1,\cdots, a_n$ are determined by $f(x_1), f(x_2), \cdots, f(x_n)$.

See the following links for more information on stochastic gradient descent.

Convergence Analysis of Stochastic Gradient Methods

Progress in machine learning (ML) is happening so rapidly, that it can sometimes feel like any idea or algorithm more than 2 years old is already outdated or superseded by something better. However, old ideas sometimes remain relevant even when a large fraction of the scientific community has turned away from them. This is often a question of context: an idea which may seem to be a dead end in a particular context may become wildly successful in a different one. In the specific case of deep learning (DL), the growth of both the availability of data and computing power renewed interest in the area and significantly influenced research directions.

Leon Bottou

The stochastic gradient methods generate random/stochastic sequences, which are quite different from the classic methods. The convergence of stochastic methods raises another question: in what sense does the random sequence converge to an optimal point?

Let ${X_1,\cdots, X_n,\cdots}$ be a random sequence and $X$ be a random variable.

Convergence in probability/measure: $$\lim_{n\to\infty}P(|X_n-X|\leq \epsilon )=1,\quad\forall \epsilon>0$$

Convergence in mean square: $$\lim_{n\to\infty}\mathbb{E}({|X-X_n|}_2^2)=0$$

Convergence with probability 1 (almost surely): $$P(\lim_{n\to\infty}X_n=X)=1$$ Convergence in distribution: $$\lim_{n\to\infty}F_{X_n}(x)=F_X(x)$$ for every $x$ at which $F_X(x)$ is continuous.


ADAM and More

Adam-type algorithms

ADAM combines adaptive step strategies and momentum methods in some sense. It is widely used in deep learning training.

Formally, an Adam-type algorithm is of the following form: $$ m_t=\beta_{1,t} m_{t-1}+(1-\beta_{1,t})g_t,\quad \hat v_t = h_t(g_1,\cdots,g_t),\quad x_{t+1}=x_t-\alpha_t \frac{m_t}{\sqrt{\hat v_t}}, $$

where $x$ is the optimization variable, $g_t$ is a stochastic gradient at step $t$, $\beta_{1,t}$ is a non-increasing sequence, and $h_t$ is an arbitrary function that outputs a vector having the same dimension as $x$. This class of algorithms is very general, and it includes most of the popular training algorithms such as SGD (Ghadimi, et al., 2013), AdaGrad (Duchi et al., 2011), AMSGrad (Reddi et al., 2018), and Adam (Kingma et al., 2014) as special cases.

The Differences of Stochastic Gradient Descent and its Variants

YellowFin

We revisit SGD with Polyak's momentum, study some of its robustness properties and extract the design principles for a tuner, YellowFin. YellowFin automatically tunes a single learning rate and momentum value for SGD.

Natasha, Katyusha and Beyond

$\color{green}{PS}$: Zeyuan Allen-Zhu and others published much work on acceleration of stochastic gradient descent.

Variance Reduction Stochastic Gradient Methods

For general convex optimization, stochastic gradient descent methods can obtain an $O(1/\sqrt{T})$ convergence rate in expectation.

A more general version of SGD is the following $$\omega^{(t)}=\omega^{(t-1)}- g_t(\omega^{(t-1)}, \epsilon_t)$$

where $\epsilon_t$ is a random variable that may depend on $\omega^{(t-1)}$. And it is usually assumed that $\mathbb E(g_t(\omega^{(t-1)}, \epsilon_t)\mid \omega^{(t-1)}) = \nabla f(\omega^{(t-1)})$.

Randomness introduces variance: if $g_t(\omega^{(t-1)}, \epsilon_t)$ has a large variance, it will slow down the convergence.


Procedure SVRG

  • input: update frequency $m$ and learning rate $\eta$
  • initialization: $\tilde{\omega}_0$
  • for $s=1,2,\cdots$ do
    • $\tilde w=\tilde w_{s-1}$
    • $\tilde{ \mu}=\nabla f(\tilde w)=\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\tilde{w})$
    • $\omega_0=\tilde{\omega}$
    • Randomly pick $i_t \in \{1, ..., n\}$ and update the weight, repeating $m$ times: $\omega^{(t)}=\omega^{(t-1)}- \eta_t [\nabla f_{i_t}(\omega^{(t-1)})-\nabla f_{i_t}(\tilde\omega) + \tilde{ \mu}]$
    • option I: set $\tilde{\omega}_s={\omega}_m$
    • option II: set $\tilde{\omega}_s={\omega}_t$ for randomly chosen $t \in \{1, ..., m-1\}$
  • end for
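
A minimal sketch of the SVRG procedure above (using option I for the snapshot) on a synthetic least-squares problem; the step size, inner-loop length and data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def svrg(grad_i, full_grad, w0, n, eta=0.02, m=200, n_epochs=30):
    """SVRG sketch: the outer loop recomputes the full gradient at the snapshot,
    the inner loop takes variance-reduced stochastic steps."""
    w_tilde = np.asarray(w0, dtype=float)
    for _ in range(n_epochs):
        mu = full_grad(w_tilde)                    # full gradient at the snapshot
        w = w_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            w = w - eta * (grad_i(w, i) - grad_i(w_tilde, i) + mu)
        w_tilde = w                                # option I: last inner iterate
    return w_tilde

# Example: f(w) = (1/n) * sum_i (x_i^T w - y_i)^2 on synthetic data.
X = rng.standard_normal((200, 3)); w_true = np.array([1.0, -1.0, 2.0])
y = X @ w_true
grad_i = lambda w, i: 2 * (X[i] @ w - y[i]) * X[i]
full_grad = lambda w: 2 * X.T @ (X @ w - y) / len(y)
print(svrg(grad_i, full_grad, np.zeros(3), n=200))
```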

Escape From Saddle Points

Saddle points are considered a fundamental problem in high dimensional spaces when training a deep learning model.

Noisy gradient descent can find a local minimum of strict saddle functions in polynomial time.

It is the inherent nature of stochastic gradient methods to escape saddle points.

Stochastic Fixed Point Equations

The Feynman-Kac formula implies that every suitable classical solution of a semilinear Kolmogorov partial differential equation (PDE) is also a solution of a certain stochastic fixed point equation (SFPE). In this article we study such and related SFPEs. In particular, the main result of this work proves existence of unique solutions of certain SFPEs in a general setting. As an application of this main result we establish the existence of unique solutions of SFPEs associated with semilinear Kolmogorov PDEs with Lipschitz continuous nonlinearities even in the case where the associated semilinear Kolmogorov PDE does not possess a classical solution.

Stochastic Proximal Point Methods

Stochastic Coordinate Fixed-point Iteration

There are examples such as Forward-Backward, Douglas-Rachford,... for finding a zero of a sum of maximally monotone operators or for minimizing a sum of convex functions.

Random block-coordinate Krasnosel'skii–Mann iteration
  • for $n=0,\cdots$
    • for $i=1, \cdots, m$
      • $x_{i, n+1}=x_{i, n}+\epsilon_{i, n}\lambda_n(\mathrm T_i(x_{1,n},\cdots, x_{m, n})+a_{i, n}-x_{i, n})$

where

  • $x_0, a_n\in \mathbb H$ and $\mathbb H$ is a separable real Hilbert space,
  • $\epsilon_{n}$ is a random variable in $\{0,1\}^m\setminus \{\mathbf{0}\}$,
  • $\lambda_n\in (0, 1)$ with $\liminf \lambda_n>0$ and $\limsup \lambda_n<1$,
  • the mapping $\mathrm T:\mathbb H\to \mathbb H$, i.e. $x\mapsto (T_1x, \cdots, T_i x, \cdots, T_m x)$, is a nonexpansive operator.
Double-layer random block-coordinate algorithms
  • for $n=0, 1, \cdots$
    • $y_n =\mathrm R_n x_n + b_n$
    • for $i=1, \cdots, m$
      • $x_{i, n+1}=x_{i, n}+\epsilon_{i, n}\lambda_n(\mathrm T_{i, n}(y_n)+a_{i, n}-x_{i, n})$
Random block-coordinate Douglas-Rachford splitting
Random block-coordinate forward-backward splitting
Asynchronous coordinate fixed-point iteration

Stochastic Optimization

Stochastic optimization problems are so diverse that the field has become fragmented into a Balkanized set of communities with competing algorithmic strategies and modeling styles. It is easy to spend a significant part of a career mastering the subtleties of each perspective on stochastic optimization, without recognizing the common themes.

We have developed a unified framework that covers all of these books. We break all sequential decision problems into five components: state variables, decision variables, exogenous information variables, the transition function and the objective function, where we search over policies (functions) for making decisions. We have identified two core strategies for designing policies (policy search, and policies based on lookahead approximations), each of which further divides into two classes, creating four classes that cover all of the solution approaches.

Distributed Optimization Methods

Beginning around ten years ago, single-threaded CPU performance stopped improving significantly, due to physical limitations; it is the number of cores in each machine that continues to rise. Today we can buy 8-core phones, 64-core workstations, and 2.5k-core GPUs at affordable prices. On the other hand, most of our algorithms are still single-threaded, and because of this, their running time is about the same as it was ten years ago and will stay so in the future, say, ten or twenty years. To develop faster algorithms, especially for those large-scale problems that arise in signal processing, medical imaging, and machine learning, it is inevitable to consider parallel computing. Machine learning, especially deep learning, requires more powerful distributed and decentralized optimization methods.

Large scale supervised machine learning methods, which rely on gradients to improve their performance, need online near-real-time feedback from massive numbers of users.

Parallelizing Stochastic Gradient Descent

Elastic Stochastic Gradient Descent

It is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration, i.e. the algorithm allows the local variables to fluctuate further from the center variable by reducing the amount of communication between local workers and the master.

The loss function of Elastic-SGD is $$x^{\ast}=\arg\min_{x, x^1,\cdots, x^N}\frac{1}{N}\sum_{n=1}^{N} \left[f(x^n)+\frac{1}{2\rho }{|x-x^n|}^2\right].$$

Parle

Parle exploits the phenomenon of wide minima that has been shown to improve generalization performance of deep networks and trains multiple “replicas” of a network that are coupled to each other using attractive potentials. It requires infrequent communication with the parameter server and is well-suited to singlemachine-multi-GPU as well as distributed settings.

The method replaces the loss function by a smoother loss called the local entropy $$f_{\gamma}^{\beta}(x)=-\frac{1}{\beta}\log(G_{\gamma/\beta}\ast \exp(-\beta f(x)))$$ where $G_{\gamma/\beta}$ is the Gaussian kernel with variance $\gamma/\beta$ and $\ast$ denotes convolution.

Parle solves $$x^{\ast}=\arg\min_{x, x^1,\cdots, x^N}\sum_{n=1}^{N} \left[f_{\gamma}^{\beta}(x^n)+\frac{1}{2\rho N}{| x - x^n|}^2\right].$$

Asynchronous Stochastic Gradient Descent

Asynchronous Stochastic Gradient (shared memory):

  • Each thread repeatedly performs the following updates:
    • For $t = 1, 2, \cdots$
      • Randomly pick an index $i$
      • $x\leftarrow x - \nabla f_i(x)$.

Main trick: in shared memory systems, every thread can access the same parameter $x$.

Asynchronous Accelerated Stochastic Gradient Descent:

Hogwild

Hogwild allows processors access to shared memory with the possibility of overwriting each other's work.


Distributed Nesterov-like gradient methods

Parallel Coordinate Methods

Distributed Non-convex Optimization Problems

Stochastic ADMM

Linearly constrained stochastic convex optimization is given by $$ \min_{x,y}\mathbb{E}_{\vartheta}[F(x,\vartheta)]+h(y),\\ s.t. \quad Ax+By = b, x\in\mathbb{X}, y\in\mathbb{Y}, $$ where typically the expectation $\mathbb{E}_{\vartheta}[F(x,\vartheta)]$ is some loss function and ${h}$ is a regularizer to prevent over-fitting.

The first problem is that the distribution of $\vartheta$ is unknown as well as the expectation $\mathbb{E}_{\vartheta}[F(x,\vartheta)]$ in the objective function.

Modified Augmented Lagrangian

Linearize $f(x)$ at $x_k$ and add a proximal term:

$$ L_{\beta}^{k}(x,y,\lambda):= f(x_k)+\left<x - x_k, g_k\right>+h(y)-\left< \lambda, Ax+By-b\right>+\frac{\beta}{2}{|Ax + By-b|}_2^2 \\ +\frac{1}{2\eta_k}{|x-x_{k}|}^2 $$

where $g_k$ is a stochastic (sub)gradient of ${f}$.

  1. $x^{k+1}=\arg\min_{x\in\mathbf{X}}L_{\beta}^{k}(x,y^{\color{aqua}{k}},\lambda^{\color{aqua}{k}});$
  2. $y^{k+1}=\arg\min_{y\in\mathbf{Y}} L_{\beta}^{k}(x^{\color{red}{k+1}}, y, \lambda^{\color{aqua}{k}});$
  3. $\lambda^{k+1} = \lambda^{k} - \beta (Ax^{\color{red}{k+1}} + By^{\color{red}{k+1}}-b).$

Distributed ADMM

Let us assume that the function $F$ has the following decomposition, where each $x_i$ has its own dimension: $$ \min_{x} F(x)=\sum_{i=1}^{n}f_i(x_i), \\ s.t. \quad x\in\mathrm{C}. $$

We can reformulate the problem to get $$ \min_{x} F(x) + g(y), \\ s.t.\quad x=y $$

where $g(y)$ is the indicator function of the set $\mathrm{C}$.

We can define an augmented Lagrangian $$ L_{\beta}(x, y, \lambda) = F(x)+g(y) - \lambda^{T}(x-y) + \frac{\beta}{2}{|x-y|}_{2}^{2} \\ = \sum_{i=1}^{n} [f_i(x_i) - {\lambda}_i(x_i -y_i)+\frac{\beta}{2}(x_i - y_i)^2] + g(y) \\ = \sum_{i=1}^{n}L_{(\beta,i)}(x_i, y, \lambda_i). $$

We can split the optimization over $x_i$:

  1. $x_i^{k+1} =\arg\min_{x_i} L_{(\beta,i)}(x_i, y^{k}, \lambda_i^{k})\quad i=1,2,\cdots, n;$
  2. $y^{k+1} =\arg\min_{y} L_{\beta}(x^{\color{green}{k+1}}, y, \lambda^k);$
  3. $\lambda^{k+1}=\lambda^k-\beta (x^{\color{green}{k+1}} - y^{\color{green}{k+1}}).$

Then optimization over $x_i$ is done in parallel. Subsequently, the results are communicated back to a master node which performs the $y$ update (usually just a projection as in the example above) and returns back the result to other worker nodes. The update over $\lambda_i$ is again done independently.

Distributed Optimization: Analysis and Synthesis via Circuits

General form: $$\min \sum_{i=1}^{n}f_i(x_i),\\ s.t.\quad x_i= E_i y$$

Resource on Distributed Optimization Methods

Surrogate Optimization

It is a unified principle that we optimize an objective function via sequentially optimizing surrogate functions, as in EM and ADMM. In other words, these methods do not directly optimize the objective function.

It is obvious that the choice of optimization methods relies on the objective function. Surrogate losses transform the original problem $\min_{x} f(x)$ into successive tractable subproblems: $$ x^{k+1}=\arg\min_{x} Q_{k}(x). $$

We will call $Q_k(x)$ a surrogate function. A surrogate function is also known as a merit function.

As Fabian Pedregosa asserted, a good surrogate function should:

  • Approximate the objective function.
  • Easy to optimize.

This always leads to the convexification technique where the surrogate function is convex. Usually, $f(x^{k+1})\leq f(x^k)$, as in EM or MM. Generally, it is required that $\lim_{k\to\infty}x^k\in\arg\min_{x}f(x)$.

For example, we can rewrite gradient descent in the following form $$ x^{k+1}=\arg\min_{x} {f(x^k)+\left<\nabla f(x^k), x-x^k\right>+\frac{1}{2\alpha_k}{|x-x^k|}_2^2}. $$
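
Indeed, setting the gradient of this quadratic surrogate to zero gives $\nabla f(x^k)+\frac{1}{\alpha_k}(x-x^k)=0$, i.e. $x^{k+1}=x^{k}-\alpha_k\nabla f(x^k)$, which is exactly the gradient-descent update.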

In Newton’s method, we approximate the objective with a quadratic surrogate of the form: $$ Q_k(x) = f(x^k)+\left<\nabla f(x^k), x-x^k\right>+\frac{1}{2\alpha_k}(x-x^k)^{T}H_k(x-x^k). $$

Note that the Hessian matrix $H_k$ is supposed to be positive definite. Quasi-Newton methods approximate the Hessian matrix (or its inverse) by a symmetric positive-definite matrix. They can also be rewritten within the surrogate-function framework, where the surrogate is convex and consists of a linear term plus a quadratic term.

Note that momentum methods can be rewritten in the surrogate-function form: $$ x^{k+1}=x^{k}-\alpha_{k}\nabla_{x}f(x^k)+\rho_{k}(x^k-x^{k-1}) = \arg\min_{x}{f(x^k) + \left<\nabla f(x^k), x-x^k\right> + \frac{1}{2\alpha_k} {|x-x^k-\rho_k(x^k-x^{k-1})|}_2^2}. $$

$\color{aqua}{PS}$: How we can extend it to Nesterov gradient methods or stochastic gradient methods?

It is natural to replace the squared function with some non-negative function such as mirror gradient methods $$ x^{k+1} = \arg\min_{x} { f(x^k) + \left<\nabla f(x^k), x-x^k\right> + \frac{1}{\alpha_k} B(x,x^k)}. $$
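
For instance, with the Bregman divergence generated by the negative entropy, $B(x, x^k)=\sum_i x_i\log\frac{x_i}{x_i^k}$, this update has a closed form on the probability simplex (the exponentiated-gradient rule). A minimal sketch, with an objective chosen purely for illustration:

```python
import numpy as np

def mirror_descent_simplex(grad, x0, alpha=0.1, iters=500):
    """Mirror descent with the entropic Bregman divergence (exponentiated gradient).
    Each step solves  argmin_x <grad f(x_k), x> + (1/alpha) * KL(x || x_k)
    over the probability simplex, which has the closed form used below."""
    x = x0.copy()
    for _ in range(iters):
        w = x * np.exp(-alpha * grad(x))
        x = w / w.sum()                    # renormalize back onto the simplex
    return x

# Illustrative objective: f(x) = 0.5*||x - p||^2 with p in the simplex.
p = np.array([0.6, 0.3, 0.1])
x_star = mirror_descent_simplex(lambda x: x - p, np.ones(3) / 3)
print(np.round(x_star, 3))                 # close to p, the constrained minimizer
```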

More generally, an auxiliary function $g_k(x)$ may replace the Bregman divergence $B(x, x^k)$, where the auxiliary functions $g_k(x)$ satisfy $g_k(x)\geq 0$ and $g_k(x^{k-1})=0$.

The parameter $\alpha_k$ is chosen so that the surrogate function is convex, observing that the curvature of its quadratic term is controlled by the factor $\frac{1}{\alpha_k}$. Gradients are used in these methods to construct the surrogate function. There are still two problems in unconstrained optimization:

  1. If the cost function is not smooth or differentiable, such as the absolute value function, the gradient is not available, so it is a problem to construct a convex surrogate function without gradients;
  2. On the other hand, there is no unified principle for constructing a convex surrogate function if we know more information from higher-order derivatives.

Practically, we rarely meet pure black box models; rather, we know something about the structure of the underlying problem. One possible strategy is:

  1. approximate the nonsmooth objective function by a smooth function;
  2. optimize the smooth approximation instead (using, e.g., Nesterov’s accelerated method).

A convex function $f$ is called $(\alpha, \beta)$-smoothable if, for any $\mu > 0, \exists$ convex function $f_{\mu}$ s.t.

  • $f_{\mu}(x) \leq f(x) \leq f_{\mu}(x) + \beta\mu, \forall x$
  • $f_{\mu}(x)$ is $\frac{\alpha}{\mu}$ smooth.

Moreau envelope (or Moreau-Yosida regularization) of a convex function $f$ with parameter $\mu > 0$ is defined as $$ M_{\mu f}(x)=\inf_{z}{f(z) + \frac{1}{2\mu}{|z-x|}_2^2}. $$

Minimizing $f$ and $M_{\mu f}$ is equivalent, and $prox_{\mu f} (x)$ is the unique point that achieves the infimum defining $M_{\mu f}$, i.e.

$$ M_{\mu f}(x)=f(prox_{\mu f} (x)) + \frac{1}{2\mu}{|prox_{\mu f} (x)-x|}_2^2. $$

The Moreau envelope $M_{\mu f}$ is continuously differentiable with gradient $$ \nabla M_{\mu f}(x) = \frac{1}{\mu}(x - prox_{\mu f} (x)). $$

This means $$ prox_{\mu f} (x)=x-\mu \nabla M_{\mu f}(x), $$

i.e., $prox_{\mu f} (x)$ is a gradient step for minimizing $M_{\mu f}$.
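
As a concrete check of these identities, a short sketch for $f(x)=|x|$, whose proximal operator is the soft-thresholding map; the finite-difference comparison is only an illustration.

```python
import numpy as np

mu = 0.5

def prox_abs(x, mu):
    """prox_{mu*f}(x) for f = |.| : soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def moreau_env(x, mu):
    """M_{mu f}(x) = f(prox(x)) + (1/(2*mu)) * (prox(x) - x)^2 for f = |.|."""
    p = prox_abs(x, mu)
    return np.abs(p) + (p - x) ** 2 / (2 * mu)

x = 1.3
grad_formula = (x - prox_abs(x, mu)) / mu                       # (1/mu)(x - prox(x))
grad_numeric = (moreau_env(x + 1e-6, mu) - moreau_env(x - 1e-6, mu)) / 2e-6
print(grad_formula, grad_numeric)           # both close to 1.0 = sign(x) here
```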

The Fenchel conjugate of a function ${h}$ is the function $h^{\star}$ defined by

$$ h^{\star}(x)=\sup_{z} {\left<z, x\right>-h(z)}. $$
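
For example, for $h(z)=\frac{1}{2}{|z|}_2^2$ the supremum is attained at $z=x$, so $h^{\star}(x)=\frac{1}{2}{|x|}_2^2$: the squared Euclidean norm is its own conjugate. For closed convex $f$, the conjugate also links back to the Moreau envelope through the identity $M_{\mu f}=\left(f^{\star}+\frac{\mu}{2}{|\cdot|}_2^2\right)^{\star}$.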


Charles Byrne gives a unified treatment of some iterative optimization algorithms such as auxiliary function methods.


He is a featured author of CRC Press and a professor at UML (the University of Massachusetts Lowell).

Auxiliary-Function Methods minimize the function $$ G_k(x) = f(x) + g_k(x) $$ over $x\in \mathbb{S}$ for $k=1,2,\cdots$, where $f(x):\mathbb{S}\mapsto\mathbb{R}$ and $g_k(x)$ is an auxiliary function. Unlike the previous methods, we do not necessarily linearize the cost function.

Auxiliary-Function methods (AF methods) include

  • Barrier- and Penalty-Function Algorithms such as sequential unconstrained minimization (SUM) methods, interior-point methods, and exterior-point methods;
  • Proximal Minimization Algorithms such as majorization minimization.

And an AF method is said to be in the SUMMA class if the SUMMA Inequality holds:

$$ G_k(x) - G_k(x^k)\geq g_{k+1}(x), \forall x\in \mathbb{S}. $$

Proximal minimization algorithms using Bregman distances, called here PMAB, minimize the function

$$ G_k(x) = f(x) + B(x, x^{k-1}),\forall x\in \mathbb{S}. $$

The forward-backward splitting (FBS) method minimizes the function $$f_1(x) + f_2(x)$$ where both functions are convex and $f_2(x)$ is differentiable with an L-Lipschitz continuous gradient in the Euclidean norm. The iterative step of the FBS algorithm is $$ x^{k} = \operatorname{prox}_{\gamma f_1}(x^{k-1}-\gamma \nabla f_2(x^{k-1})). $$
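
A minimal sketch of this iteration for the lasso, assuming $f_2(x)=\frac{1}{2}{|Ax-b|}_2^2$ (so ${L}$ is the largest eigenvalue of $A^{T}A$) and $f_1(x)=\lambda{|x|}_1$, whose proximal map is soft thresholding; the data and regularization level below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 40, 60
A = rng.normal(size=(m, n))
x_true = np.zeros(n); x_true[:4] = [1.0, -0.5, 2.0, -1.5]
b = A @ x_true + 0.01 * rng.normal(size=m)
lam = 0.1 * np.max(np.abs(A.T @ b))          # illustrative regularization level

gamma = 1.0 / np.linalg.norm(A, 2) ** 2      # 0 < gamma <= 1/L

def prox_l1(v, t):
    """Proximal map of t*||.||_1: soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(n)
for _ in range(1000):
    grad = A.T @ (A @ x - b)                 # gradient of the smooth part f_2
    x = prox_l1(x - gamma * grad, gamma * lam)   # forward (gradient) + backward (prox)

print(np.count_nonzero(x))                   # many coordinates are exactly zero
```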

It is equivalent to minimize $$ G_k(x) = f(x) + \frac{1}{2\gamma} {|x-x^{k-1}|}_2^2-B(x, x^{k-1}), $$

where $B(x,x^{k-1})=f_2(x)-f_2(x^{k-1})-\left<\nabla f_2(x^{k-1}),x-x^{k-1}\right>$ and $0 < \gamma\leq \frac{1}{L}$.

Alternating Minimization (AM) can be regarded as a coordinate optimization method with 2 blocks:

  • $p^{n+1}=\arg\min_{p\in P} f(p,q^n)$,
  • $q^{n+1}=\arg\min_{q\in Q} f(p^{n+1},q)$,

where $f(p, q), p\in P, q\in Q$ is the objective function. It is proved that the sequence $f(p^n, q^n)$ converges to the minimum if the Five-Point Property holds: $$ f(p, q) + f(p, q^{n-1}) \geq f(p, q^{n}) + f(p^n, q^{n-1}). $$

For each $p$ in the set $P$, define $q(p)$ as a member of $Q$ for which $f(p; q(p)) \leq f(p; q)$ for all $q \in Q$. Let $\hat{f}(p) = f(p; q(p))$. At the $n$-th step of AM we minimize $$G_n(p)=f(p; q(p))+[f(p; q^{n-1})-f(p; q(p))]$$ to get $p^n$, where $g_n(p)=f(p; q^{n-1})-f(p; q(p))$ is the auxiliary function. So we can write $G_n(p)= f(p;q(p))+g_n(p)$.

See Iterative Convex Optimization Algorithms; Part Two: Without the Baillon–Haddad Theorem or Alternating Minimization, Proximal Minimization and Optimization Transfer Are Equivalent for more information on AF methods.

Expectation-Maximization can be classified as an AF method.

  • $Q(\theta|\theta^{(t)})=\mathbb{E}(\ell(\theta|Y_{obs}, Z)|Y_{obs},\theta^{(t)})$;
  2. $\theta^{(t+1)}=\arg\max_{\theta} Q(\theta|\theta^{(t)})$.

The function $Q(\theta|\theta^{(t)})$ is the conditional expectation of the complete-data log-likelihood $\ell(\theta|Y_{obs}, Z)$ given $(Y_{obs}, \theta^{(t)})$.
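
A minimal sketch of these two steps for a two-component, one-dimensional Gaussian mixture with known unit variances; the synthetic data, initialization, and parametrization are illustrative assumptions rather than part of the text above.

```python
import numpy as np

rng = np.random.default_rng(2)
# Observed data from a two-component mixture (unit variances, latent labels Z hidden)
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 200)])

pi, mu1, mu2 = 0.5, -1.0, 1.0              # initial guess for theta
normal = lambda v, m: np.exp(-0.5 * (v - m) ** 2) / np.sqrt(2 * np.pi)

for t in range(100):
    # E-step: posterior responsibilities of component 1 given y and theta^(t)
    w1 = pi * normal(y, mu1)
    w2 = (1.0 - pi) * normal(y, mu2)
    r = w1 / (w1 + w2)
    # M-step: maximize Q(theta | theta^(t)); closed form for this model
    pi = r.mean()
    mu1 = (r * y).sum() / r.sum()
    mu2 = ((1.0 - r) * y).sum() / (1.0 - r).sum()

print(round(pi, 2), round(mu1, 2), round(mu2, 2))   # near 0.6, -2.0, 3.0
```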


Optimization using surrogate models applied to Gaussian Process models (Kriging).

Relaxation and Convexification

Block Coordinate Descent

Methods such as ADMM and proximal gradient methods do not optimize the cost function directly. For example, we want to minimize the following cost function $$ f(x,y) = g(x) + h(y) $$

with or without constraints. Specifically, if the cost function is additively separable, $f(x) = f_1(x_1) + f_2(x_2) + \cdots + f_n(x_n)$, we would rather minimize the sub-functions or component functions $f_i(x_i), i=1,2,\cdots, n$ than the cost function itself:

$$ \min_{x} \sum_{i} f_i(x_i) = \sum_{i}\min_{x_i}{f_i(x_i)}. $$

ADMM and proximal gradient methods split the cost function into 2 blocks, of which one is differentiable and smooth while the other may not be differentiable. In other words, we can use them to solve some non-smooth optimization problems. However, what if no constraints apply to the optimization problem?

Coordinate descent aims to minimize the following cost function $$ f(x) = g(x) +\sum_{i=1}^{n} h_i(x_i) $$

where $g(x)$ is convex, differentiable and each $h_i(x)$ is convex. We can use coordinate descent to find a minimizer: start with some initial guess $x^0$, and repeat for $k = 1, 2, 3, \dots$:


  1. $x_{1}^{k} \in\arg\min_{x_1}f(x_1, x_2^{k-1}, x_3^{k-1}, \dots, x_n^{k-1});$
  2. $x_{2}^{k} \in\arg\min_{x_2}f(x_1^{\color{red}{k}}, x_2,x_3^{k-1},\dots, x_n^{k-1});$
  3. $x_{3}^{k} \in\arg\min_{x_3}f(x_1^{\color{red}{k}}, x_2^{\color{red}{k}},x_3,\dots, x_n^{k-1});$
  4. $\vdots$
  5. $x_{n}^{k} \in\arg\min_{x_n}f(x_1^{\color{red}{k}}, x_2^{\color{red}{k}},x_3^{\color{red}{k}},\dots, x_n).$

It can be extended to block coordinate descent (BCD) if the variables ${x_1, x_2, \dots, x_n}$ can be partitioned into separable blocks.
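
As an illustration of this scheme, here is a minimal sketch of cyclic coordinate descent for the lasso, assuming $g(x)=\frac{1}{2}{|Ax-b|}_2^2$ and $h_i(x_i)=\lambda|x_i|$, so that each one-dimensional subproblem has a closed-form soft-thresholding solution; the synthetic data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, lam = 50, 20, 0.1
A = rng.normal(size=(m, n))
x_true = np.zeros(n); x_true[[0, 5, 9]] = [2.0, -1.0, 0.5]
b = A @ x_true + 0.01 * rng.normal(size=m)

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

x = np.zeros(n)
col_sq = (A ** 2).sum(axis=0)            # ||A_j||^2 for each column
r = b - A @ x                            # running residual b - Ax
for sweep in range(100):
    for j in range(n):
        r = r + A[:, j] * x[j]           # remove coordinate j from the residual
        # exact minimization over x_j alone: a soft-thresholded least-squares step
        x[j] = soft(A[:, j] @ r, lam) / col_sq[j]
        r = r - A[:, j] * x[j]           # put coordinate j back into the residual
print(np.round(x, 2))                    # sparse, close to x_true on its support
```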


Nesterov’s Random Coordinate Descent Algorithms

Block Splitting Methods

In particular, if the problem data includes a large linear operator or matrix $A$, the method allows for handling each subblock of $A$ on a separate machine. The approach works as follows. First, we define a canonical problem form called graph form, in which we have two sets of variables $x$ and $y$ related by a linear operator $A$, such that the objective function is separable across these two sets of variables. Many problems are easily expressed in graph form, including cone programs and a wide variety of regularized loss minimization problems from statistics, like logistic regression, the support vector machine, and the lasso. Next, we describe graph projection splitting, a form of Douglas-Rachford splitting or the alternating direction method of multipliers, to solve graph form problems serially. Finally, we derive a distributed block splitting algorithm based on graph projection splitting. In a statistical or machine learning context, this allows for training models exactly with a huge number of both training examples and features, such that each processor handles only a subset of both. To the best of our knowledge, this is the only general purpose method with this property. We present several numerical experiments in both the serial and distributed settings.

Block Relaxation Methods

The methods discussed in the book Block Relaxation Methods in Statistics are special cases of what we shall call block-relaxation methods, although other names such as decomposition or nonlinear Gauss-Seidel or ping-pong or seesaw methods have also been used.

block relaxation methods

In a block relaxation method we minimize a real-valued function of several variables by partitioning the variables into blocks. We choose initial values for all blocks, and then minimize over one of the blocks, while keeping all other blocks fixed at their current values. We then replace the values of the active block by the minimizer, and proceed by choosing another block to become active. An iteration of the algorithm steps through all blocks in turn, each time keeping the non-active blocks fixed at current values, and each time replacing the active blocks by solving the minimization subproblems. If there are more than two blocks there are different ways to cycle through the blocks. If we use the same sequence of active blocks in each iteration then the block method is called cyclic.

In the special case in which blocks consist of only one coordinate we speak of the coordinate relaxation method or the coordinate descent (or CD) method. If we are maximizing then it is coordinate ascent (or CA). The cyclic versions are CCD and CCA.

Augmentation Methods

Augmentation and Decomposition Methods. Note: augmentation duality. If $f(x)=\min_{y\in\mathcal{Y}}g(x,y)$ and $$ h(y)=\min_{x\in\mathcal{X}} g(x,y), $$ then $$ \min_{x\in\mathcal{X}}f(x)=\min_{x\in\mathcal{X}}\min_{y\in\mathcal{Y}}g(x,y)=\min_{y\in\mathcal{Y}}\min_{x\in\mathcal{X}}g(x,y)=\min_{y\in\mathcal{Y}}h(y). $$

Alternating Conditional Expectations

The alternating descent conditional gradient method

A ubiquitous prior in modern statistical signal processing asserts that an observed signal is the noisy measurement of a few weighted sources. In other words, compared to the entire dictionary of possible sources, the set of sources actually present is sparse. In many cases of practical interest the sources are parameterized and the measurement of multiple weighted sources is linear in their individual measurements.

As a concrete example, consider the idealized task of identifying the aircraft that lead to an observed radar signal. The sources are the aircraft themselves, and each is parameterized by, perhaps, its position and velocity relative to the radar detectors. The sparse inverse problem is to recover the number of aircraft present, along with each of their parameters.

Convex Relaxations

Convex relaxations are one of the most powerful techniques for designing polynomial time approximation algorithms for NP-hard optimization problems such as Chromatic Number, MAX-CUT, Minimum Vertex Cover etc. Approximation algorithms for these problems are developed by formulating the problem at hand as an integer program.

Majorization Method

Suppose we can find a function $h$ such that $g(\alpha)\leq h(\alpha)$ for all $0<\alpha <\overline{\alpha}$ and such that $g(0)=h(0)$, where $g(\alpha):=f(x^{(k)}-\alpha p^{(k)})$ is the objective along the search direction $p^{(k)}$.

Now set $$\alpha^{(k)} :=\arg\min_{0\leq\alpha\leq\overline{\alpha}} h(\alpha).$$

Then the sandwich inequality says $$g{(\alpha^{(k)})}\leq h(\alpha^{(k)})\leq h(0)=g(0),$$

and thus $f(x^{(k)}-\alpha^{(k)}p^{(k)}) \leq f(x^{(k)})$.

Quadratic Majorization

A quadratic $g$ majorizes $f$ at $y$ on $\mathbb{R}^n$ if $g(y)=f(y)$ and $g(x)\geq f(x)$ for all $x$. If we write it in the form $$g(x)=f(y)+(x-y)'b+\frac{1}{2} (x-y)'A(x-y)$$
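
As a simple instance of quadratic majorization, $|x|$ is majorized at any $y\neq 0$ by $\frac{x^2}{2|y|}+\frac{|y|}{2}$ (equality at $x=y$, by the AM-GM inequality). Below is a hedged sketch of the resulting majorize-minimize iteration for the illustrative scalar problem $\min_{x}\frac{1}{2}(x-a)^2+\lambda|x|$; the data are assumptions for the example.

```python
# Majorize-minimize for f(x) = 0.5*(x - a)^2 + lam*|x|, using the quadratic
# majorizer |x| <= x^2/(2*|y|) + |y|/2, which is tight at x = y.  Each MM step
# then minimizes a quadratic and has the closed form used below.
a, lam = 1.5, 0.4
x = a                      # any nonzero starting point
for k in range(100):
    # argmin_x 0.5*(x - a)^2 + lam * x^2 / (2*|x_k|)  =  a*|x_k| / (|x_k| + lam)
    x = a * abs(x) / (abs(x) + lam)
print(round(x, 4))         # approaches the soft-threshold solution a - lam = 1.1
```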

Lagrangian Relaxation

Lagrangian relaxation (Fisher, 2004) is a technique that exploits the fact that many difficult MILP problems can be seen as relatively easy problems complicated by a set of side constraints. These constraints can be transferred to the objective function with associated parameters (Lagrangian multipliers), which impose a penalty on violations. The relaxed problem provides an upper bound (for a maximization problem) on the optimal value of the original problem. To obtain the tightest bound, a problem on the Lagrangian multipliers (Lagrangian dual) needs to be solved. This problem is typically solved with an iterative method called subgradient method, which uses subgradients of the objective function to update the Lagrangian multipliers at each iteration.
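
As a sketch of the typical multiplier update (stated for a generic maximization problem with relaxed constraints $Ax\leq b$, purely as an illustration): the Lagrangian dual is $Z_D(\lambda)=\max_{x\in X}\{c^{T}x+\lambda^{T}(b-Ax)\}$ with $\lambda\geq 0$, a subgradient of $Z_D$ at $\lambda^{k}$ is $b-Ax^{k}$, where $x^{k}$ solves the inner (easy) problem, and the projected subgradient step is $$ \lambda^{k+1}=\max\{0,\ \lambda^{k}-s_k(b-Ax^{k})\}, $$ with step sizes $s_k$ typically chosen so that $s_k\to 0$ and $\sum_k s_k=\infty$.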

Deep Relaxation

In this paper we establish a connection between non-convex optimization methods for training deep neural networks and nonlinear partial differential equations (PDEs). Relaxation techniques arising in statistical physics which have already been used successfully in this context are reinterpreted as solutions of a viscous Hamilton-Jacobi PDE. Using a stochastic control interpretation allows us to prove that the modified algorithm performs better in expectation than stochastic gradient descent. Well-known PDE regularity results allow us to analyze the geometry of the relaxed energy landscape, confirming empirical evidence. The PDE is derived from a stochastic homogenization problem, which arises in the implementation of the algorithm. The algorithms scale well in practice and can effectively tackle the high dimensionality of modern neural networks.


Convexification

nonconvex

In order for primal-dual methods to be applicable to a constrained minimization problem, it is necessary that restrictive convexity conditions are satisfied. A nonconvex problem can be convexified and transformed into one which can be solved with the aid of primal-dual methods.

We consider problems that are non-convex in the input norm, which is a semi-continuous variable that can be zero or lower- and upper-bounded. Using lossless convexification, the non-convex problem is relaxed to a convex problem whose optimal solution is proved to be optimal almost everywhere for the original problem. The relaxed problem can be solved using second-order cone programming, which is a subclass of convex optimization for which there exist numerically reliable solvers with convergence guarantees and polynomial time complexity. This is the first lossless convexification result for mixed-integer optimization problems. An example of spacecraft docking with a rotating space station corroborates the effectiveness of the approach and features a computation time almost three orders of magnitude shorter than a mixed-integer programming formulation.

Successive Convex Approximation

In this paper, we study an alternative inexact BCD approach which updates the variable blocks by successively minimizing a sequence of approximations of $f$ which are either locally tight upper bounds of $f$ or strictly convex local approximations of $f$. We focus on characterizing the convergence properties for a fairly wide class of such methods, especially for the cases where the objective functions are either non-differentiable or nonconvex.

Sequential Quadratic Programming

Gradient Free Optimization Methods

As shown in Principle of Optimal Design, non-gradient methods are classified into 3 classes:

We organize the discussion of non-gradient methods in three parts, direct search methods, heuristic methods, and black-box methods. Direct search methods rely on ranking the objective function values rather than using the objective values themselves. Heuristic methods use some random process to generate new candidate solutions, and so the iteration paths and even the solutions obtained can change each time we perform a search. Black-box methods deal with problems that have no known model function structure that we can exploit. For example, functions generated by simulations have no discernible mathematical properties (like convexity), and so we refer to them as black-box functions. In this sense, all nongradient methods can be used for black-box problems. The two black-box methods described in this chapter were created to address design problems with expensive simulations, and so their main goal is to find an optimum quickly with few function evaluations.

Derivative-free optimization (DFO) algorithms differ in the way they use the sampled function values to determine the new iterate. One class of methods constructs a linear or quadratic model of the objective function and defines the next iterate by seeking to minimize this model inside a trust region. We pay particular attention to these model-based approaches because they are related to the unconstrained minimization methods described in earlier chapters. Other widely used DFO methods include the simplex-reflection method of Nelder and Mead, pattern-search methods, conjugate-direction methods, and simulated annealing.

Heuristic methods will be introduced in computational intelligence as well as Bayesian Optimization.

Let us start with the example and suppose that we want to $$ \arg\min_{x}f(x)=x^{2}+4\sin(2x).\tag{1} $$

The objective function $f(x)$, a non-convex function, has many local minimizers and extrema. The function (1) is upper bounded by $x^2+4$ and lower bounded by $x^2 - 4$.


Another insightful example is to minimize the following cost function: $$ x^{2}+4\sin(2x) - 1001 \color{red}{\mathbb{I}_{\sqrt{2}}}(x) \tag{2} $$ where the last term $\color{red}{\mathbb{I}_{\sqrt{2}}}(x)$ is equal to 1 when $x=\sqrt{2}$ and 0 otherwise, i.e. the indicator of the single point $x=\sqrt{2}$. It is almost equal to the first function (1) except at the point $x=\sqrt{2}$. The minimal value of the above function is $2+4\sin(2\sqrt{2})-1001$, attained at $x=\sqrt{2}$. These two functions are two different kinds of non-convex functions.

Optimization and Assumptions @ Freemind|Test functions for optimization

Zeroth-Order Oracle

Zeroth-Order optimization is increasingly embraced for solving big data and machine learning problems when explicit expressions of the gradients are difficult or infeasible to obtain. It achieves gradient-free optimization by approximating the full gradient via efficient gradient estimators.
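
One common estimator of this kind uses two function evaluations along a random direction, $\hat{g}=\frac{f(x+\mu u)-f(x-\mu u)}{2\mu}u$ with $u\sim\mathcal{N}(0,I)$. A minimal sketch that plugs it into a plain descent loop on an illustrative quadratic; the objective, step size, and sample count are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    """Illustrative smooth objective with minimizer at the all-ones vector."""
    return 0.5 * np.sum((x - 1.0) ** 2)

def zo_gradient(f, x, mu=1e-3, samples=20):
    """Two-point random-direction gradient estimator (zeroth-order oracle)."""
    g = np.zeros_like(x)
    for _ in range(samples):
        u = rng.normal(size=x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / samples

x = np.zeros(5)
for _ in range(300):
    x = x - 0.05 * zo_gradient(f, x)       # gradient-free descent step
print(np.round(x, 2))                      # approaches the minimizer (all ones)
```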

Derivative-Free Optimization

The absence of derivatives, often combined with the presence of noise or lack of smoothness, is a major challenge for optimization. This book explains how sampling and model techniques are used in derivative-free methods and how these methods are designed to efficiently and rigorously solve optimization problems. Introduction to Derivative-Free Optimization is the first contemporary comprehensive treatment of optimization without derivatives. This book covers most of the relevant classes of algorithms from direct search to model-based approaches. It contains a comprehensive description of the sampling and modeling tools needed for derivative-free optimization; these tools allow the reader to better understand the convergent properties of the algorithms and identify their differences and similarities. Introduction to Derivative-Free Optimization also contains analysis of convergence for modified Nelder Mead and implicit-filtering methods, as well as for model-based methods such as wedge methods and methods based on minimum-norm Frobenius models.

DFO is a Fortran package for solving general nonlinear optimization problems that have the following characteristics:

  • they are relatively small scale (less than 100 variables),
  • their objective function is relatively expensive to compute, and
  • derivatives of the objective function are not available and cannot be estimated efficiently.

There also may be some noise in the function evaluation procedures. Such optimization problems arise, for example, in engineering design, where the objective function evaluation is a simulation package treated as a black box.

Alongside derivative-based methods, which scale better to higher-dimensional problems, derivative-free methods play an essential role in the optimization of many practical engineering systems, especially those in which function evaluations are determined by statistical averaging, and those for which the function of interest is nonconvex in the adjustable parameters. We are in the process of developing an entire new family of surrogate-based derivative-free optimization schemes. The idea unifying this efficient and (under the appropriate assumptions) provably-globally-convergent family of schemes is the minimization of a search function which linearly combines a computationally-inexpensive “surrogate” (that is, an interpolation, or in some cases a regression, of recent function evaluations - we generally favor some variant of polyharmonic splines for this purpose), to summarize the trends evident in the data available thus far, with a synthetic piecewise-quadratic “uncertainty function” (built on the framework of a Delaunay triangulation of existing datapoints), to characterize the reliability of the surrogate by quantifying the distance of any given point in parameter space to the nearest function evaluations.

Nevergrad


Surrogate-based Optimization

Surrogate-Based Optimization

Surrogate-based optimization (Queipo et al. 2005, Simpson et al. 2008) represents a class of optimization methodologies that make use of surrogate modeling techniques to quickly find the local or global optima. It provides us a novel optimization framework in which the conventional optimization algorithms, e.g. gradient-based or evolutionary algorithms are used for sub-optimization(s). Surrogate modeling techniques are of particular interest for engineering design when high-fidelity, thus expensive analysis codes (e.g. Computation Fluid Dynamics (CFD) or Computational Structural Dynamics (CSD)) are used. They can be used to greatly improve the design efficiency and be very helpful in finding global optima, filtering numerical noise, realizing parallel design optimization and integrating simulation codes of different disciplines into a process chain. Here the term “surrogate model” has the same meaning as “response surface model”, “metamodel”, “approximation model”, “emulator” etc.

Delaunay-based derivative-free optimization

Like other Response Surface Methods, at each optimization step, the algorithm minimizes a metric combining an interpolation of existing function evaluations and a model of the uncertainty of this interpolation. By adjusting the respective weighting of these two terms, the algorithm incorporates a tunable balance between global exploration and local refinement; a rule to adjust this balance automatically is also presented. Unlike other methods, any well-behaved interpolation strategy may be used. The uncertainty model is built upon the framework of a Delaunay triangulation of existing datapoints in parameter space. A quadratic function which goes to zero at each datapoint is formed within each simplex of this triangulation; the union of each of these quadratics forms the desired uncertainty model. Care is taken to ensure that function evaluations are performed at points that are well situated in parameter space; that is, such that the simplices of the resulting triangulation have circumradii with a known bound. This facilitates well-behaved local refinement as additional function evaluations are performed.

Graduated Optimization

Another related method is graduated optimization, which is a global optimization technique that attempts to solve a difficult optimization problem by initially solving a greatly simplified problem, and progressively transforming that problem (while optimizing) until it is equivalent to the difficult optimization problem. Further, when certain conditions exist, it can be shown to find an optimal solution to the final problem in the sequence. These conditions are:

  • The first optimization problem in the sequence can be solved given the initial starting point.
  • The locally convex region around the global optimum of each problem in the sequence includes the point that corresponds to the global optimum of the previous problem in the sequence.

Kiefer-Wolfowitz Algorithm

In stochastic gradient descent, the estimated gradient is a partial sum of the population gradient, so it is necessary to compute the gradient of the sample function. The Kiefer-Wolfowitz algorithm is the gradient-free counterpart of stochastic gradient descent. It is a recursive scheme for approximating the minimum or maximum, of the form $$ x^{n+1} =x^{n} - a_n \Delta(x^n). $$

During the $n$-th stage, observations $y^{\prime\prime}$ and $y^{\prime}$ are taken at the design levels $x^{\prime\prime}=x^n+c_n$ and $x^{\prime}=x^n - c_n$, respectively, and $\Delta(x^n)=\frac{y^{\prime\prime} - y^{\prime}}{2c_n}$. Here $a_n$ and $c_n$ are positive constants such that $c_n\to 0$, $\sum_{n=0}^{\infty}(a_n/c_n)^2< \infty$, and $\sum_n a_n =\infty$.
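
A minimal sketch of this recursion for a noisy scalar objective, with the illustrative choices $a_n=1/n$ and $c_n=n^{-1/3}$ (which satisfy the conditions above); the objective and noise level are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(5)

def noisy_f(x):
    """Only noisy evaluations y = f(x) + noise are available; here f(x) = (x - 2)^2."""
    return (x - 2.0) ** 2 + 0.1 * rng.normal()

x = 0.0
for n in range(1, 5001):
    a_n = 1.0 / n
    c_n = n ** (-1.0 / 3.0)
    # central finite-difference estimate built from two noisy observations
    delta = (noisy_f(x + c_n) - noisy_f(x - c_n)) / (2 * c_n)
    x = x - a_n * delta
print(round(x, 2))        # close to the minimizer x = 2
```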

Multi-Level Optimization

Multi-Level Optimization optimizes a related cheap function $\hat{f}$ when the objective function $f$ is very expensive to evaluate.

Multi-level optimization methods repeat three steps:

  1. Cheap minimization of the modified $\hat{f}$: $$y^{k}=\arg\min_{x\in \mathbb{R}^p} \hat{f}(x) + \left<v_k, x\right>.$$
  2. Use $y^{k}$ to give a descent direction: $$x^{k+1} = x^k -a_k(x^k - y^k).$$
  3. Set $v_{k+1}$ to satisfy first-order coherence: $$v_{k+1}=\frac{L_{\hat{f}}}{L_{f}} f^{\prime}(x^{k+1}) - \hat{f}^{\prime}(x^{k+1}).$$

The basic scheme of multilevel optimization

The simplified scheme of work for the multilevel optimization procedure can be represented as follows.

  1. Solving the optimization problem using a surrogate model. For this purpose, the method of indirect optimization based on self-organization (IOSO) is used. This method allows finding a single solution for single-objective optimization or the Pareto-optimal set of solutions for multi-objective problems.
  2. For the obtained solution the indicators of efficiency are updated using the high-fidelity analysis tools.
  3. The adjustment of a current search region is performed.
  4. The adjustment of the surrogate model is performed. Depending upon the particular features of the applied mathematical simulation, the adjustment procedure can be performed using several approaches. One such approach involves constructing non-linear corrective dependencies. This includes evaluation of the results obtained with different-fidelity analysis tools. The other possible approach is the application of nonlinear estimation of the surrogate model's internal parameters.
  5. Replacement of the surrogate model by the adjusted one and the return to step (1).

Reactive Search Optimization

Roberto Battiti and Mauro Brunato explain how reactive search optimization, the reactive affine shaker (RAS), and memetic algorithms work for optimization problems without gradients in Machine Learning plus Intelligent Optimization.



There are more topics on optimization, such as this site, and more courses on optimization: