
# Levenberg-Marquardt algorithm for nonlinear regression

We present a Python implementation of a regularized version of the Levenberg-Marquardt algorithm for nonlinear regression. Regularization is obtained by placing prior distributions (Gaussian or lognormal) on the model parameters. This is useful when few observations are available and we want to incorporate prior knowledge into the model estimation process.

## Nonlinear regression: generic problem statement

Let us consider a regression problem where a scalar target variable $y$ must be predicted based on a vector of observables $x \in \mathbb{R}^m$.

We assume that the dynamics are nonlinear and, specifically, that

$$y = f(x; \theta) + \varepsilon$$

where $\theta \in \mathbb{R}^p$ is a vector of unknown real parameters, $f$ is a known deterministic function, nonlinear in $\theta$, and $\varepsilon$ is a random noise term with distribution $\mathcal{N}(0, \sigma^2)$ for some positive and unknown value of $\sigma$.

If we have $N$ independent observations $(x_i, y_i)$, $i = 1, \dots, N$, we can estimate the value of $\theta$ by maximizing the log-likelihood. We can optionally weight some observations more or less than others by choosing weights $w_i > 0$ and assuming that $\varepsilon_i \sim \mathcal{N}(0, \sigma^2 / w_i)$ for each $i$.

Under these assumptions, and setting for simplicity of notation

$$r_i(\theta) = y_i - f(x_i; \theta), \qquad i = 1, \dots, N,$$

we have that the log-likelihood is given by

$$\log L(\theta, \sigma) = -\frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2}\sum_{i=1}^N \log w_i - \frac{1}{2\sigma^2}\sum_{i=1}^N w_i\, r_i(\theta)^2$$

Maximizing the log-likelihood is therefore equivalent to minimizing the following objective function (a weighted sum of squared residuals):

$$\mathrm{Obj}(\theta) = \frac{1}{2}\sum_{i=1}^N w_i\, r_i(\theta)^2$$

Moreover, the maximum likelihood estimate for $\sigma$ is

$$\hat\sigma^2 = \frac{1}{N}\sum_{i=1}^N w_i\, r_i(\hat\theta)^2.$$
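To make the notation concrete, here is a minimal `numpy` sketch of these quantities. The example model `f` and the helper names are illustrative assumptions, not the repository's API:

```python
import numpy as np

# Hypothetical example model, not taken from the repository:
# f(x; theta) = theta[0] * exp(-theta[1] * x), which is nonlinear in theta[1].
# For simplicity, each observable x_i is a scalar here.
def f(x, theta):
    return theta[0] * np.exp(-theta[1] * x)

def residuals(theta, x, y):
    """r_i(theta) = y_i - f(x_i; theta)."""
    return y - f(x, theta)

def objective(theta, x, y, w):
    """Weighted sum of squared residuals: Obj(theta) = 1/2 * sum_i w_i * r_i(theta)^2."""
    r = residuals(theta, x, y)
    return 0.5 * np.sum(w * r ** 2)

def sigma2_mle(theta, x, y, w):
    """Maximum likelihood estimate of sigma^2: (1/N) * sum_i w_i * r_i(theta)^2."""
    r = residuals(theta, x, y)
    return np.mean(w * r ** 2)
```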

## The Levenberg-Marquardt algorithm

The Levenberg-Marquardt algorithm computes the minimum of $\mathrm{Obj}$ iteratively, by calculating a series of local quadratic approximations.

The algorithm requires that an initial guess $\theta^{(0)}$ be provided for the unknown vector of parameters. At each iteration $k$, the objective function $\mathrm{Obj}$ is approximated locally in a neighbourhood of the current estimate $\theta^{(k)}$ with the following quadratic function (the approximation is only valid for small values of $\|\delta\|$):

$$\mathrm{Obj}(\theta^{(k)} + \delta) \approx q(\delta) = \frac{1}{2}\sum_{i=1}^N w_i \left( r_i(\theta^{(k)}) - \nabla_\theta f(x_i; \theta^{(k)})^T \delta \right)^2$$

The peculiarity here is that, thanks to the objective function's special form, we can obtain a local quadratic approximation by taking the first-order expansion of $f$ instead of the second-order expansion of the objective function itself (as we would be forced to do in the general case).

Defining for simplicity of notation

$$J_{ij} = \frac{\partial f}{\partial \theta_j}(x_i; \theta^{(k)}), \qquad W = \mathrm{diag}(w_1, \dots, w_N), \qquad r = \big( r_1(\theta^{(k)}), \dots, r_N(\theta^{(k)}) \big)^T,$$

we have that this quadratic function has a unique global minimum $\delta^*$ satisfying the equation

$$\nabla q(\delta^*) = 0,$$

which is equivalent to requiring that the displacement $\delta$ solves the following linear system:

$$\big( J^T W J \big)\, \delta = J^T W r$$

This picture illustrates the calculation of δ in a simple one-dimensional case:

*(figure: the local quadratic approximation and the resulting displacement $\delta$)*

In the picture, the displacement has been applied as it is. In practice, however, since the quadratic approximation is generally only valid locally, $\delta$ just provides the displacement direction, while its magnitude is rescaled according to a small positive number $h$ (the step) when updating $\theta$:

$$\theta^{(k+1)} = \theta^{(k)} + h\,\delta$$
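Continuing the sketch above, a single iteration might look as follows (an illustrative implementation, not the repository's: the Jacobian is approximated by forward differences, and the step `h` is kept fixed, whereas a practical implementation would typically adapt it):

```python
def jacobian(theta, x, eps=1e-6):
    """Numerical Jacobian J_ij = d f(x_i; theta) / d theta_j, via forward differences."""
    f0 = f(x, theta)
    J = np.empty((x.size, theta.size))
    for j in range(theta.size):
        t = theta.copy()
        t[j] += eps
        J[:, j] = (f(x, t) - f0) / eps
    return J

def lm_step(theta, x, y, w, h=0.1):
    """One iteration: solve (J^T W J) delta = J^T W r, then theta <- theta + h * delta."""
    r = residuals(theta, x, y)
    J = jacobian(theta, x)
    A = J.T @ (w[:, None] * J)   # J^T W J
    b = J.T @ (w * r)            # J^T W r
    delta = np.linalg.solve(A, b)
    return theta + h * delta
```

Starting from an initial guess such as `theta = np.array([1.0, 0.5])`, one would call `theta = lm_step(theta, x, y, w)` repeatedly until the displacement becomes small.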

## Regularization

The Levenberg-Marquardt algorithm can be extended to incorporate regularization terms. Viewing the problem from a Bayesian perspective, we can provide a prior distribution on $\theta$. For simplicity, we assume that the prior distributions on the different parameter components are independent, so that the global prior distribution factorizes:

$$\pi(\theta) = \prod_{j=1}^p \pi_j(\theta_j)$$

We assume moreover that the one-dimensional priors on the single components are either Gaussian or lognormal:

$$\theta_j \sim \mathcal{N}(\mu_j, \tau_j^2) \qquad \text{or} \qquad \log\theta_j \sim \mathcal{N}(\mu_j, \tau_j^2)$$

Reasoning in Bayesian terms, this time we estimate $\theta$ via maximum a posteriori instead of maximum likelihood. The posterior distribution of $\theta$ is

$$p(\theta \mid y_1, \dots, y_N) \propto L(\theta, \sigma)\, \pi(\theta)$$

(using the fact that the marginal likelihood $p(y_1, \dots, y_N)$ does not depend on $\theta$).

Keeping the same notations as before, the objective function to minimize is now

$$\varphi(\theta) = \frac{N}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^N w_i\, r_i(\theta)^2 - \sum_{j=1}^p \log \pi_j(\theta_j) + \text{const}$$

(the term $\text{const}$ incorporates terms that are independent of both $\theta$ and $\sigma$). This corresponds to minimizing a weighted sum of squared residuals plus a series of regularization terms.

The presence of the prior distributions adds the following contributions to $\varphi$ and to the $j$-th component of its gradient:

Gaussian prior $\theta_j \sim \mathcal{N}(\mu_j, \tau_j^2)$:

$$\frac{(\theta_j - \mu_j)^2}{2\tau_j^2}, \qquad \frac{\theta_j - \mu_j}{\tau_j^2}$$

Lognormal prior $\log\theta_j \sim \mathcal{N}(\mu_j, \tau_j^2)$:

$$\log\theta_j + \frac{(\log\theta_j - \mu_j)^2}{2\tau_j^2}, \qquad \frac{1}{\theta_j}\left(1 + \frac{\log\theta_j - \mu_j}{\tau_j^2}\right)$$

(in the lognormal case, the first-order expansion of $\log$ around the current iterate is used when building the quadratic approximation).
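In code, the per-parameter contributions (value, first and second derivative with respect to $\theta_j$) might look as follows; this is a sketch using the exact lognormal derivatives rather than the first-order expansion mentioned above, and the function names are ours:

```python
def gaussian_prior_terms(theta_j, mu_j, tau_j):
    """Contributions of a Gaussian prior N(mu_j, tau_j^2) on theta_j."""
    phi  = (theta_j - mu_j) ** 2 / (2.0 * tau_j ** 2)
    grad = (theta_j - mu_j) / tau_j ** 2
    hess = 1.0 / tau_j ** 2
    return phi, grad, hess

def lognormal_prior_terms(theta_j, mu_j, tau_j):
    """Contributions of a lognormal prior (log theta_j ~ N(mu_j, tau_j^2)), theta_j > 0."""
    lt = np.log(theta_j)
    phi  = lt + (lt - mu_j) ** 2 / (2.0 * tau_j ** 2)
    grad = (1.0 + (lt - mu_j) / tau_j ** 2) / theta_j
    hess = (1.0 / tau_j ** 2 - 1.0 - (lt - mu_j) / tau_j ** 2) / theta_j ** 2
    return phi, grad, hess
```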

The linear system is now

$$\big( J^T W J + \sigma^2 \Lambda \big)\, \delta = J^T W r - \sigma^2 g$$

where $\Lambda$ is the diagonal matrix collecting the second derivatives of the regularization terms, $g$ is the vector of their first derivatives (both evaluated at the current iterate $\theta^{(k)}$), and $\sigma$ is replaced by its maximum likelihood estimate (see above).
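Putting the pieces together, a regularized iteration might be sketched as follows, under the same illustrative assumptions as above; `priors` is a hypothetical mapping from parameter index to one of the helper functions:

```python
def lm_step_regularized(theta, x, y, w, priors, h=0.1):
    """One iteration: solve (J^T W J + sigma^2 * Lambda) delta = J^T W r - sigma^2 * g."""
    r = residuals(theta, x, y)
    J = jacobian(theta, x)
    A = J.T @ (w[:, None] * J)        # J^T W J
    b = J.T @ (w * r)                 # J^T W r
    s2 = np.mean(w * r ** 2)          # sigma^2 replaced by its MLE at the current iterate
    g = np.zeros_like(theta)          # first derivatives of the regularization terms
    H = np.zeros_like(theta)          # diagonal of Lambda (second derivatives)
    for j, terms in priors.items():
        _, g[j], H[j] = terms(theta[j])
    delta = np.linalg.solve(A + s2 * np.diag(H), b - s2 * g)
    return theta + h * delta

# Example usage: Gaussian prior on theta_0, lognormal prior on theta_1.
# priors = {0: lambda t: gaussian_prior_terms(t, 1.0, 0.5),
#           1: lambda t: lognormal_prior_terms(t, 0.0, 0.3)}
```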