Deep Learning

https://jsideas.net/snapshot_ensemble/ https://www.deeplearningpatterns.com/doku.php?id=ensembles

Deep learning is the modern version of artificial neural networks, full of tricks and techniques. In mathematics, it is a nonlinear, non-convex composite of many functions. Its name, deep learning, distinguishes it from the classical "shallow" machine learning methods. However, its complexity makes it still engineering, even art, rather than science. There is no first principle in deep learning, only trial and error. In theory, we do not clearly understand how to design more robust and efficient network architectures; in practice, we can apply it to diverse fields. It is considered one approach to artificial intelligence.

Deep learning is a typical hierarchical machine learning model, consisting of a hierarchical representation of the input data, non-linear evaluation and non-convex optimization. Some applications of deep learning are partially listed below.


Father of Deep Learning
Father of Deep Learning
A history of deep learning
Three Giants' Survey in (Nature 521 p 436)
Critique of Paper by "Deep Learning Conspiracy" (Nature 521 p 436)
http://principlesofdeeplearning.com/
Deep Learning in Neural Networks: An Overview
AI winter
Deep learning in wiki, Deep Learning in Scholarpedia.

More
A Brief History of Deep Learning (Part One)
On the Origin of Deep Learning
history of nn
nn_timeline
https://mitpress.mit.edu/books/deep-learning-revolution
https://blog.keras.io/the-limitations-of-deep-learning.html
MIT Deep Learning Book

Deep learning originates from neural networks and applies to many fields including computer vision and natural language processing. Architecture and optimization are the core content of deep learning models; we will focus on the first one. The optimization methods are mostly variants of stochastic/incremental gradient descent.

Artificial Neural Network

Artificial neural networks are most easily visualized in terms of a directed graph. In the case of sigmoidal units, a node ${s}$ represents a sigmoidal unit and a directed edge $e=(u,v)$ indicates that one of sigmoidal unit ${v}$'s inputs is the output of sigmoidal unit ${u}$. In some kinds of neural networks there are also connections from a node to itself. The way or topology in which nodes are connected is an important part of deep neural network architecture. The other part is the choice of activation function. Any deep neural network can be expressed in the form of a computational graph. We will discuss all the details after a glimpse at the neural network zoo.


Perceptron

Perceptron can be seen as a generalized linear model. In mathematics, it is a map $$ f:\mathbb{R}^n\rightarrow\mathbb{R}. $$

It can be decomposed into $2$ steps:

  1. Aggregate all the information: $z=\sum_{i=1}^{n}w_ix_i+b_0=(x_1,x_2,\cdots,x_n,1)\cdot (w_1,w_2,\cdots,w_n,b_0)^{T}.$
  2. Transform the information to activate something: $y=\sigma(z)$, where $\sigma$ is nonlinear such as step function.

It can solve linearly separable problems.

Learning algorithm

We draw it from the Wikipedia page. Learning is to find optimal parameters of the model. Before that, we must feed data into the model. The training data of sample size $n$ is $$ D=\{(\mathbf{x}_i, d_i)\}_{i=1}^{n} $$

where

  • $\mathbf{x}_i$ is the $n$-dimensional input vector;
  • $d_i$ is the desired output value of the perceptron for that input $\mathbf{x}_i$.

  1. Initialize the weights and the threshold. Weights may be initialized to 0 or to a small random value. In the example below, we use 0.

  2. For each example $j$ in our training set $D$, perform the following steps over the input $\mathbf{x}_{j}$ and desired output $d_{j}$:

    • Calculate the actual output: $$y_{j}(t)= f(\left<\mathbf{w}(t), \mathbf{x}_{j}\right>).$$
    • Update the weights: $w_{i}(t+1) = w_{i}(t) + r\cdot (d_{j}-y_{j}(t))x_{(j,i)},$ for all features $0\leq i\leq n$, where $r$ is the learning rate.
  3. For offline learning, the second step may be repeated until the iteration error $\frac{1}{s}\sum_{j=1}^{s}|d_{j}-y_{j}(t)|$ is less than a user-specified error threshold $\gamma$, or a predetermined number of iterations have been completed, where s is again the size of the sample set.

$\color{lime}{Note}$: the perceptron model is a linear classifier, i.e. the training data set $D$ must be linearly separable so that the learning algorithm can converge.
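
To make the update rule above concrete, here is a minimal NumPy sketch of the perceptron learning rule (not from the original text; the function name `train_perceptron` and the toy AND data are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, d, r=1.0, epochs=100):
    """Minimal sketch of the perceptron learning rule described above.

    X : (n_samples, n_features) inputs, d : targets in {0, 1},
    r : learning rate. Weights (with the bias folded in) start at 0.
    """
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # append constant 1 for the bias
    w = np.zeros(X.shape[1])                        # initialize weights to 0
    step = lambda z: 1.0 if z >= 0 else 0.0         # step activation
    for _ in range(epochs):
        errors = 0
        for x_j, d_j in zip(X, d):
            y_j = step(w @ x_j)                     # actual output
            w += r * (d_j - y_j) * x_j              # perceptron update rule
            errors += abs(d_j - y_j)
        if errors == 0:                             # converged on linearly separable data
            break
    return w

# toy linearly separable example: logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])
print(train_perceptron(X, d))
```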


----

- Perceptrons
- Do artificial neural networks really work like neurons?

More in Wikipedia page.

It was the first attempt to model cognition.

Activation functions

The nonlinear function $\sigma$ is conventionally called activation function. There are some activation functions in history.

  • Sign function $$ f(x)= \begin{cases} 1,&\text{if $x > 0$}\\ -1,&\text{if $x < 0$} \end{cases} $$

  • Step function $$ f(x)=\begin{cases}1,&\text{if $x\geq0$}\\ 0,&\text{otherwise}\end{cases} $$

  • Sigmoid function $$ \sigma(x)=\frac{1}{1+e^{-x}}. $$

  • Radial basis function $$ \rho(x)=e^{-\beta(x-x_0)^2}. $$

  • TanH function $$ tanh(x)=2\sigma(2x)-1=\frac{2}{1+e^{-2x}}-1. $$

  • ReLU function $$ ReLU(x)={(x)}_{+}=\max\{0,x\}=\begin{cases}x,&\text{if $x\geq 0$}\\ 0,&\text{otherwise}\end{cases}. $$
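
A small NumPy sketch of these activation functions, for illustration only (the helper names and the sampled grid are assumptions, not a reference implementation):

```python
import numpy as np

def sign(x):     return np.where(x > 0, 1.0, -1.0)
def step(x):     return np.where(x >= 0, 1.0, 0.0)
def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))
def rbf(x, beta=1.0, x0=0.0):  return np.exp(-beta * (x - x0) ** 2)
def tanh(x):     return 2.0 * sigmoid(2.0 * x) - 1.0   # equals np.tanh(x)
def relu(x):     return np.maximum(0.0, x)

x = np.linspace(-3, 3, 7)
for f in (sign, step, sigmoid, rbf, tanh, relu):
    print(f.__name__, np.round(f(x), 3))
```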


Feed-forward Neural Network

Representation of Feedforward Neural Network

Given that the function of a single neuron is rather simple, it subdivides the input space into two regions by a hyperplane, the complexity must come from having more layers of neurons involved in a complex action (like recognizing your grandmother in all possible situations). The "squashing" functions introduce critical non-linearities in the system, without their presence multiple layers would still create linear functions. Organized layers are very visible in the human cerebral cortex, the part of our brain which plays a key role in memory, attention, perceptual awareness, thought, language, and consciousness.[^13]

The feed-forward neural network is also called multilayer perceptron. The best way to create complex functions from simple functions is by composition. In mathematics, it can be considered as multi-layered non-linear composite function:

$$ X\to \sigma\circ (W_1X+b_1)=H_1\to \sigma\circ(W_2 H_1+b_2)=H_2 \to\cdots \sigma(WH+b)=y $$

where the notation $\circ$ means pointwise (elementwise) operation and $W_1,b_1,W_2,b_2,\cdots, W,b$ are the parameters of the affine maps. Thus the data flow in the form of the chain: $$ \begin{align} f = H_1 \circ {H_2}\circ \cdots \circ{\sigma} \qquad &\text{Composite form} \\ X\stackrel{\sigma}\to H_1 \stackrel{\sigma}\to H_2 \stackrel{\sigma} \to \cdots\stackrel{\sigma}\to y \qquad &\text{Hierarchy form} \\ \mathbb{R}^p\to \mathbb{R}^{l_1}\to \mathbb{R}^{l_2}\to \cdots\to \mathbb{R} \qquad& \text{Dimension} \end{align} $$

where the circle notation $\circ$ means forward composition, i.e. the output becomes the input of the subsequent operation. In the hierarchy form, we omit the affine maps. It can be written in the recursive form: $$ \begin{align} \mathbf{z}_{i} &= W_{i}H_{i-1}+b_i, \\ H_{i} &=\sigma\circ (\mathbf{z}_{i}), \quad \forall i\in\{1,2,\dots,D\} \tag 3 \end{align} $$ where $H_{0}=X\in\mathbb{R}^p$, $b_i$ is a vector and $W_{i}$ is a matrix. The number of recursive steps $D$ is called the depth of the network.

  1. In the first layer, we feed the input vector $X\in\mathbb{R}^{p}$ and connect it to each unit in the next layer via $W_1X+b_1\in\mathbb{R}^{l_1}$ where $W_1\in\mathbb{R}^{l_1\times p}, b_1\in\mathbb{R}^{l_1}$. The output of the first layer is $H_1=\sigma\circ(W_1X+b_1)$, or in other words the output of the $j$th unit in the first (hidden) layer is $h_j=\sigma({(W_1X+b_1)}_j)$ where ${(W_1X+b_1)}_j$ is the $j$th element of the $l_1$-dimensional vector $W_1X+b_1$.
  2. In the second layer, its input is the output of the first layer, $H_1$, and we apply a linear map to it: $W_2H_1+b_2\in\mathbb{R}^{l_2}$, where $W_2\in\mathbb{R}^{l_2\times l_1}, b_2\in\mathbb{R}^{l_2}$. The output of the second layer is $H_2=\sigma\circ(W_2H_1+b_2)$, or in other words the output of the $j$th unit in the second (hidden) layer is $h_j=\sigma({(W_2H_1+b_2)}_j)$ where ${(W_2H_1+b_2)}_j$ is the $j$th element of the $l_2$-dimensional vector $W_2H_1+b_2$.
  3. The map between the second layer and the third layer is similar to (1) and (2): the linear maps datum to different dimensional space and the nonlinear maps extract better representations.
  4. In the last layer, suppose the input data is $H\in\mathbb{R}^{l}$. The output may be vector or scalar values and $W$ may be a matrix or vector as well as $y$.

The ordinary feedforward neural networks take the sigmoid function $\sigma(x)=\frac{1}{1+e^{-x}}$ as the nonlinear activation function, while RBF networks take a radial basis function as the activation function, such as $\sigma(x)=e^{-c{\|x\|}_2^2}$.
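
A minimal sketch of the layer-by-layer composition $H_i=\sigma\circ(W_iH_{i-1}+b_i)$ in NumPy; the layer sizes and random weights below are arbitrary illustrative choices, not part of the original text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights, biases):
    """Forward pass H_i = sigma(W_i H_{i-1} + b_i), applying the same
    sigmoid nonlinearity at every layer for simplicity."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)
    return h

# layer sizes p=4 -> l1=5 -> l2=3 -> output 1 (arbitrary, for illustration)
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases  = [np.zeros(m) for m in sizes[1:]]
print(mlp_forward(rng.normal(size=4), weights, biases))
```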

In theory, the universal approximation theorem shows the power of feed-forward neural networks if we take some proper activation function such as the sigmoid function.

Evaluation and Optimization in Multilayer Perceptron

The problem is how to find the optimal parameters $W_1, b_1, W_2, b_2,\cdots, W, b$? The multilayer perceptron is an example of supervised learning, which means that we feed the data $D=\{(\mathbf{x}_i,d_i)\}_{i=1}^{n}$ to it and evaluate it.

The general form of the evaluation is given by: $$ J(\theta)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{L}[f(\mathbf{x}_i|\theta),\mathbf{d}_i] $$ where $\mathbf{d}_i$ is the desired value of the input $\mathbf{x}_i$ and $\theta$ is the parameters of multilayer perceptron. The notation $f(\mathbf{x}_i|\theta)$ is the output given parameters $\theta$. The function $\mathbb{L}$ is loss function to measure the discrepancy between the predicted value $f(\mathbf{x}_i|\theta)$ and the desired value $\mathbf{d}_i$.

In general, the number of parameters $\theta$ is less than the sample size $n$. And the objective function $J(\theta)$ is not convex.

We will solve it in the next section Backpropagation, Optimization and Regularization.


The diagram of MLP
Visualizing level surfaces of a neural network with raymarching
Evaluation for different tasks

Evaluation is to judge the models with different parameters in some sense via objective function. In maximum likelihood estimation, it is likelihood or log-likelihood; in parameter estimation, it is bias or mean square error; in regression, it depends on the case.

In machine learning, evaluation is aimed to measure the discrepancy between the predicted values and the true values by a loss function. It is expected to be continuous, smooth and differentiable, but this is not strictly necessary. The principle is that the loss function should make the optimization or learning tractable. For classification, the loss function usually is cross entropy; for regression, the loss function can be any norm function such as the $\ell_2$ norm; in probabilistic models, the loss function usually is the joint probability or the logarithm of the joint probability.

In classification, the last layer is to predict the degree of belief of the labels via softmax function, i.e.

$$ softmax(z)=(\frac{\exp(z_1)}{\sum_{i=1}^{n}\exp(z_i)},\frac{\exp(z_2)}{\sum_{i=1}^{n} \exp(z_i)}, \cdots, \frac{\exp(z_n)}{\sum_{i=1}^{n}\exp(z_i)}) $$

where $n$ is the number of total classes. The labels are encoded as the one hot vector such as $\mathrm{d}=(1,0,0,\cdots,0)$. The cross entropy is defined as:

$$ \mathbf{H}(d,p)=-\sum_{i=1}^{n}d_i\log(p_i)=\sum_{i=1}^{n}d_i\log(\frac{1}{p_i}), $$

where $d_i$ is the $i$th element of the one-hot vector $d$ and $p_i=\frac{\exp(z_i)}{\sum_{j=1}^{n}\exp(z_j)}$ for all $i=1,2\dots, n.$

Suppose $\mathrm{d}=(1,0,0,\cdots,0)$, the cross entropy is $\mathbf{H}(d,p)=-\log(p_1)=\log \sum_{i=1}^{n}\exp(z_i)-z_1$. The cost function is $\frac{1}{n}\sum_{i=1}^{n}\mathbf{H}(d^{i},p^{i})$ over the training data set $\{(\mathbf{x}_i,d^i)\}_{i=1}^{n}$ where $\mathbf{x}_i$ is the features of the $i$th sample and $d^i$ is the desired true target label encoded as a one-hot vector, while $p^{i}$ is the predicted label of $\mathbf{x}_i$. See the following links for more information on cross entropy and softmax.

VISUALIZING THE LOSS LANDSCAPE OF NEURAL NETS
VGG ResNet

Cross entropy is an example of Bregman divergence.
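
A short sketch of the softmax and cross-entropy formulas above in NumPy; the max-shift is a standard numerical-stability trick and the helper names are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                   # shift for numerical stability; result unchanged
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(d, p):
    """H(d, p) = -sum_i d_i log p_i for a one-hot vector d."""
    return -np.sum(d * np.log(p + 1e-12))

z = np.array([2.0, 1.0, 0.1])
d = np.array([1.0, 0.0, 0.0])           # one-hot label
p = softmax(z)
print(p, cross_entropy(d, p))
# with d = e_1, H(d, p) = log(sum_i exp(z_i)) - z_1 (computed via log-sum-exp)
print(np.log(np.exp(z - z.max()).sum()) + z.max() - z[0])
```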

  • Hinge loss function

    It is a loss function for binary classification, of which the output is $\{1, -1\}$.

    $$ Hinge(x)=\max\{0, 1-tx\} $$ where $t=+1$ or $t=-1$.

  • Negative logarithm likelihood function

    It is always the loss function in probabilistic models.

Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications


In regression, the loss function may simply be the squared $\ell_2$ norm, i.e. $\mathbb{L}(d,p)=(d-p)^{2}$ where ${d}$ is the desired target and $p$ is the predicted result. And the cost function is mean squared error:

$$ J(\theta)=\frac{1}{n}\sum_{i=1}^{n}[f(\mathbf{x}_i|\theta)-\mathrm{d}_i]^2. $$

In robust statistics, there are more loss functions such as Huber loss and Tukey loss.


  • Huber loss function $$ Huber_{\delta}(x)= \begin{cases} \frac{|x|^2}{2},&\text{if $|x|\leq\delta$}\\ \delta(|x|-\frac{1}{2}\delta),&\text{otherwise} \end{cases} $$

  • Tukey loss function

    $$ Tukey_{\delta}(x)= \begin{cases} (1-[1-\frac{x^2}{\delta^2}]^3)\frac{\delta^2}{6}, &\text{if $|x|\leq\delta$}\\ \frac{\delta^2}{6}, &\text{otherwise} \end{cases} $$

    where its derivative is called Tukey's Biweight: $$ \phi(x)= \begin{cases} x(1-\frac{x^2}{\delta^2})^2 , &\text{if $|x|\leq\delta$}\\ 0, &\text{otherwise} \end{cases} $$ if we do not consider the points $x=\pm \delta$.
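
For illustration, a NumPy sketch of the Huber and Tukey losses as defined above (the default $\delta$ values are arbitrary choices, not from the text):

```python
import numpy as np

def huber(x, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    return np.where(np.abs(x) <= delta,
                    0.5 * x ** 2,
                    delta * (np.abs(x) - 0.5 * delta))

def tukey(x, delta=4.685):
    """Tukey biweight loss: bounded above by delta^2 / 6 outside [-delta, delta]."""
    inside = 1.0 - (1.0 - (x / delta) ** 2) ** 3
    return np.where(np.abs(x) <= delta, inside * delta ** 2 / 6.0, delta ** 2 / 6.0)

residuals = np.array([-10.0, -1.0, 0.0, 0.5, 10.0])
print(huber(residuals), tukey(residuals))
```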


It is important to choose or design the loss function, or more generally the objective function, which can select variables as in LASSO or incorporate prior information as in Bayesian estimation. Besides the representation or model, it is the objective function that affects the usefulness of learning algorithms.

The smoothly clipped absolute deviation (SCAD) penalty for variable selection in high dimensional statistics is defined by its derivative: $$ p_{\lambda}^{\prime} (\theta) = \lambda \left\{ \mathbb{I}(\theta \leq \lambda)+\frac{{(a\lambda-\theta)}_{+}}{(a-1)\lambda}\mathbb{I}(\theta > \lambda) \right\} $$

or in another word

$$ p_{\lambda}^{\prime} (\theta) = \begin{cases} \lambda & ,\quad\text{if $\theta \leq \lambda$;} \\ \frac{ {(a\lambda-\theta)}_{+} }{ a-1 } &,\quad\text{otherwise} \end{cases} $$

for some $a > 2$ and $\theta > 0$.

The minimax concave penalties(MCP) provides the convexity of the penalized loss in sparse regions to the greatest extent given certain thresholds for variable selection and unbiasedness. It is defined as $$\rho(t;\lambda) = \lambda \int_{0}^{t}{(1-\frac{x}{\gamma\lambda})}_{+} \mathrm{d}x.$$

The function defined by its derivative is convenient for us to minimize the cost function via gradient-based methods.


The two page paper Eliminating All Bad Local Minima from Loss Landscapes Without Even Adding an Extra Unit modified the loss function $L(\theta)$ to eliminate all the bad local minima:

$$ \overline{L}(\theta, a,b) = L(\theta)(1+(a\exp(b)-1)^2)+\lambda a^2 $$

where $a,b\in\mathbb{R}$ are auxiliary parameters, and $\lambda\in\mathbb{R}^{+}$ is a regularization hyperparameter.

Its gradient with respect to the auxiliary parameters ${a,b}$ is given by $$ \frac{\partial \overline{L}(\theta, a,b)}{\partial a} = 2 L(\theta)(a\exp(b)-1)\exp(b) + 2\lambda a, \\ \frac{\partial \overline{L}(\theta, a,b)}{\partial b} = 2a L(\theta)(a\exp(b)-1)\exp(b), $$

so that setting them to be 0s, we could get $a=0, L(\theta) = 0$ if $|b|<\infty$.

For more on loss function see:

Backpropagation, Training and Regularization

Backpropagation

Automatic differentiation is the generic name for techniques that use the computational representation of a function to produce analytic values for the derivatives. Automatic differentiation techniques are founded on the observation that any function, no matter how complicated, is evaluated by performing a sequence of simple elementary operations involving just one or two arguments at a time. Backpropagation is one special case of automatic differentiation, i.e. reverse-mode automatic differentiation.

The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule in terms of partial derivatives. Suppose $g:\mathbb{R}^{p}\to\mathbb{R}^n$ and $f:\mathbb{R}^{n}\to\mathbb{R}^{m}$, and let $b=g(a)$, $c=f(b)$, Chain rule says that $$\frac{\partial c_i}{\partial a_j}=\sum_{k}\frac{\partial c_i}{\partial b_k}\frac{\partial b_k}{\partial a_j}.$$

The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.


Suppose that $f(x)={\sigma}\circ(WH + b)$, where

  • $H =\sigma\circ(W_4H_3 + b_4)$,
  • $H_3=\sigma\circ(W_3H_2 + b_3)$,
  • $H_2=\sigma\circ(W_2H_1 + b_2)$,
  • $H_1=\sigma\circ(W_1x + b_1)$,

we want to compute the gradient of $L(x_0,d_0)={\|f(x_0)-d_0\|}^{2}_2$ with respect to all the weights $W_1,W_2,W_3,W_4,W$:

$$ \frac{\partial L(x_0,d_0)}{\partial W_n^i}=\frac{\partial L(x_0,d_0)}{\partial f(x_0)}\frac{\partial f(x_0)}{\partial W_n^i}\quad\forall i\in\{1,2,\dots,l_n\}, \forall n\in\{1,2,3,4\} $$

and it is fundamental to compute the gradient with respect to the last layer as below.

  • the gradient of loss function with respect to the prediction function: $$\frac{\partial L(x_0,d_0)}{\partial f(x_0)}=2[f(x_0)-d_0],$$

  • the gradient of each unit in the prediction function with respect to the weight in the last layer: $$ \frac{\partial f^{j}(x_0)}{\partial W^j}= \frac{\partial \sigma(W^jH+b^j)}{\partial W^j}= {\sigma}^{\prime}(W^jH+b^j) H ,\quad\forall j\in\{1,2,\dots,l\}, $$

  • the gradient of the prediction function with respect to the last hidden state: $$ \frac{\partial f^{j}(x_0)}{\partial H} = \frac{\partial \sigma(W^jH + b^j)}{\partial H} = {\sigma}^{\prime}(W^jH + b^j) W^j ,\quad\forall j\in\{1,2,\dots,l\}, $$ where $f^{j}(x_0)$, $W^{j}$, $b^j$ and $\sigma^{\prime}(z)$ are the $j$-th element of $f(x_0)$, the $j$-th row of matrix $W$, the $j$-th element of vector ${b}$ and $\frac{\mathrm{d}\sigma(z)}{\mathrm{d} z}$, respectively.

The Architecture of Feedforward Neural Networks
Each connection (the black line) is attached with a weight parameter.

Recall the chain rule with more variables: $$ \frac{\partial f(m(x_0),n(x_0))}{\partial x_0}=\frac{\partial f(m(x_0),n(x_0))}{\partial m(x_0)}\frac{\partial m(x_0)}{\partial x_0} + \frac{\partial f(m(x_0),n(x_0))}{\partial n(x_0)}\frac{\partial n(x_0)}{\partial x_0}. $$

Similarly, we can compute the following gradients: $$\frac{\partial H^j}{\partial W_4^j} =\frac{\partial \sigma(W_4^j H_3+b_4^j)}{\partial W_4^j} =[\sigma^{\prime}(W_4^j H_3+b_4^j)H_3]^T \qquad\forall j\in\{1,2,\dots,l\};$$

$$\frac{\partial H^j}{\partial H_3} =\frac{\partial \sigma(W_4^j H_3+b_4^j)}{\partial H_3} =[\sigma^{\prime}(W_4^j H_3+b_4^j)W_4^j]^T \qquad\forall j\in\{1,2,\dots,l_4\};$$

$$\frac{\partial H_3^j}{\partial W_3^j}=\frac{\partial \sigma(W_3^j H_2+b_3^j)}{\partial W_3^j} =[\sigma^{\prime}(W_3^j H_2+b_3^j)H_2 ]^T \qquad\forall j\in\{1,2,\dots,l_3\};$$

$$\frac{\partial H_3^j}{\partial H_2} =\frac{\partial \sigma(W_3^j H_2+b_3^j)}{\partial H_2} =[\sigma^{\prime}(W_3^j H_2+b_3^j)W_3^j]^T \qquad\forall j\in\{1,2,\dots,l_3\};$$

$$\frac{\partial H_2^j}{\partial W_2^j}=\frac{\partial \sigma(W_2^j H_1+b_2^j)}{\partial W_2^j} =[\sigma^{\prime}(W_2^j H_1+b_2^j)H_1 ]^T \qquad\forall j\in\{1,2,\dots,l_2\};$$

$$\frac{\partial H_2^j}{\partial H_1} =\frac{\partial \sigma(W_2^j H_1+b_2^j)}{\partial H_1} =[\sigma^{\prime}(W_2^j H_1+b_2^j)W_2^j]^T \qquad\forall j\in\{1,2,\dots,l_2\};$$

$$\frac{\partial H_1^j}{\partial W_1^j}=\frac{\partial \sigma(W_1^j x_0+b_1^j)}{\partial W_1^j} =[\sigma^{\prime}(W_1^j x_0+b_1^j)x_0]^T \qquad\forall j\in\{1,2,\dots,l_1\}.$$


The multilayer perceptron $f(x)$ can be written in a chain form: $$ X\stackrel{\sigma}{\to} H_1 \stackrel{\sigma}{\to} H_2\stackrel{\sigma}{\to} H_3 \stackrel{\sigma}{\to} H \stackrel{\sigma}{\to} f(x) \\ X\rightarrow W_1 X \rightarrow W_2H_1 \rightarrow W_3H_2 \rightarrow W_4H_3 \rightarrow WH \rightarrow y \\ \mathbb{R}^{p}\to \mathbb{R}^{l_1}\to \mathbb{R}^{l_2}\to \mathbb{R}^{l_3}\to \mathbb{R}^{l}\to \mathbb{R}^{o} $$

while the backpropagation to compute the gradient is in the reverse order:

$$ \frac{\partial y}{\partial W}\to \frac{\partial y}{\partial H}\to \frac{\partial H}{\partial W_4}\to \frac{\partial H}{\partial H_3}\to \frac{\partial H_3}{\partial W_3}\to \frac{\partial H_3}{\partial H_2}\to \frac{\partial H_2}{\partial W_2}\to \frac{\partial H_2}{\partial H_1}\to \frac{\partial H_1}{\partial W_1}. $$

In general, the gradient of any weight can be computed by backpropagation algorithm.

The first step is to compute the gradient of the squared loss function with respect to the output $f(x_0)=y\in\mathbb{R}^{o}$, i.e. $\frac{\partial L(x_0, d_0)}{\partial f(x_0)}=2(f(x_0)-d_0)=2(\sigma\circ(WH+b)-d_0)$, of which the $i$th element is $2(y^{i}-d_0^i)=2(\sigma(W^{i}H+b^{i})-d_0^{i}),\forall i\in\{1,2,\dots,o\}$. Thus $$\frac{\partial L(x_0, d_0)}{\partial W^{i}}=\frac{\partial L(x_0, d_0)}{\partial y^{i}}\frac{\partial y^{i}}{\partial W^{i}}=2(y^{i} -d_0^i)\sigma^{\prime}(W^iH+b^i)[H]^T,\\ \frac{\partial L(x_0, d_0)}{\partial b^{i}}=\frac{\partial L(x_0, d_0)}{\partial y^{i}}\frac{\partial y^{i}}{\partial b^{i}}=2(y^{i}-d_0^i)\sigma^{\prime}(W^iH + b^i).$$ Thus we can compute the gradients of all the rows of $W$. Note that $H$ has already been computed during the forward propagation in that layer.

And $H=\sigma\circ(W_4H_3+b_4)$, of which the $i$th element is $$H^{i}=\sigma(W_4 H_3 +b_4)^{i}=\sigma(W_4^{i} H_3+b_4^{i}).$$

And we can compute the gradient of columns of $W_4$:

$$\frac{\partial L(x_0,y_0)}{\partial W_4^i}$$ where $y_0=f(x_0)=\sigma\circ[W\underbrace{\sigma\circ(W_4H_3+b_4)}_{H}+b]$.

and by the chain rule we obtain $$\frac{\partial L(x_0,y_0)}{\partial W_4^i} =\sum_{j=1}^{o}\underbrace{\frac{\partial L(x_0,y_0)}{\partial y^j}}_{\text{computed in the last layer}} \sum_{n}\Big(\overbrace{\frac{\partial y^j}{\partial H^n}}^{\text{the hidden layer}} \frac{\partial H^n}{\partial W_4^i}\Big)$$

$$ = \sum_{j=1}^{o} \frac{\partial L}{\partial y^j} \sum_{n}\Big(\underbrace{\frac{\partial y^j}{\partial (W^jH+b^j)} \frac{\partial (W^jH + b^j)}{\partial H^{n}}}_{\frac{\partial y^j}{\partial H^n}} \frac{\partial H^{n}}{\partial W_4^i}\Big) \\ = \sum_{j=1}^{o} \frac{\partial L}{\partial y^j} \sum_{n}\Big(\frac{\partial y^j}{\partial (W^jH+b^j)} \frac{\partial (W^jH + b^j)}{\partial H^{n}} \underbrace{\sigma^{\prime}(W^i_4 H_3+b^i_4)[H_3]^T}_{\text{computed after the forward computation}}\Big), $$

where $W^{j,i}$ is the $i$th element of $j$th row in matrix $W$.

$$ \frac{\partial L(x_0,y_0)}{\partial W_3^i}= \sum_{j=1}^{o} \frac{\partial L}{\partial y^j} \sum_m\Big\{\Big[\sum_{n}\Big(\frac{\partial y^j}{\partial H^n} \overbrace{\frac{\partial H^n}{\partial H_3^m} }^{\triangle}\Big)\Big] \underbrace{\frac{\partial H_3^m}{\partial W_3^i} }_{\triangle}\Big\} $$ where all the partial derivatives or gradients have already been computed or are accessible. It is nothing more than adding and multiplying these values in order when we compute the weights of the hidden layer.

$$ \frac{\partial L(x_0,y_0)}{\partial W_2^i}= \sum_{j=1}^{o}\frac{\partial L}{\partial y^j} \sum_{l}\Big\{\underbrace{\sum_m\Big[\sum_{n}\Big(\frac{\partial y^j}{\partial H^n} \frac{\partial H^n}{\partial H_3^m}\Big)\Big]}_{\text{computed in the last layer}} \frac{\partial H_3^m}{\partial H_2^l}\Big\} \frac{\partial H_2^l}{\partial W_2^i} $$

And the gradient of the first layer is computed by $$ \frac{\partial L(x_0,y_0)}{\partial W_1^i} =\sum_{j=1}^{o}\frac{\partial L}{\partial y^j} \Big( \sum_{p}\Big(\sum_{l}\Big\{\sum_m\Big[\sum_{n}\Big(\frac{\partial y^j}{\partial H^n} \frac{\partial H^n}{\partial H_3^m} \Big)\Big] \frac{\partial H_3^m}{\partial H_2^l} \Big\} \frac{\partial H_2^l}{\partial H_1^p} \Big)\frac{\partial H_1^p}{\partial W_1^i} \Big). $$
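
A compact NumPy sketch of the forward and backward passes for a two-layer network with squared loss, following the chain-rule computations above (the layer sizes and `dL_*` variable names are illustrative assumptions, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x0, d0 = rng.normal(size=3), rng.normal(size=2)        # input and desired output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)          # hidden layer
W,  b  = rng.normal(size=(2, 4)), np.zeros(2)          # output layer

# forward pass (store intermediate values for the backward pass)
z1 = W1 @ x0 + b1; H1 = sigmoid(z1)
z  = W  @ H1 + b ; y  = sigmoid(z)
L  = np.sum((y - d0) ** 2)

# backward pass: apply the chain rule layer by layer, from the output backwards
dL_dy  = 2.0 * (y - d0)
dL_dz  = dL_dy * sigmoid(z) * (1 - sigmoid(z))
dL_dW  = np.outer(dL_dz, H1)       # gradient of the last-layer weights
dL_dH1 = W.T @ dL_dz               # propagate the gradient to the hidden state
dL_dz1 = dL_dH1 * sigmoid(z1) * (1 - sigmoid(z1))
dL_dW1 = np.outer(dL_dz1, x0)      # gradient of the first-layer weights

print(L, dL_dW.shape, dL_dW1.shape)
```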

See more information on backpropagation in the following list

Fundamental Problem of Deep Learning and Activation Functions

Vanishing gradients is the fundamental problem of deep neural networks, according to Juergen Schmidhuber. The 1991 diploma thesis of Sepp Hochreiter formally showed that deep neural networks are hard to train, because they suffer from the now famous problem of vanishing or exploding gradients: in typical deep or recurrent networks, back-propagated error signals either shrink rapidly or grow out of bounds. In fact, they decay exponentially in the number of layers, or they explode.

This fundamental problem makes it very hard to decrease the total loss function via gradient-based optimization methods. One direction of solution is to replace the sigmoid function in order to keep the advantages of gradient-based methods. Another approach is to minimize the cost function via gradient-free optimization methods such as simulated annealing.

The sigmoid function, the first function to replace the step function, is actually a cumulative distribution function (CDF), which satisfies the following conditions:

  • The function, $\sigma(x) = \frac{1}{1+\exp(-x)}$, is continuous and monotonically increasing;
  • $0 \leq \sigma(x) \leq 1, \forall x\in\mathbb{R}$;
  • $\lim_{x \to - \infty}\sigma(x)=0, \lim_{x\to\infty}\sigma(x)=1$.

It is called logistic function in statistics. This function is easy to explain as a continuous alternative to the step function. And it is much easier to evaluate than the common normal distribution.

And the derivative function of this function is simple to compute if we know the function itself:

$$\sigma^{\prime}(x) = \sigma(x)(1 - \sigma(x)).$$

It is clear that $\max_{x} \sigma^{\prime}(x) = \sigma^{\prime}(0) = \frac{1}{4}< 1$, which can make the gradients vanish during back-propagation.

By back-propagation, we obtain that $$ \frac{\partial (\sigma(\omega^T x + b))}{\partial \omega} =\frac{\partial (\sigma(\omega^T x + b))}{\partial (\omega^Tx + b)}\cdot\frac{\partial (\omega^T x + b)}{\partial \omega} \\= [\sigma(\omega^Tx + b)\cdot(1-\sigma(\omega^T x + b))]\, x. $$

The problem of vanishing gradients can be worsened by saturated neurons. Suppose that the pre-activation ${\omega}^T x + b$ fed to a neuron with a sigmoid activation is either very high or very low. The gradient of the sigmoid at very high or low values is almost 0. Any gradient update would hardly produce a change in the weights $\omega$ and the bias $b$, and it would take a lot of steps for the neuron to modify its weights so that the pre-activation falls in an area where the gradient has a substantial value.
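
A tiny numerical illustration (not from the original text) of why back-propagated signals shrink: the factor $\sigma^{\prime}(z)\leq 1/4$ is picked up once per sigmoid layer, so the product decays at least like $(1/4)^D$ with depth $D$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# the back-propagated signal through D sigmoid layers picks up a factor
# sigma'(z) <= 1/4 per layer; evaluating at z = 0 gives the best case (1/4)^D
for D in (1, 5, 10, 20):
    grad_factor = np.prod([sigmoid(0.0) * (1 - sigmoid(0.0)) for _ in range(D)])
    print(D, grad_factor)
```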

And if all the data points fed into layers are close to ${0}$, it may help to solve the vanishing gradients problems. And the sigmoid function $\sigma(x)$ is always non-negative and is not symmetric about ${0}$ so we may modify this function into

$$ \tanh(x) = 2\sigma(2x) - 1 = \frac{2}{1 + \exp(-2x)}-1 \\= \frac{1- \exp(-2x)}{1 + \exp(-2x)} \\=\frac{\exp(x) - \exp(-x)}{\exp(x)+\exp(-x)}\in(-1, 1). $$ Sometimes it is also called the bipolar sigmoid.

And its gradient is $$ \tanh^{\prime}(x)= 4\sigma^{\prime}(2x) = 4\sigma(2x)(1-\sigma(2x))\in (0,1]. $$

Another benefit of the exponential function of the feature vector, $\exp(x)$, is that it captures complex interactions of the features, although it is computationally expensive.

And an alternative function is so-called hard tanh: $f(x)=\max(-1, \min(1, x))$.


The first attempt at curbing the problem of vanishing gradients in a general deep network setting (LSTMs were introduced to combat this as well, but they were restricted to recurrent models) was the introduction of the ReLU activation function: $$ ReLU(x)= \max\{x, 0\}={(x)}_{+} \\=\begin{cases} x, & \text{if}\quad x\geq 0; \\ 0, & \text{otherwise}. \end{cases} $$

If we do not consider ${0}$, the gradient of ReLU is computed as $$ ReLU^{\prime}(x)= \begin{cases} 1, & \text{if}\quad x > 0; \\ 0, & \text{if}\quad x < 0. \end{cases} $$

And to be precise, ReLU is not differentiable at 0.

The product of gradients of ReLU function doesn't end up converging to 0 as the value is either 0 or 1. If the value is 1, the gradient is back propagated as it is. If it is 0, then no gradient is backpropagated from that point backwards.

We had a two-sided saturation in the sigmoid functions. That is the activation function would saturate in both the positive and the negative direction. In contrast, ReLUs provide one-sided saturations.

ReLUs come with their own set of shortcomings. While sparsity is a computational advantage, too much of it can actually hamper learning. Normally, the pre-activation also contains a bias term. If this bias term becomes too negative such that $\omega^T x + b < 0$, then the gradient of the ReLU activation during the backward pass is 0. Therefore, the weights and the bias causing the negative pre-activations cannot be updated.

If the weights and bias learned is such that the pre-activation is negative for the entire domain of inputs, the neuron never learns, causing a sigmoid-like saturation. This is known as the dying ReLU problem.

In order to combat the problem of dying ReLUs, the leaky ReLU was proposed. A Leaky ReLU is same as normal ReLU, except that instead of being 0 for $x \leq 0$, it has a small negative slope for that region.

$$ f(x) = \begin{cases} x, & \text{if}\quad x > 0;\\ \alpha x, & \text{otherwise}. \end{cases} $$

Randomized Leaky ReLU

$$ f(x) = \begin{cases} x, & \text{if} \quad x > 0;\\ \alpha x, & \text{otherwise}. \end{cases} \\ \alpha\sim U[0,1] $$

Parametric ReLU

$$ f(x) = \begin{cases} x, & \text{if} \quad x > 0;\\ \alpha x, & \text{otherwise}. \end{cases} $$ where this $\alpha$ can be learned since you can backpropagate into it.

ELU(Exponential Linear Units) is an alternative of ReLU:

$$ ELU(x)= \begin{cases} x, &\text{if}\quad x > 0;\\ \alpha(\exp(x)-1), &\text{otherwise}. \end{cases} $$


SWISH:

$$ \sigma(x) = x\cdot \mathrm{sigmoid}(x) = \frac{x}{1+e^{-x}}, $$ which is bounded below (by roughly $-0.28$) and unbounded above.

The derivative function of SWISH is given by $$ \sigma^{\prime}(x)= \mathrm{sigmoid}(x) + x\cdot {\mathrm{sigmoid}^{\prime}(x)} \\=\frac{1}{1+e^{-x}}\Big[1+x\Big(1-\frac{1}{1+e^{-x}}\Big)\Big] \\=\frac{1+e^{-x} + xe^{-x}}{(1+e^{-x})^2}. $$ And $\lim_{x\to +\infty}\frac{\mathrm{SWISH}(x)}{x}=\lim_{x\to +\infty}\frac{1}{1+e^{-x}}=1$.

Soft Plus: It is also known as Smooth Rectified Linear Unit, Smooth Max or Smooth Rectifier. $$ f(x)=\log(1+e^{x})\in(0,+\infty). $$ And $\lim_{x\to\infty}\frac{f(x)}{x}=1$.

Its derivative function is the sigmoid function: $$ f^{\prime}(x)=\frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}}\in (0,1). $$

SoftSign:

$$ f(x) = \frac{x}{1+|x|} \in (-1, 1) \\ f^{\prime}(x)=\frac{1}{(1+|x|)^2}\in(0, 1) $$

The sign function is defined as

$$ sgn(x)= \begin{cases} 1, &\text{if $x > 0$}\\ 0, &\text{if $x=0$} \\ - 1, &\text{if $x < 0$} \end{cases} \\=\frac{x}{|x|}\quad\text{if $x\not= 0$}. $$

The sign function is not continuous at ${0}$, while the soft sign function is continuous, smoother than the sign function, and has no jump points.

Softsign is another alternative to Tanh activation. Like Tanh, it's anti-symmetrical, zero centered, differentiable and returns a value between -1 and 1. Its flatter shape and more slowly declining derivative suggest that it may learn more efficiently. On the other hand, calculation of the derivative is more computationally cumbersome than Tanh.

Bent identity: $$ f(x)=\frac{\sqrt{x^2+1}-1}{2}+x \in(-\infty, +\infty), \\ f^{\prime}(x) = \frac{x}{2\sqrt{x^2+1}} + 1\in(0.5, 1.5). $$

A sort of compromise between Identity and ReLU activation, Bent Identity allows non-linear behaviours, while its non-zero derivative promotes efficient learning and overcomes the issues of dead neurons associated with ReLU. As its derivative can return values either side of 1, it can be susceptible to both exploding and vanishing gradients.

Sigmoid Function

| Sigmoidal function | Non-saturation function |
|---|---|
| Sigmoid | ReLU |
| Tanh | ELU |
| Hard Tanh | Leaky ReLU |
| Sign | SWISH |
| SoftSign | Soft Plus |

The functions in the right column are asymptotically close to the identity function in $\mathbb{R}^{+}$, i.e., $\lim_{x\to +\infty}\frac{f(x)}{x} = 1$. An activation function ${f}$ can potentially result in gradient vanishing if its derivative is always less than 1: $f^{\prime}(x) < 1 \quad\forall x$. The non-saturating functions can result in gradient explosion. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent


Training Methods

The training is to find the optimal parameters of the model based on the training data set. The training methods in deep learning are usually based on the gradient of the cost function together with the back-propagation algorithm. See Stochastic Gradient Descent in Numerical Optimization for details. In this section, we will discuss other optimization tricks such as Normalization.

| Concepts | Interpretation |
|---|---|
| Overfitting and Underfitting | See Overfitting or Overfitting and Underfitting With Machine Learning Algorithms |
| Memorization and Generalization | Memorizing, given facts, is an obvious task in learning. This can be done by storing the input samples explicitly, or by identifying the concept behind the input data, and memorizing their general rules. The ability to identify the rules, to generalize, allows the system to make predictions on unknown data. Despite the strictly logical invalidity of this approach, the process of reasoning from specific samples to the general case can be observed in human learning. From https://www.teco.edu/~albrecht/neuro/html/node9.html. |
| Normalization and Standardization | Normalization is to scale the data into the interval [0,1] while Standardization is to rescale the data to have zero mean $0$ and unit variance $1$. See Standardization vs. normalization. |

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

See Improve the way neural networks learn at http://neuralnetworksanddeeplearning.com/chap3.html. See more on nonconvex optimization at http://sunju.org/research/nonconvex/.

For some specific or given tasks, we would choose some proper models before training them.

Given machine learning models, optimization methods are used to minimize the cost function or maximize the performance index.

Initialization and More

As with any optimization method, the initial values affect the convergence. Some methods require that the initial values lie in the convergence region, which means they are close to the optimal values in some sense.

Initialization can have a significant impact on convergence in training deep neural networks. Simple initialization schemes have been found to accelerate training, but they require some care to avoid common pitfalls. In this post, we'll explain how to initialize neural network parameters effectively.
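
As an illustration of such schemes, here is a sketch of the Xavier (Glorot) and He initializations, which are common choices but are assumptions here rather than a specific recipe from the text:

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Glorot/Xavier uniform initialization: keeps the variance of activations
    and back-propagated gradients roughly constant across layers."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    """He initialization, commonly preferred for ReLU networks."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

W1 = xavier_init(784, 256)
W2 = he_init(256, 10)
print(W1.std(), W2.std())
```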

Initialization

Learning Rate
Batch Size
Distributed Training of Neural Networks

Regularization

In mathematics, statistics, and computer science, particularly in the fields of machine learning and inverse problems, regularization is a process of introducing additional information in order to solve an ill-posed problem or to prevent over-fitting. In general, regularization is a technique that applies to objective functions in ill-posed optimization problems. It changes the objective function or more generally the optimization procedure. However, it is not crystal clear that what is the relationship between the optimization techniques and generalization ability. See the following links for more information on optimization and generalization.

Parameter norm penalty

The $\ell_2$ norm penalty adds the squared $\ell_2$ norm of the parameters to the objective function $J(\theta)$ in order to shrink the parameters (or weights), as in ridge regression with regularization coefficient $\lambda$, i.e. $J(\theta)+\lambda {\|\theta\|}_{2}^{2}.$ Suppose $E(\theta)=J(\theta)+\lambda {\|\theta\|}_{2}^{2}$; gradient descent then takes the form $$ \theta=\theta-\eta\frac{\partial E(\theta)}{\partial \theta}=\theta -\eta\frac{\partial J(\theta)}{\partial \theta}-2\eta\lambda \theta. $$

If we want to find the minima of $E(\theta)$, the penalty term pushes $\theta$ to decay towards $0$. This leads to the following iterative formula: $$ \theta^{t+1} = (1-\lambda)\theta^{t}-\alpha_{t}\frac{\partial J(\theta^{t})}{\partial \theta}, $$ where $\lambda$ determines how you trade off the original cost $J(\theta)$ against penalizing large weights. The new term coming from the regularization causes the weight to decay in proportion to its size.
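
A minimal sketch of this weight-decay update in NumPy, using a toy quadratic objective purely for illustration (the function and parameter names are assumptions):

```python
import numpy as np

def sgd_weight_decay_step(theta, grad_J, lr=0.1, lam=1e-2):
    """One step of gradient descent on J(theta) + lam * ||theta||_2^2:
    the extra term shrinks every weight toward zero by 2*lr*lam*theta."""
    return theta - lr * grad_J(theta) - 2.0 * lr * lam * theta

# toy quadratic objective J(theta) = ||theta - 1||^2
grad_J = lambda t: 2.0 * (t - 1.0)
theta = np.array([5.0, -3.0])
for _ in range(100):
    theta = sgd_weight_decay_step(theta, grad_J)
print(theta)   # pulled slightly below the unregularized minimum at 1
```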

The $\ell_1$ norm penalty is also used in deep learning, as in LASSO. It is to solve the following optimization problem: $$\min_{\theta}J(\theta)+\lambda{\|\theta\|}_1,$$ where $\lambda$ is a hyperparameter. It brings sparsity to the model, as in LASSO.

Max norm constraints set an upper bound to regularize the networks, i.e., we minimize the constrained cost function $$ J(\theta), \qquad s.t. \qquad \|\theta \| \leq c. $$

Consider the fact that the parameters or weights are always in matrix form, i.e., $$\theta=\{W_1, W_2, \dots, W_n\},$$

the regularization terms sometimes are in the sum of norm of matrix in each layer.

Tangent prop is to minimize the cost function with penalty on gradient: $$ J(\theta)+\sum_{i} [(\nabla_x f(x)^T v^{(i)})]^2 $$

Early stop

Its essence is to strike a balance between memorization and generalization. Early stopping is to stop the procedure before reaching the minimum of the cost on the training data. It is one direct application of cross validation.

Dropout

It is to disable connections (units) stochastically during training, which is often used in visual tasks. See the original paper Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
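
A hedged sketch of (inverted) dropout as a random mask on activations; the rescaling by $1/(1-p)$ is one common convention, not necessarily the exact formulation in the original paper:

```python
import numpy as np

def dropout(h, p_drop=0.5, train=True, rng=np.random.default_rng(0)):
    """Inverted dropout: randomly zero activations at training time and rescale
    the survivors so that the expected activation is unchanged at test time."""
    if not train:
        return h
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * mask / (1.0 - p_drop)

h = np.ones(10)
print(dropout(h))                 # roughly half the units zeroed, the rest scaled to 2
print(dropout(h, train=False))    # unchanged at test time
```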

Data augmentation

Data augmentation is to augment the training data, especially in visual recognition. Overfitting in supervised learning is data-dependent. In other words, the model may generalize better if the data set is more diverse. From a statistical perspective, it is equivalent to collecting more data.


Feed forward and Propagate backwards
TF

Ablation Studies

Ablation studies have been widely used in the field of neuroscience to tackle complex biological systems such as the extensively studied Drosophila central nervous system, the vertebrate brain and, more interestingly and most delicately, the human brain. In the past, these kinds of studies were utilized to uncover structure and organization in the brain, i.e. a mapping of features inherent to external stimuli onto different areas of the neocortex. Considering the growth in size and complexity of state-of-the-art artificial neural networks (ANNs) and the corresponding growth in complexity of the tasks that are tackled by these networks, the question arises whether ablation studies may be used to investigate these networks for a similar organization of their inner representations. In this paper, we address this question and performed two ablation studies in two fundamentally different ANNs to investigate their inner representations of two well-known benchmark datasets from the computer vision domain. We found that features distinct to the local and global structure of the data are selectively represented in specific parts of the network. Furthermore, some of these representations are redundant, awarding the network a certain robustness to structural damages. We further determined the importance of specific parts of the network for the classification task solely based on the weight structure of single units. Finally, we examined the ability of damaged networks to recover from the consequences of ablations by means of recovery training.

Convolutional Neural Network

Convolutional neural network is originally aimed to solve visual tasks. In so-called Three Giants' Survey, the history of ConvNet and deep learning is curated. Deep, Deep Trouble--Deep Learning’s Impact on Image Processing, Mathematics, and Humanity tells us the mathematicians' impression on ConvNet in image processing.

Convolutional Layer

A convolutional layer consists of padding, convolution, and pooling.

Convolution Operation

The convolution operation is the basic element of convolutional neural networks. We only discuss the convolution operation in 2-dimensional space.

The image is represented as matrix or tensor in computer:

$$ M= \begin{pmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1n} \\ x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & x_{m3} & \cdots & x_{mn} \end{pmatrix} $$ where each entry $x_{ij}\in\{1,2,\cdots,256\}, i\in\{1,2,\cdots,m\}, j\in\{1,2,\cdots,n\}$.


  • Transformation as matrix multiplication
    • The $M$ under the transformation $A$ is their product -$MA$- of which each column is the linear combination of columns of the matrix $M$. The spatial information or the relationship of the neighbors is lost.
  • Spatial information extraction
    • Kernel in image processing takes the relationship of the neighboring entries into consideration. It transforms the neighboring entries into one real value. It is the local pattern that we can learn.

In mathematics, the matrix space $\mathbb{M}^{m\times n}$ in real domain, which consists of the matrix with $m\times n$ size, is linear isomorphic to the linear space $\mathbb{R}^{m\times n}$. And thus in 2-dimenional space, convolution corresponds to $\color{aqua}{\text{doubly block circulant matrix}}$ if the matrix is flatten.

Let $\otimes$ denote the convolution operator and $K\in\mathbb{R}^{w\times l}$, then we obtain that $$ F=M\otimes K \in \mathbb{R}^{(m-w+1)\times (n-l+1)}, $$

where $F_{ij}=\sum_{s=0}^{w-1}\sum_{t=0}^{l-1} M_{(i+s),(j+t)}K_{(s+1),(t+1)}$.

So that $\frac{\partial F_{ij}}{\partial K_{s,t}}=M_{(i+s-1),(j+t-1)}$.
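A direct NumPy sketch of the "valid" convolution (cross-correlation) formula above, looping over output positions for clarity rather than speed; the helper name and the toy filter are illustrative:

```python
import numpy as np

def conv2d_valid(M, K):
    """'Valid' 2-D convolution (cross-correlation) as defined above:
    F[i, j] = sum_{s, t} M[i+s, j+t] * K[s, t]."""
    m, n = M.shape
    w, l = K.shape
    F = np.zeros((m - w + 1, n - l + 1))
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            F[i, j] = np.sum(M[i:i + w, j:j + l] * K)
    return F

M = np.arange(25, dtype=float).reshape(5, 5)
K = np.array([[1.0, 0.0], [0.0, -1.0]])       # a simple diagonal-difference filter
print(conv2d_valid(M, K))
```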

The illustration of convolution operator
The Effect of Filter
(http://cs231n.github.io/assets/conv-demo/index.html)

Similar to the inner product of vectors, the convolution operators can compute the similarity between submatrices of the image and the kernels (also called filters).

The convolution operators play the role as parameter sharing and local connection.

For more information on convolution, click the following links.

Padding

The standard convolution operation omits the information at the boundaries. Padding is to add some $0$s outside the boundaries of the images.

Zero padding

https://zhuanlan.zhihu.com/p/36278093

Activation

As in feedforward neural networks, an additional non-linear operation called ReLU has been used after every Convolution operation.

Pooling as Subsampling

Pooling as subsampling is to make the model/network more robust or transformation invariant. Spatial Pooling (also called subsampling or down-sampling) is to use some summary statistic that extract from spatial neighbors, which reduces the dimensionality of each feature map but retains the most important information. The function of Pooling is

  • to progressively reduce the spatial size of the input representation and increase the size of the receptive field.
  • makes the network invariant to small transformations, distortions and translations in the input image.
  • helps us arrive at an almost scale invariant representation of our image (the exact term is "equivariant").

See more in the following links:

Max Pooling

It is to use the maximum to represent the local information.

See https://www.superdatascience.com/convolutional-neural-networks-cnn-step-2-max-pooling/.

Sum Pooling

It is to use the sum to represent the local information.

Average Pooling

It is to use the average to represent the local information.

Random Pooling

It is to draw a sample from the receptive field to represent the local information. https://www.cnblogs.com/tornadomeet/p/3432093.html
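
A small NumPy sketch covering max, sum and average pooling via a single helper with a pluggable reduction (the non-overlapping 2x2 window is an illustrative choice, not from the text):

```python
import numpy as np

def pool2d(M, size=2, stride=2, reduce=np.max):
    """Non-overlapping pooling; pass reduce=np.sum or np.mean for sum/average pooling."""
    m, n = M.shape
    out = np.zeros((m // stride, n // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = M[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = reduce(window)
    return out

M = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(M))                    # max pooling
print(pool2d(M, reduce=np.mean))    # average pooling
```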

CNN

A convolutional neural network (conv-net or CNN for short) is the assembly of convolution, padding, pooling and full connection, such as $$ M\stackrel{Conv 1}{\to}H_1 \stackrel{Conv 2}{\to} H_2 \dots \stackrel{Conv l}{\to}{H} \stackrel{\sigma}{\to} y. $$ The $i$th layer of a convolutional neural network can be expressed as

$$ \hat{H}_{i} = P\oplus H_{i-1}, \\ \tilde{H}_i = C_i\otimes\hat{H}_{i}, \\ Z_i = \mathrm{N}\cdot \tilde{H}_i, \\ H_i = \operatorname{Pooling}\cdot (\sigma\circ Z_i) $$

where $\otimes,\oplus,\cdot$ represent convolution operation, padding and pooling, respectively.

Diagram of Convolutional neural network

The outputs of each layer are matrices or tensors rather than real vectors in CNNs. The vanilla CNN has no feedback or loops in the architecture.

Optimization in CNNs

Backpropagation in CNNs

Convolutional neural networks extract the features of the images by trial and error to tune the convolutions $\{C_i \mid i=1,2,\dots, n\}$.

At each convolutional layer, we feed forwards the information like:

$$ \hat{H}_{i} = P \oplus H_{i-1}, \\ \tilde{H}_i = C_i \otimes\hat{H}_{i}, \\ Z_i = \mathrm{N}\cdot \tilde{H}_i, \\ H_i = \operatorname{Pooling}\cdot (\sigma\circ Z_i) $$

In order to design proper convolutions for specific task automatically according to the labelled images via gradient-based optimization methods, we should compute the gradient of error with respect to convolutions:

$$ \frac{\partial L}{\partial H_n} \to \frac{\partial H_n}{\partial C_n} \to \frac{\partial H_n}{\partial H_{n-1}}\to\cdots \to \frac{\partial H_1}{\partial M}. $$ In fully connected layer, the backpropagation is as the same as in feedforward neural networks.

In convolutional layer, it is made up of padding, convolution, activation, normalization and pooling.

Note that even if there is only one output of the convolutional layer (for example, when the kernels or filters have the same shape as the input and there is no pooling), it is still possible to compute the gradient of the kernels, as with the inner product.

For example, we compute the gradient with respect to the convolution:

$$ F:\mathbb{R}^{m\times n} \to \mathbb{R}^{\hat{m}\times \hat{n}}\\ F = Pooling \cdot \mathrm{N} \cdot \sigma[(P \oplus M) \otimes C], $$

which is a composite of operators/functions. So that by the chain rule of derivatives

$$ \frac{\partial F}{\partial K} = \frac{\partial Pooling}{\partial n} \frac{\partial n}{\partial \sigma} \frac{\partial \sigma}{\partial C} \frac{\partial C}{\partial M}. $$

Backward Propagation of the Pooling Layers

$$\frac{\partial Pooling}{\partial n}$$

Backward Propagation of the Normalization Layers

$$\frac{\partial n}{\partial \sigma}$$

Batch Normalization

It is an effective way to accelerate deep learning training.

And Batch Normalization transformation is differentiable so that we can compute the gradient in backpropagation. For example, let $\ell$ be the cost function to be minimized, then we could compute the gradients

$$ \frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i} \frac{\partial y_i}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i} \gamma $$


$$ \frac{\partial \ell}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \frac{\partial \hat{x}_i}{\partial \sigma_B^2} = \sum_{i=1}^{m}\frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{-\frac{1}{2}(x_i - \mu_B)}{(\sigma_B^2 + \epsilon)^{\frac{3}{2}}} $$


$$ \frac{\partial \ell}{\partial \mu_B} = \frac{\partial \ell}{\partial \sigma_B^2} \frac{\partial \sigma_B^2}{\partial \mu_B} + \sum_{i=1}^{m}\frac{\partial \ell}{\partial \hat{x}_i} \frac{\partial \hat{x}_i}{\partial \mu_B} = \frac{\partial \ell}{\partial \sigma_B^2}\Big[\frac{-2}{m}\sum_{i=1}^{m}(x_i-\mu_B)\Big] + \sum_{i=1}^{m}\frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} $$


$$ \frac{\partial \ell}{\partial {x}_i} = \frac{\partial \ell}{\partial \hat{x}_i} \frac{\partial \hat{x}_i}{\partial {x}_i} + \frac{\partial \ell}{\partial \mu_B} \frac{\partial \mu_B}{\partial {x}_i} + \frac{\partial \ell}{\partial \sigma^2_B} \frac{\partial \sigma^2_B}{\partial {x}_i} $$

$$ \frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i} \hat{x}_i, \qquad \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i} $$

Batch Normalization in Neural
BN
Batch Normalization in Neural Network
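
For illustration, a NumPy sketch of the Batch Normalization forward pass whose intermediate quantities ($\hat{x}_i$, $\mu_B$, $\sigma_B^2$) are exactly the ones differentiated above; the function name and batch shape are assumptions:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalization over a mini-batch x of shape (m, features):
    normalize each feature to zero mean / unit variance, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    y = gamma * x_hat + beta
    return y, x_hat, mu, var     # cache x_hat, mu, var for the backward pass above

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(8, 4))
y, *_ = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))
```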

Layer Normalization

Different from Batch Normalization, Layer Normalization computes the average $\mu$ and variance $\sigma$ over all the inputs to the given hidden layer: $$ \mu = \frac{1}{n}\sum_{i}^{n} x_i, \qquad \sigma = \sqrt{\frac{1}{n}\sum_{i}^{n}(x_i - \mu)^2 + \epsilon}, $$ where ${n}$ is the number of units in the given hidden layer.

Layer Normalization
LN
Layer Normalization in FNN

Weight Normalization

Backward Propagation of the Activation Layers

$$\frac{\partial \sigma}{\partial C}$$

| $\sigma$ | $\frac{\partial \sigma}{\partial C}$ |
|---|---|
| activation function | the partial derivatives after convolution |

Backward Propagation of the Convolution Layers

$$\frac{\partial C}{\partial M}$$

Data Augmentation

Interpretation of CNN

It has shown that

ImageNet trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies.

Recurrent Neural Networks and Long Short-Time Memory

601.765 Machine Learning: Linguistic & Sequence Modeling Scientia est Potentia

Recurrent neural networks are aimed to handle sequence data such as time series. Recall the recursive form of feedforward neural networks: $$ \begin{align} \mathbf{z}_i &= W_i H_{i-1}+b_i, \\ H_{i} &= \sigma\circ(\mathbf{z}_i), \end{align} $$

where $W_i\in\mathbb{R}^{l_{i}\times l_{i-1}}$, $H_i (\text{as well as } b_i)\in \mathbb{R}^{l_i}\ \forall i\in\{1, 2, \dots, D\}$ and $\sigma$ is the activation function. By convention, $H_0$ is defined as the input $X\in\mathbb{R}^{p}$. In short, the formula in the $i$th layer of a feedforward neural network is given by $$ H_i=\sigma\circ(W_i H_{i-1}+b_i). $$

This suits data sets that are independently and identically distributed. However, sequence data are not independently and identically distributed in most cases. For example, the outcome of the current decision determines the next decision.

In mathematics it can be expressed as

$$ H_{t}=\sigma\circ(X_t,H_{t-1})=\sigma\circ(W H_{t-1} + U X_{t}+b), $$

where $X_{t}\in\mathbb{R}^{p}$ is the input at time $t$, $\forall t\in\{1,2,\dots,\tau\}$. The hidden variable sequence $H_{t}$ captures the lossy history of the $X_t$ sequence, and hence serves as a type of memory. We can compute the gradients with respect to the parameters $W, U, b$:

$$ \frac{\partial H_t}{\partial W} = \sigma^{\prime} \circ H_{t-1},\quad \frac{\partial H_t}{\partial U} = \sigma^{\prime} \circ X_t, \quad \frac{\partial H_t}{\partial b} = \sigma^{\prime}, $$ treating $H_{t-1}$ as a constant within a single step.

RNN Cell

For each step from $t=1$ to $t=\tau$, the complete update equations of RNN:

$$ \begin{align} H_{t} &=\sigma\circ(W H_{t-1} + U X_{t} + b) \\ O_{t} &= \mathrm{softmax}(V H_{t} + c) \end{align} $$

where the parameters are the bias vectors $b$ and $c$ along with the weight matrices $U$, $V$ and $W$, respectively for input-to-hidden, hidden-to-output and hidden-to-hidden connections.
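
A minimal NumPy sketch of one step of these update equations, unrolled over a short random sequence; the dimension choices and tanh as the activation $\sigma$ are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev, W, U, V, b, c):
    """One step of the vanilla RNN above: H_t = tanh(W H_{t-1} + U X_t + b),
    O_t = softmax(V H_t + c)."""
    h_t = np.tanh(W @ h_prev + U @ x_t + b)
    o_t = softmax(V @ h_t + c)
    return h_t, o_t

rng = np.random.default_rng(0)
p, hidden, out = 3, 4, 2
W, U, V = rng.normal(size=(hidden, hidden)), rng.normal(size=(hidden, p)), rng.normal(size=(out, hidden))
b, c = np.zeros(hidden), np.zeros(out)
h = np.zeros(hidden)
for t in range(5):                        # unroll over a short input sequence
    h, o = rnn_step(rng.normal(size=p), h, W, U, V, b, c)
print(h, o)
```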

Types of RNN

The Back Propagation Through Time (BPTT) Algorithm

Bi-directional RNN

For each step from $t=1$ to $t = \tau$, the complete update equations of RNN:

$$ \begin{align} \stackrel{\rightarrow} H_{t} &= \sigma\circ(W_1\stackrel{\rightarrow}H_{t-1}+U_1X_{t}+b_1) \\ \stackrel{\leftarrow}H_{t} &= \sigma\circ(W_2\stackrel{\leftarrow}H_{t+1}+U_2X_{t}+b_2) \\ O_{t} &= \mathrm{softmax}(VH_{t}+c) \end{align} $$

where $H_{t} = [\stackrel{\rightarrow} H_{t};\stackrel{\leftarrow} H_{t}]$.

Bi-directional RNN
The bold line(__) is computed earlier than the dotted line(...).

deepai.org

LSTM

There are several architectures of LSTM units. A common architecture is composed of a memory cell, an input gate, an output gate and a forget gate. It was the first architecture to address the vanishing gradient problem and long-term dependencies in deep learning.

LSTM block
See LSTM block at(https://devblogs.nvidia.com/wp-content/uploads/2016/03/LSTM.png)
  • forget gate

$$f_{t} = \sigma(W_{f}[h_{t-1},x_t]+b_f). \tag{forget gate}$$

forget gate
  • input gate

$$ i_{t} = \sigma(W_i[h_{t-1}, x_t] + b_i), \quad \tilde{C}_{t} = \tanh(W_C[h_{t-1}, x_t] + b_C).\tag{input gate} $$

input gate
  • memory cell

$$C_{t} = f_t \odot C_{t-1}+i_{t} \odot {\tilde{C}_{t}}.\tag{memory cell}$$

memory cell
  • output gate

$$o_{t} = \sigma(W_{O}[h_{t-1},x_t]+b_{O}),\tag{output gate 1}$$ and $$Z_{t} = o_{t}\odot \tanh(C_{t}).\tag{output gate 2}$$

output gate
Inventor of LSTM
more on (http://people.idsia.ch/~juergen/)
Internals of a LSTM Cell
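
A hedged NumPy sketch of one LSTM step following the four gate equations above; packing the parameters into a list and the random initialization are illustrative choices, not from the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step following the gate equations above; [h, x] denotes concatenation."""
    Wf, bf, Wi, bi, Wc, bc, Wo, bo = params
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ z + bf)                 # forget gate
    i_t = sigmoid(Wi @ z + bi)                 # input gate
    C_tilde = np.tanh(Wc @ z + bc)             # candidate memory
    C_t = f_t * C_prev + i_t * C_tilde         # memory cell update
    o_t = sigmoid(Wo @ z + bo)                 # output gate
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

rng = np.random.default_rng(0)
hidden, inp = 4, 3
params = [rng.normal(size=(hidden, hidden + inp)) if k % 2 == 0 else np.zeros(hidden)
          for k in range(8)]
h, C = np.zeros(hidden), np.zeros(hidden)
h, C = lstm_step(rng.normal(size=inp), h, C, params)
print(h, C)
```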

Gated Recurrent Units (GRUs)

GRUs are a recently proposed alternative to LSTMs, that share its good properties, i.e., they are also designed to avoid the Vanishing Gradient problem. The main difference from LSTMs is that GRUs don’t have a cell memory state $C_t$ , but instead are able to function effectively using the Hidden State $Z_t$ alone.

  • Update Gate $$ \begin{equation} o_t = \sigma(W^o X_t + U^o Z_{t-1}). \tag{ Update Gate} \end{equation}$$

  • Reset Gate $$ \begin{equation} r_t = \sigma(W^r X_t + U^r Z_{t-1}). \tag{Reset Gate} \end{equation} $$

  • Hidden State $$ \begin{equation} {\tilde Z}_t = \tanh(r_t\odot U Z_{t-1} + W X_t), \\ Z_t = (1 - o_t)\odot {\tilde Z}_t + o_t\odot Z_{t-1}. \tag{Hidden State} \end{equation} $$


Deep RNN

Deep RNN is composed of RNN cells as an MLP is composed of perceptrons. For each step from $t=1$ to $t=\tau$, the complete update equations of a deep $d$-layer RNN at the $i$th layer are: $$ \begin{align} H_{t}^{i} &= \sigma\circ(W_{i} H_{t-1}^{i} + U_{i} X_{t} + b_i), \\ O_{t} &= \mathrm{softmax}(V H_{t}^{d} + c), \end{align} $$ where the parameters are the bias vectors $b_i$ and $c$ along with the weight matrices $U_i$, $V$ and $W_i$, respectively for input-to-hidden, hidden-to-output and hidden-to-hidden connections, for $i\in \{1,2,\cdots,d\}$.

Other RNN cells also can compose deep RNN via this stacking way such as deep Bi-RNN networks.

Deep Bi-RNN
brnn

Attention Mechanism

An attention model is a method that takes $n$ arguments $y_1, \dots, y_n$ and a context $c$. It returns a vector $z$ which is supposed to be the summary of the ${y}_i\in \mathbb{R}^{d}$, focusing on the information linked to the context $c$. More formally, it returns a weighted arithmetic mean of the $y_i$, where the weights are chosen according to the relevance of each $y_i$ given the context $c$. In mathematics, it can be expressed as:

$$ {\alpha}_i = \mathrm{softmax} [s(y_i,c)], \qquad z = \sum_{i=1}^{n} {\alpha}_i y_i $$

where $s(\cdot, \cdot)$ is the attention scoring function. The attention scoring function $s({y}_i, c)$ is diverse, such as:

  • the additive model $s({y}_i, c) = v^{T} tanh\circ (W {y}_i + U c)$, where $v \in \mathbb{R}^{d}$, $W \in \mathbb{R}^{d\times d}$, $U \in \mathbb{R}^{d}$ are parameters to learn;
  • the inner product model $s({y}_i, c) = \left\langle {y}_i, c \right\rangle$, i.e. the inner product of ${y}_i$ and $c$;
  • the scaled inner product model $s({y}_i, c) = \frac{\left\langle {y}_i, c \right\rangle}{d}$, where $d$ is the dimension of the input ${y}_i$;
  • the bilinear model $s({y}_i, c) = {y}_i^{T} W c$, where $W\in \mathbb{R}^{d\times d}$ is parameter matrix to learn.

It is usually used as one component of a larger network, like normalization.
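
A short sketch of this weighted-summary computation with the (scaled) inner product scoring functions; the matrix layout (one $y_i$ per row) and function names are assumptions for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(Y, c, score="dot"):
    """Weighted summary z = sum_i alpha_i y_i with alpha = softmax of the scores.
    Y has one y_i per row; 'dot' and 'scaled' are two of the scoring functions above."""
    if score == "dot":
        s = Y @ c
    elif score == "scaled":
        s = (Y @ c) / Y.shape[1]
    alpha = softmax(s)
    return alpha @ Y, alpha

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 8))        # five candidate vectors y_i
c = rng.normal(size=8)             # context vector
z, alpha = attention(Y, c, score="scaled")
print(alpha.round(3), z.shape)
```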



Recursive Neural Network

Diagram of Recursive Neural Network

Generative Models

Generative Adversarial Network

http://unsupervised.cs.princeton.edu/deeplearningtutorial.html

It originates from http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf. It is a generative model trained via an adversarial process. It trains a generative model $G$ that captures the data distribution, and a discriminative model $D$ that estimates the probability that a sample came from the training data rather than $G$. The training procedure for $G$ is to maximize the probability of $D$ making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions $G$ and $D$, a unique solution exists, with $G$ recovering the training data distribution and $D$ equal to $\frac{1}{2}$ everywhere.

It is not to minimize the cost function or errors as that in supervised machine learning. In mathematics, it is saddle point optimization. Thus some optimization or regularization techniques are not suitable for this framework. It requires some methods to find the proper generator $G$ and discriminator $D$.

Generative Adversarial Network

For a generative model, it is really important to evaluate the quality of the model output. Suppose there is a model used to write traditional Chinese poems: how does the machine know that a poem is a fantastic masterpiece? How does it write a novel poem on a given topic or in a given form? The loss function or evaluation criterion is implicit.
One solution is to train another program to evaluate the performance of the generative model.

The idea behind the GAN:

  • Idea 1: Deep nets are good at recognizing images, so let one judge the outputs of a generative model;
  • Idea 2: If a good discriminator net has been trained, use it to provide “gradient feedback” that improves the generative model.
  • Idea 3: Turn the training of the generative model into a game of many moves or alternations.

In mathematics, it is in the following form $$\min_{G}\max_{D}\ \mathbb{E}_{x\sim P} [f(D(x))] + \mathbb{E}_{h}[f(1 - D(G(h)))]$$

where $G$ is the generator and $D$ is the discriminator.
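A minimal PyTorch sketch of the alternating updates implied by this minimax objective. The architectures and dimensions are placeholders, and the generator uses the common non-saturating loss rather than the literal $\log(1-D(G(h)))$ term.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    batch = real_batch.size(0)
    # Discriminator step: push D(real) toward 1 and D(G(h)) toward 0.
    h = torch.randn(batch, latent_dim)
    fake = G(h).detach()
    loss_D = bce(D(real_batch), torch.ones(batch, 1)) + \
             bce(D(fake), torch.zeros(batch, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # Generator step (non-saturating form): push D(G(h)) toward 1.
    h = torch.randn(batch, latent_dim)
    loss_G = bce(D(G(h)), torch.ones(batch, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```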

Cycle-GAN


More generative models include GLOW, variational autoencoders and energy-based models.

Variational Autoencoder

PixelRNN


Geometric Deep Learning

Graph Convolution Network

A graph can be represented by its adjacency matrix, as shown in Graph Algorithm. However, the adjacency matrix only describes the connections between the nodes; the features of the nodes themselves do not appear, yet they really matter. For example, the chemical bonds of a molecule can be represented as an adjacency matrix, while it is the atoms in the molecule that really determine its properties.

A naive approach is to concatenate the feature matrix $X\in \mathbb{R}^{N\times E}$ and adjacency matrix $A\in \mathbb{R}^{N\times N}$, i.e. $X_{in}=[X, A]\in \mathbb{R}^{N\times (N+E)}$. And what is the output?

How can deep learning apply to them?

For these models, the goal is then to learn a function of signals/features on a graph $G=(V,E)$ which takes as input:

  • A feature description $x_i$ for every node $i$; summarized in a $N\times D$ feature matrix $X$ ($N$: number of nodes, $D$: number of input features)
  • A representative description of the graph structure in matrix form; typically in the form of an adjacency matrix $A$ (or some function thereof)

and produces a node-level output $Z$ (an $N\times F$ feature matrix, where $F$ is the number of output features per node). Graph-level outputs can be modeled by introducing some form of pooling operation (see, e.g. Duvenaud et al., NIPS 2015).

Every neural network layer can then be written as a non-linear function $${H}_{i+1} = \sigma \circ ({H}_{i}, A)$$ with ${H}_0 = {X}_{in}$ and ${H}_{d} = Z$ (or $z$ for graph-level outputs), $d$ being the number of layers. The specific models then differ only in how $\sigma$ is chosen and parameterized.

For example, we can consider a simple form of a layer-wise propagation rule $$ {H}_{i+1} = \sigma \circ ({H}_{i}, A)=\sigma \circ(A {H}_{i} {W}_{i}) $$ where ${W}_{i}$ is a weight matrix for the $i$-th neural network layer and $\sigma (\cdot)$ is a non-linear activation function such as ReLU.

  • But first, let us address two limitations of this simple model: multiplication with $A$ means that, for every node, we sum up all the feature vectors of all neighboring nodes but not the node itself (unless there are self-loops in the graph). We can "fix" this by enforcing self-loops in the graph: we simply add the identity matrix ${I}$ to ${A}$.

  • The second major limitation is that $A$ is typically not normalized and therefore the multiplication with $A$ will completely change the scale of the feature vectors (we can understand that by looking at the eigenvalues of $A$). Normalizing $A$ such that all rows sum to one, i.e. $D^{-1}A$, where $D$ is the diagonal node degree matrix, gets rid of this problem.

In fact, the propagation rule introduced in Kipf & Welling (ICLR 2017) is given by: $$ {H}_{i+1} = \sigma \circ ({H}_{i}, A)=\sigma \circ(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} {H}_{i} {W}_{i}), $$ with $\hat{A}=A+I$, where ${I}$ is the identity matrix and $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$. See more details at Multi-layer Graph Convolutional Network (GCN) with first-order filters.

Like other neural networks, a GCN is also a composition of linear and nonlinear mappings. In detail:

  1. $\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$ normalizes the graph structure;
  2. the normalized structure is multiplied by the node features and the layer weights;
  3. nonlinearity is added via the activation function $\sigma$.

See more at experoinc.com.
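A minimal NumPy sketch of one GCN layer under the Kipf & Welling propagation rule above; `A` is the raw adjacency matrix, `H` the node features and `W` the layer weights (illustrative names, dense arithmetic only).

```python
import numpy as np

def gcn_layer(A, H, W, activation=np.tanh):
    """sigma(D_hat^{-1/2} (A + I) D_hat^{-1/2} H W) for one layer."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    d = A_hat.sum(axis=1)                      # node degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # \hat{D}^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalization
    return activation(A_norm @ H @ W)          # propagate, transform, add nonlinearity
```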

Graph convolution networks have the potential to support reasoning, as a blend of probabilistic graphical models and deep learning.

GCNs can be regarded as the counterpart of CNNs for graphs, so techniques such as normalization, attention mechanisms and even adversarial variants can be extended to graph-structured data.

ChebNet, CayleyNet, MotifNet

In the previous post, the convolution of the graph Laplacian is defined in its graph Fourier space, as outlined in the paper of Bruna et al. (arXiv:1312.6203). However, the eigenmodes of the graph Laplacian are not ideal because they make the bases graph-dependent. A lot of work has been done to solve this problem, with the help of various special functions to express the filter functions. Examples include Chebyshev polynomials and the Cayley transform.

Defining filters as polynomials of the eigenvalues of the graph Laplacian, as ChebNet does, avoids any explicit eigendecomposition and realizes convolution by means of efficient sparse routines.

CayleyNet

As with ChebNet, defining filters as functions of the Laplacian eigenvalues avoids any explicit eigendecomposition and realizes convolution by means of efficient sparse routines. The main idea behind CayleyNet is to achieve a spectral zoom property by means of the Cayley transform: $$ C(\lambda) = \frac{\lambda - i}{\lambda + i} $$

Instead of Chebyshev polynomials, it approximates the filter as: $$ g(\lambda) = c_0 + \sum_{j=1}^{r}\left[c_j C^{j}(h\lambda) + c_j^{\star} \left(C^{j}(h\lambda)\right)^{\star}\right] $$ where $c_0$ is real and the other $c_j$'s are generally complex, ${h}$ is a zoom parameter, and the $\lambda$'s are the eigenvalues of the graph Laplacian. Tuning ${h}$ finds the best zoom that spreads the top eigenvalues, and the coefficients $c_j$ are learned during training. This solves the problem of unfavorable eigenvalue clusters in ChebNet. A small numerical sketch follows.
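A small NumPy sketch that simply evaluates this filter on an array of eigenvalues, to make the formula concrete; in practice CayleyNet applies the filter to the Laplacian via sparse linear solves rather than through an eigendecomposition, so this is purely illustrative.

```python
import numpy as np

def cayley_filter(lmbda, c, h):
    """Evaluate g(lambda) = c0 + sum_j [ c_j C(h*lambda)^j + conj(c_j) conj(C(h*lambda)^j) ].
    lmbda: array of Laplacian eigenvalues; c = [c0, c1, ..., cr] with c0 real and
    c_j complex for j >= 1; h: spectral zoom parameter (illustrative names)."""
    C = (h * lmbda - 1j) / (h * lmbda + 1j)   # Cayley transform of the zoomed eigenvalues
    g = np.real(c[0]) * np.ones_like(lmbda)
    for j in range(1, len(c)):
        g += 2.0 * np.real(c[j] * C**j)       # c_j C^j plus its complex conjugate term
    return g
```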

MotifNet

MotifNet aims to address convolution on directed graphs.

Graph Embedding

Differential geometry based geometric data analysis (DG-GDA) of molecular datasets

DeepRL

Reinforcement Learning (RL) has achieved many successes over the years in training autonomous agents to perform simple tasks. However, one of the major remaining challenges in RL is scaling it to high-dimensional, real-world applications.

Although many works have already focused on strategies to scale up RL techniques and to find solutions for more complex problems with reasonable success, many issues still exist. This workshop encourages discussion of diverse approaches to accelerate and generalize RL, such as the use of approximations, abstractions, hierarchical approaches, and Transfer Learning.

Scaling-up RL methods has major implications on the research and practice of complex learning problems and will eventually lead to successful implementations in real-world applications.

DeepRL

Deep Learning Ensemble

Deep learning and ensemble learning share some similar guiding principles.

Selective Ensemble

An ensemble is generated by training multiple component learners for the same task and then combining their predictions. In most ensemble algorithms, all the trained component learners are employed in constituting an ensemble. But it has been shown that when the learners are neural networks, it may be better to ensemble some instead of all of the learners. That work generalizes this claim to situations where the component learners are decision trees. Experiments show that ensembles generated by a selective ensemble algorithm, which selects some of the trained C4.5 decision trees to make up an ensemble, may be not only smaller in size but also stronger in generalization than ensembles generated by non-selective algorithms.
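As a generic illustration of "ensembling some instead of all", here is a sketch of greedy forward selection of ensemble members on a validation set. This is not the selection procedure of the cited work (which uses a more sophisticated, learned selection); it only assumes integer class predictions and majority voting.

```python
import numpy as np

def select_ensemble(val_preds, y_val, max_members=None):
    """Greedily pick a subset of models whose majority vote maximizes validation accuracy.
    val_preds: list of integer class-prediction arrays, one per trained model."""
    selected, best_acc = [], -1.0
    max_members = max_members or len(val_preds)
    while len(selected) < max_members:
        best_cand = None
        for i in range(len(val_preds)):
            if i in selected:
                continue
            votes = np.stack([val_preds[j] for j in selected + [i]])
            vote = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
            acc = (vote == y_val).mean()
            if acc > best_acc:
                best_acc, best_cand = acc, i
        if best_cand is None:        # no remaining model improves accuracy: stop early
            break
        selected.append(best_cand)
    return selected
```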

Snapshot

In contrast to traditional ensembles (which produce an ensemble of multiple neural networks), the goal of this work is to train a single neural network, have it converge to several local minima along its optimization path, and save the model parameters at those points to obtain an ensemble model. Since the number of possible local minima grows exponentially with the number of parameters, and different local minima often have very similar error rates, the corresponding neural networks tend to make different mistakes.


Snapshot Ensembling generates an ensemble of accurate and diverse models from a single training run, with an optimization process that visits several local minima before converging to a final solution. At each such minimum the parameters are saved as a model snapshot, and the snapshots' predictions are averaged at test time.
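A sketch of the snapshot idea, assuming a PyTorch-style optimizer with `param_groups` and a user-supplied `train_one_epoch` closure; a cyclic cosine learning-rate schedule restarts several times within one run and a snapshot is saved at the end of each cycle. Names and schedule details are illustrative rather than the exact recipe of the original paper.

```python
import math
import copy

def snapshot_training(model, optimizer, train_one_epoch, epochs, cycles, lr_max):
    """Collect model snapshots at the end of each learning-rate cycle."""
    snapshots = []
    epochs_per_cycle = epochs // cycles
    for epoch in range(epochs):
        t = (epoch % epochs_per_cycle) / epochs_per_cycle
        lr = 0.5 * lr_max * (1.0 + math.cos(math.pi * t))   # cosine annealing within a cycle
        for group in optimizer.param_groups:
            group["lr"] = lr
        train_one_epoch(model, optimizer)
        if (epoch + 1) % epochs_per_cycle == 0:              # end of a cycle: take a snapshot
            snapshots.append(copy.deepcopy(model.state_dict()))
    return snapshots   # average the snapshots' predictions at test time
```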

Fast Geometric Ensembling (FGE)

Ensemble Distribution Distillation

EnD^2 enables a single model to retain both the improved classification performance of ensemble distillation as well as information about the diversity of the ensemble, which is useful for uncertainty estimation. A solution for EnD^2 based on Prior Networks, a class of models which allow a single neural network to explicitly model a distribution over output distributions, is proposed in this work. The properties of EnD^2 are investigated on both an artificial dataset, and on the CIFAR-10, CIFAR-100 and TinyImageNet datasets, where it is shown that EnD^2 can approach the classification performance of an ensemble, and outperforms both standard DNNs and Ensemble Distillation on the tasks of misclassification and out-of-distribution input detection.

Stochastic Weight Averaging (SWA)

SelfieBoost

Theories of Deep Learning

ICML 2017 organized a workshop on Principled Approaches to Deep Learning:

The recent advancements in deep learning have revolutionized the field of machine learning, enabling unparalleled performance and many new real-world applications. Yet, the developments that led to this success have often been driven by empirical studies, and little is known about the theory behind some of the most successful approaches. While theoretically well-founded deep learning architectures had been proposed in the past, they came at a price of increased complexity and reduced tractability. Recently, we have witnessed considerable interest in principled deep learning. This led to a better theoretical understanding of existing architectures as well as development of more mature deep models with solid theoretical foundations. In this workshop, we intend to review the state of those developments and provide a platform for the exchange of ideas between the theoreticians and the practitioners of the growing deep learning community. Through a series of invited talks by the experts in the field, contributed presentations, and an interactive panel discussion, the workshop will cover recent theoretical developments, provide an overview of promising and mature architectures, highlight their challenges and unique benefits, and present the most exciting recent results.

Topics of interest include, but are not limited to:

  • Deep architectures with solid theoretical foundations
  • Theoretical understanding of deep networks
  • Theoretical approaches to representation learning
  • Algorithmic and optimization challenges, alternatives to backpropagation
  • Probabilistic, generative deep models
  • Symmetry, transformations, and equivariance
  • Practical implementations of principled deep learning approaches
  • Domain-specific challenges of principled deep learning approaches
  • Applications to real-world problems

There are more mathematical perspectives on deep learning: dynamical systems, thermodynamics, Bayesian statistics, random matrix theory, numerical optimization, algebra and differential equations.

Information theory and coding theory help to accelerate deep neural network inference as well as to inform computer system design.

The limitations and extensions of deep learning methods are also discussed, such as the F-principle, capsule networks, and biologically plausible methods. Deep learning remains more an engineering practice than a science; a computational, evolutionary, adaptive cognitive intelligence has not yet emerged.



Application

In fact, deep learning is widely applied to prediction and classification across diverse domains, from image recognition to drug discovery and even scientific computation, although there is no first principle that guides how to apply it to these domains. The modern history of deep learning begins with its surpassing human performance in image recognition and spoken language recognition. Current progress is in natural language processing with BERT, XLNet and more.

Even though deep learning is young and cutting-edge, many pioneers have contributed to its development.

Computer Vision

Spoken Language Processing

Natural Language Processing

Finance

Brain and Cognition Science

The Future

The ultimate goal is general artificial intelligence.

Beyond Back-propagation

Training deep learning models does not necessarily require gradients; alternatives include ADMM and simulated annealing.

Operator Splitting Methods For Training Deep Neural Network

By rewriting the activation function as an equivalent proximal operator, we approximate a feed-forward neural network by adding the proximal operators to the objective function as penalties; hence we call it the lifted proximal operator machine (LPOM). LPOM is block multi-convex in all layer-wise weights and activations. This allows us to use block coordinate descent to update the layer-wise weights and activations in parallel. Most notably, we only use the mapping of the activation function itself, rather than its derivatives, thus avoiding the gradient vanishing or blow-up issues in gradient-based training methods. So our method is applicable to various non-decreasing Lipschitz continuous activation functions, which can be saturating and non-differentiable. LPOM does not require more auxiliary variables than the layer-wise activations, thus using roughly the same amount of memory as stochastic gradient descent (SGD) does. We further prove the convergence of updating the layer-wise weights and activations. Experiments on MNIST and CIFAR-10 datasets testify to the advantages of LPOM.

Layer-wise Relevance Propagation

Layer-wise Relevance Propagation (LRP) is a method that identifies important pixels by running a backward pass in the neural network. The backward pass is a conservative relevance redistribution procedure, where neurons that contribute the most to the higher layer receive the most relevance from it. A one-layer sketch of this redistribution follows.
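A one-layer NumPy sketch of the basic LRP-$\epsilon$ redistribution rule for a dense layer, to make the "conservative redistribution" concrete; `a` are the layer's input activations, `W` its weights, and bias handling is omitted (illustrative only).

```python
import numpy as np

def lrp_dense(a, W, R_out, eps=1e-6):
    """Redistribute output relevance R_out back to the inputs of a dense layer:
    R_i = sum_j (a_i * w_ij) / (sum_i' a_i' * w_i'j) * R_j  (epsilon-stabilized)."""
    z = a[:, None] * W                               # contributions z_ij = a_i * w_ij
    s = z.sum(axis=0)                                # total contribution to each output j
    denom = s + eps * np.where(s >= 0, 1.0, -1.0)    # stabilized normalizer
    return (z / denom) @ R_out                       # relevance assigned to each input i
```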

Target Propagation

Back-propagation has been the workhorse of recent successes of deep learning but it relies on infinitesimal effects (partial derivatives) in order to perform credit assignment. This could become a serious issue as one considers deeper and more non-linear functions, e.g., consider the extreme case of non-linearity where the relation between parameters and cost is actually discrete. Inspired by the biological implausibility of back-propagation, a few approaches have been proposed in the past that could play a similar credit assignment role as backprop. In this spirit, we explore a novel approach to credit assignment in deep networks that we call target propagation. The main idea is to compute targets rather than gradients, at each layer. Like gradients, they are propagated backwards. In a way that is related but different from previously proposed proxies for back-propagation which rely on a backwards network with symmetric weights, target propagation relies on auto-encoders at each layer. Unlike back-propagation, it can be applied even when units exchange stochastic bits rather than real numbers. We show that a linear correction for the imperfectness of the auto-encoders is very effective to make target propagation actually work, along with adaptive learning rates.

Gradient Target Propagation

We report a learning rule for neural networks that computes how much each neuron should contribute to minimizing a given cost function via the estimation of its target value. By theoretical analysis, we show that this learning rule contains backpropagation, Hebbian learning, and additional terms. We also give a general technique for weight initialization. Our results are at least as good as those obtained with backpropagation.

Capsule Networks and More

Capsule Networks provide a way to detect parts of objects in an image and represent spatial relationships between those parts. This means that capsule networks are able to recognize the same object in a variety of different poses even if they have not seen that pose in training data.

MIND-Net

From the analytical perspective, the ad hoc nature of deep learning renders its success at the mercy of trial and error. To rectify this problem, we advocate a methodic learning paradigm, MIND-Net, which is computationally efficient in training the networks and yet mathematically feasible to analyze. MIND-Net hinges upon the use of an effective optimization metric, called Discriminant Information (DI). It is used as a surrogate of popular metrics such as 0-1 loss or prediction accuracy. Mathematically, DI is equivalent or closely related to Gauss' LSE, Fisher's FDR, and Shannon's mutual information. We shall explain why higher DI means higher linear separability, i.e. data with higher DI are more discriminable. In fact, it can be shown, both theoretically and empirically, that a high DI score usually implies a high prediction accuracy.