ReLU Networks

The ReLU activation function is a nonlinear function defined as $$\operatorname{ReLU}(x)=\sigma(x)=\max(x,0)=[x]_{+}=\arg\min_{z}\|x-z\|_2^2+\chi_{z\geq 0}(z)$$ where $\chi_{z\geq 0}(z)=0$ if $z\geq 0$ and $\chi_{z\geq 0}(z)=\infty$ otherwise. We say $\chi_{z\geq 0}$ is the characteristic function of the set $\{z\geq 0\}$. And we can generalize it to higher-dimensional spaces componentwise, i.e., for a vector $\vec{x}\in\mathbb{R}^n$, $$\sigma(\vec{x})=\operatorname{ReLU}(\vec{x})=(\operatorname{ReLU}(x_1),\cdots, \operatorname{ReLU}(x_i),\cdots, \operatorname{ReLU}(x_n))^T.$$

This operator is a projection, so that $\sigma\circ\sigma(x)=\sigma(x)$ for all $x\in\mathbb{R}$. It maps the real line to the nonnegative half-line: $\mathbb{R}\to\mathbb{R}_{+}$. And we can rewrite it in the following way: $$\sigma(z)=\mathbb{I}_{z>0}(z)\,z,\qquad \mathbb{I}_{z>0}(z)=\begin{cases} 1, &\text{if } z>0;\\ 0, &\text{otherwise}. \end{cases}$$ A notable advantage of this operator is that it mitigates the vanishing-gradient problem, because the gradient (more precisely, a subgradient at the origin) is piecewise constant: $$\sigma^{\prime}(z)=\frac{d\sigma(z)}{d z}=\mathbb{I}_{z>0}(z)\quad\forall z\neq 0.$$

And we can find that $$\sigma(x)=\sigma^{\prime}(x)\,x,\qquad \sigma(x)=\sigma^{\prime}(x)\,\sigma(x).$$
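As a quick numerical check of these identities (a minimal NumPy sketch; the sample points are chosen arbitrarily, away from the origin):

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(x, 0), applied elementwise."""
    return np.maximum(x, 0.0)

def relu_grad(x):
    """(Sub)gradient of ReLU: the indicator of {x > 0}."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.3, 1.7])                 # sample points away from 0
assert np.allclose(relu(relu(x)), relu(x))            # projection: sigma(sigma(x)) = sigma(x)
assert np.allclose(relu(x), relu_grad(x) * x)         # sigma(x) = sigma'(x) * x
assert np.allclose(relu(x), relu_grad(x) * relu(x))   # sigma(x) = sigma'(x) * sigma(x)
```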

Here we introduce some applications of ReLU that predate deep learning.

Regression discontinuity (RD) has emerged as one of the most credible non-experimental strategies for the analysis of causal effects. In the RD design, all units have a score, and a treatment is assigned to those units whose value of the score exceeds a known cutoff or threshold, and not assigned to those units whose value of the score is below the cutoff. The key feature of the design is that the probability of receiving the treatment changes abruptly at the known threshold. If units are unable to perfectly “sort” around this threshold, the discontinuous change in this probability can be used to learn about the local causal effect of the treatment on an outcome of interest, because units with scores barely below the cutoff can be used as a comparison group for units with scores barely above it.
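The link to ReLU can be made concrete: a common sharp-RD specification fits separate linear trends on each side of the cutoff, which amounts to regressing on the hinge features $\max(x-c,0)$ and $\max(c-x,0)$ together with a jump indicator. The sketch below is only illustrative; the cutoff, the simulated effect size, and the noise level are assumptions made for this example, not taken from any reference.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 0.0                                    # known cutoff (illustrative)
x = rng.uniform(-1, 1, 500)                # running variable (score)
tau = 2.0                                  # simulated jump at the cutoff
y = 1.0 + 0.8 * x + tau * (x >= c) + rng.normal(0, 0.3, x.size)

# Design matrix: intercept, jump indicator, and ReLU/hinge trends on each side.
X = np.column_stack([
    np.ones_like(x),
    (x >= c).astype(float),                # discontinuity at the cutoff
    np.maximum(x - c, 0.0),                # slope above the cutoff
    np.maximum(c - x, 0.0),                # slope below the cutoff
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated treatment effect:", coef[1])   # should be close to the simulated tau = 2.0
```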

Multivariate Adaptive Regression Splines (MARS) is a method for flexible modelling of high dimensional data. The model takes the form of an expansion in product spline basis functions, where the number of basis functions as well as the parameters associated with each one (product degree and knot locations) are automatically determined by the data. This procedure is motivated by recursive partitioning (e.g. CART) and shares its ability to capture high order interactions. However, it has more power and flexibility to model relationships that are nearly additive or involve interactions in at most a few variables, and produces continuous models with continuous derivatives. In addition, the model can be represented in a form that separately identifies the additive contributions and those associated with different multivariable interactions.
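The MARS basis functions are exactly such hinges: mirrored pairs $(x-t)_+$ and $(t-x)_+$ at knots chosen from the data. Below is a minimal least-squares sketch with fixed, hand-picked knots and an illustrative target function; the adaptive knot selection and model pruning of real MARS are deliberately omitted.

```python
import numpy as np

def hinge_pair(x, t):
    """MARS basis pair at knot t: (x - t)_+ and (t - x)_+, i.e. two mirrored ReLUs."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x)                  # illustrative target

knots = [0.25, 0.5, 0.75]                  # fixed knots (MARS would select these adaptively)
columns = [np.ones_like(x)]
for t in knots:
    columns.extend(hinge_pair(x, t))
B = np.column_stack(columns)               # basis matrix of hinge features
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
print("max abs fit error:", np.max(np.abs(B @ coef - y)))
```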

Those pieces come together into a learning function $F(x, v)$ with weights $x$ that capture information from the training data $v$, to prepare for use with new test data. Here are the important steps in creating that function $F$:

| Order | Component | Meaning |
| --- | --- | --- |
| 1 | Key operation | Composition $F = F_3(F_2(F_1(x, v)))$ |
| 2 | Key rule | Chain rule for $x$-derivatives of $F$ |
| 3 | Key algorithm | Stochastic gradient descent to find the best weights $x$ |
| 4 | Key subroutine | Backpropagation to execute the chain rule |
| 5 | Key nonlinearity | $\operatorname{ReLU}(y) = \max(y, 0) = \text{ramp function}$ |

$$\fbox{The learning function $F$ is continuous and piecewise linear in $v$.}$$

A ReLU network takes ReLU layers as its components: $$h_i=\sigma(W_{i-1}h_{i-1})\quad\forall\, 1\leq i\leq L,$$ where $h_0$ is the raw input $x$ and each $W_i$ is a linear operator; a minimal forward-pass sketch follows the list below.

  • (1) When $x\in\mathbb{R}$, it is simple: $W_i\in\mathbb{R}$.
  • (2) When $x\in\mathbb{R}^n$ for $n\geq 2$, $W_i\in\mathbb{R}^{m\times n}$, so $W_ih_i$ is a matrix-vector multiplication. The network is a piecewise linear function of $x$ no matter how large the number of layers $L$ is, and on each linear region $\sigma(W_Lh_L)=W_{x}x$, where $W_{x}$ is a matrix determined by the raw input $h_0=x$.
  • (3) When $x\in\mathbb{R}^{m\times n}$ for $m,n\geq 2$, such as in a ConvNet, $W_ih_i$ is the result of a convolution and $W_i$ is the convolution kernel (filter).
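Here is the forward-pass sketch promised above, for case (2); the layer widths and random weights are illustrative. It also verifies numerically that, on the activation region of a given input, the whole network collapses to a single matrix $W_x$:

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [4, 8, 8, 3]                                  # layer widths (illustrative)
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x, Ws):
    """h_i = ReLU(W_{i-1} h_{i-1}), starting from h_0 = x."""
    h = x
    for W in Ws:
        h = np.maximum(W @ h, 0.0)
    return h

x = rng.normal(size=sizes[0])
out = forward(x, Ws)

# On the activation region of x, the network is one matrix W_x = D_L W_L ... D_1 W_1,
# where D_i is the diagonal 0/1 matrix recording which units are active.
Wx = np.eye(sizes[0])
h = x
for W in Ws:
    z = W @ h
    D = np.diag((z > 0).astype(float))
    Wx = D @ W @ Wx
    h = np.maximum(z, 0.0)
assert np.allclose(out, Wx @ x)                       # sigma(W_L h_L) = W_x x on this region
```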

ReLU Networks: Approximation and Expressive Power

We can use the ReLU function to approximate an indicator function: $$\sigma(x)-\sigma(x-\tfrac{1}{a})=\begin{cases} 0, &\text{if } x\leq 0;\\ x, &\text{if } 0< x\leq \frac{1}{a};\\ \frac{1}{a}, &\text{otherwise}, \end{cases}$$ so that, after rescaling, $a\left(\sigma(x)-\sigma(x-\frac{1}{a})\right)$ tends to the indicator of $\{x>0\}$ as $a\to\infty$.

And we can combine such terms to generate more functions, such as the trapezoid $t(x) := \sigma(x) - \sigma(x - 1) - \sigma(x - 2) + \sigma(x - 3)$.
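A short sketch evaluating both constructions (the step width $1/a$ and the evaluation grid are arbitrary choices):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def soft_step(x, a):
    """a * (ReLU(x) - ReLU(x - 1/a)): 0 for x <= 0, ramps up, equals 1 for x >= 1/a."""
    return a * (relu(x) - relu(x - 1.0 / a))

def trapezoid(x):
    """t(x) = ReLU(x) - ReLU(x-1) - ReLU(x-2) + ReLU(x-3): rises on [0,1], flat on [1,2], falls on [2,3]."""
    return relu(x) - relu(x - 1) - relu(x - 2) + relu(x - 3)

x = np.linspace(-1, 4, 11)
print(soft_step(x, a=10))   # approaches the indicator of {x > 0} as a grows
print(trapezoid(x))
```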

  • Every continuous function can be approximated up to an error of $\varepsilon > 0$ by a neural network with a single hidden layer and $O(N)$ neurons (see the sketch after this list).
  • Neural networks can realize appropriate shearlet generators.
  • Deep neural networks are optimal for the approximation of piecewise smooth functions on manifolds.
  • For certain topologies, the standard backpropagation algorithm generates a deep neural network which provides those optimal approximation rates; interestingly, it even yields $\alpha$-shearlet-like functions.
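As an illustration of the first bullet, one can place $N$ ReLU neurons with evenly spaced breakpoints and solve for the output weights by least squares. The target function, the interval, and the breakpoint placement below are assumptions made only for this sketch:

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def fit_shallow_relu(f, N, a=0.0, b=1.0, m=2000):
    """Least-squares fit of f on [a, b] with the basis {1, ReLU(x - t_1), ..., ReLU(x - t_N)}."""
    x = np.linspace(a, b, m)
    knots = np.linspace(a, b, N, endpoint=False)
    Phi = np.column_stack([np.ones_like(x)] + [relu(x - t) for t in knots])
    c, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)
    return x, Phi @ c

f = lambda x: np.sin(2 * np.pi * x)        # illustrative continuous target
for N in (4, 16, 64):
    x, fhat = fit_shallow_relu(f, N)
    print(N, "neurons, max error:", np.max(np.abs(fhat - f(x))))
```

The printed error shrinks as $N$ grows, consistent with the single-hidden-layer approximation statement above.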

Here we show that a deep convolutional neural network (CNN) is universal, meaning that it can be used to approximate any continuous function to an arbitrary accuracy when the depth of the neural network is large enough. This answers an open question in learning theory. Our quantitative estimate, given tightly in terms of the number of free parameters to be computed, verifies the efficiency of deep CNNs in dealing with large dimensional data. Our study also demonstrates the role of convolutions in deep CNNs.


Convergence

Generalization of Deep ReLU Networks

Recently, the path norm was proposed as a new capacity measure for neural networks with the Rectified Linear Unit (ReLU) activation function, one that takes the rescaling-invariance property of ReLU into account. It has been shown that the generalization error bound in terms of the path norm explains the empirical generalization behavior of ReLU neural networks better than bounds based on other capacity measures. Moreover, optimization algorithms which add the path norm as a regularization term to the loss function, such as Path-SGD, have been shown to achieve better generalization performance. However, the path norm counts the values of all paths, and hence a capacity measure based on the path norm could be improperly influenced by the dependency among different paths. It is also known that each path of a ReLU network can be represented by a small group of linearly independent basis paths using multiplication and division operations, which indicates that the generalization behavior of the network depends on only a few basis paths. Motivated by this, we propose a new norm, the Basis-path Norm, based on a group of linearly independent paths, to measure the capacity of neural networks more accurately. We establish a generalization error bound based on this basis-path norm, and show via extensive experiments that it explains the generalization behavior of ReLU networks more accurately than previous capacity measures. In addition, we develop optimization algorithms which minimize the empirical risk regularized by the basis-path norm. Our experiments on benchmark datasets demonstrate that the proposed regularization method achieves clearly better performance on the test set than previous regularization approaches.
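For a fully connected ReLU network without biases, the plain $\ell_1$ path norm sums, over all input-to-output paths, the product of the absolute weights along the path, and it can be computed by propagating entrywise absolute matrices. The sketch below computes this ordinary path norm (not the basis-path norm proposed above) and checks its invariance under the ReLU rescaling; the layer sizes and weights are illustrative.

```python
import numpy as np

def l1_path_norm(Ws):
    """Sum over all input->output paths of the product of |weights| along the path.
    For stacked dense layers without biases this equals 1^T |W_L| ... |W_1| 1."""
    P = np.abs(Ws[0])
    for W in Ws[1:]:
        P = np.abs(W) @ P
    return float(P.sum())

rng = np.random.default_rng(3)
sizes = [5, 7, 7, 2]                                   # illustrative widths
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
print("l1 path norm:", l1_path_norm(Ws))

# Rescaling invariance of ReLU: scaling a hidden layer's incoming weights by c
# and its outgoing weights by 1/c leaves the network function and the path norm unchanged.
c = 3.0
Ws_rescaled = [Ws[0] * c, Ws[1] / c, Ws[2]]
print("after rescaling:", l1_path_norm(Ws_rescaled))
```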

We will show how, in the regime of deep learning, the characterization of generalization becomes different from the conventional one, and propose alternative ways to approach it. Moving from theory to more practical perspectives, we will show two different applications of deep learning. One originates from a real-world problem of automatic geophysical feature detection from seismic recordings to help oil & gas exploration; the other is motivated by computational neuroscientific modeling and study of the human auditory system. More specifically, we will show how deep learning can be adapted to work well with the unique structures associated with problems from different domains. Lastly, we move to the computer-system design perspective and present our efforts in building better deep learning systems that allow efficient and flexible computation in both academic and industrial settings.

Finite Element