
class: middle, center, title-slide

Deep Learning

Lecture: Recurrent neural networks (optional)



Prof. Gilles Louppe
g.louppe@uliege.be


Today

How to make sense of sequential data?

  • Recurrent neural networks
  • Applications
  • Beyond sequences

class: middle

Many real-world problems require processing a signal with a sequence structure.

  • Sequence classification:
    • sentiment analysis in text
    • activity/action recognition in videos
    • DNA sequence classification
  • Sequence synthesis:
    • text synthesis
    • music synthesis
    • motion synthesis
  • Sequence-to-sequence translation:
    • speech recognition
    • text translation
    • part-of-speech tagging

.footnote[Credits: Francois Fleuret, 14x050/EE559 Deep Learning, EPFL.]

???

Draw all 3 setups.


class: middle

Given a set $\mathcal{X}$, if $S(\mathcal{X})$ denotes the set of sequences of elements from $\mathcal{X}$, $$S(\mathcal{X}) = \cup_{t=1}^\infty \mathcal{X}^t,$$ then we formally define:

.grid.center[ .kol-1-2.bold[Sequence classification] .kol-1-2[$f: S(\mathcal{X}) \to \bigtriangleup^C$] ] .grid.center[ .kol-1-2.bold[Sequence synthesis] .kol-1-2[$f: \mathbb{R}^d \to S(\mathcal{X})$] ] .grid.center[ .kol-1-2.bold[Sequence-to-sequence translation] .kol-1-2[$f: S(\mathcal{X}) \to S(\mathcal{Y})$] ]


In the rest of the slides, we consider only time-indexed signals, although the framework generalizes to arbitrary sequences.

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]


class: middle

Recurrent neural networks


class: middle

When the input is a sequence $\mathbf{x} \in S(\mathbb{R}^p)$ of variable length $T(\mathbf{x})$, the historical approach is to use a recurrent model which maintains a recurrent state $\mathbf{h}_t \in \mathbb{R}^q$ updated at each time step $t$.

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]


class: middle

Formally, for $t=1, ..., T(\mathbf{x})$, $$ \mathbf{h}_t = \phi(\mathbf{x}_t, \mathbf{h}_{t-1};\theta), $$ where $\phi : \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R}^q$ and $\mathbf{h}_0 \in \mathbb{R}^q$.

Predictions can be computed at any time step $t$ from the recurrent state, $$y_t = \psi(\mathbf{h}_t;\theta),$$ with $\psi : \mathbb{R}^q \to \mathbb{R}^C$.

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
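
As a minimal sketch, this recurrence is just a Python loop over the time steps. Here $\phi$ and $\psi$ are taken to be single non-linear and linear layers, which is an assumption (the slides leave them unspecified):

```python
import torch
import torch.nn as nn

class RecurrentModel(nn.Module):
    """Generic recurrent model: h_t = phi(x_t, h_{t-1}), y_t = psi(h_t)."""
    def __init__(self, p, q, C):
        super().__init__()
        self.phi = nn.Linear(p + q, q)   # update network (a single layer here, by assumption)
        self.psi = nn.Linear(q, C)       # readout network
        self.h0 = nn.Parameter(torch.zeros(q))

    def forward(self, x):                # x: (T, p), one variable-length sequence
        h = self.h0
        for x_t in x:                    # t = 1, ..., T(x)
            h = torch.tanh(self.phi(torch.cat([x_t, h])))
        return self.psi(h)               # prediction from the last recurrent state
```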

???

Draw the next diagrams on the black board.


class: middle

.width-93[]


count: false class: middle

.width-100[]


count: false class: middle

.width-100[]


count: false class: middle

.width-100[]


class: middle

Even though the number of steps $T$ depends on $\mathbf{x}$, this is a standard computational graph, and automatic differentiation can deal with it as usual.

In the case of recurrent neural networks, this is referred to as backpropagation through time.


class: middle

.width-100[]


Elman networks

Elman networks define $\phi$ and $\psi$ as primitive neuron units, such as logistic regression units, $$ \begin{aligned} \mathbf{h}_t &= \sigma_h\left( \mathbf{W}^T_{xh} \mathbf{x}_t + \mathbf{W}^T_{hh} \mathbf{h}_{t-1} + \mathbf{b}_h \right) \\ y_t &= \sigma_y\left( \mathbf{W}_y^T \mathbf{h}_t + b_y \right) \end{aligned} $$ where $\mathbf{W}_{xh} \in \mathbb{R}^{p\times q}$, $\mathbf{W}_{hh} \in \mathbb{R}^{q\times q}$, $\mathbf{W}_{y} \in \mathbb{R}^{q}$, $\mathbf{b}_{h} \in \mathbb{R}^{q}$, $b_{y} \in \mathbb{R}$, $\mathbf{h}_0=0$, and where $\sigma_h$ and $\sigma_y$ are non-linear activation functions, such as the sigmoid function, $\text{tanh}$ or $\text{ReLU}$.
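
A direct transcription of these equations, as a sketch (the scalar sigmoid readout and the $\tanh$ recurrence follow the slide; the initialization scale is arbitrary):

```python
import torch
import torch.nn as nn

class ElmanNetwork(nn.Module):
    def __init__(self, p, q):
        super().__init__()
        self.W_xh = nn.Parameter(torch.randn(p, q) * 0.01)
        self.W_hh = nn.Parameter(torch.randn(q, q) * 0.01)
        self.b_h = nn.Parameter(torch.zeros(q))
        self.W_y = nn.Parameter(torch.randn(q, 1) * 0.01)
        self.b_y = nn.Parameter(torch.zeros(1))

    def forward(self, x):                      # x: (T, p)
        h = torch.zeros(self.W_hh.shape[0])    # h_0 = 0
        for x_t in x:
            h = torch.tanh(x_t @ self.W_xh + h @ self.W_hh + self.b_h)
        return torch.sigmoid(h @ self.W_y + self.b_y)   # y_T
```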


class: middle

Benchmark example

Learn to recognize variable-length sequences that are palindromes. For training, we use sequences of random sizes, from $1$ to $10$.

.grid[ .kol-1-4[] .kol-1-4[ $\mathbf{x}$

$(1,2,3,2,1)$
$(2,1,2)$
$(3,4,1,2)$
$(0)$
$(1,4)$ ] .kol-1-4.center[ $y$

$1$
$1$
$0$
$1$
$0$ ] ]
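
A possible generator for this training data, as a sketch (the symbol vocabulary and the 50/50 class balance are assumptions; the slide only fixes the length range):

```python
import random

def make_example(max_len=10, vocab=5):
    # Roughly half of the examples are constructed as palindromes, half at random.
    T = random.randint(1, max_len)
    if random.random() < 0.5:
        half = [random.randrange(vocab) for _ in range((T + 1) // 2)]
        x = half + half[:T // 2][::-1]   # mirror the first half
    else:
        x = [random.randrange(vocab) for _ in range(T)]
    y = int(x == x[::-1])                # label: 1 if the sequence is a palindrome
    return x, y
```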


class: middle

.center.width-80[]


Bidirectional RNNs

Computing the recurrent states forward in time does not make use of future input values $\mathbf{x}_{t+1:T}$, even though they are known.

  • RNNs can be made bidirectional by consuming the sequence in both directions.
  • Effectively, this amounts to running the same (single-direction) RNN twice:
    • once over the original sequence $\mathbf{x}_{1:T}$,
    • once over the reversed sequence $\mathbf{x}_{T:1}$.
  • The recurrent states of the bidirectional RNN are then the concatenation of the two resulting sequences of recurrent states, as sketched below.
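
A minimal sketch with PyTorch's `nn.RNN`, whose `bidirectional=True` flag implements this construction (with separate parameters for each direction; sizes are placeholders):

```python
import torch
import torch.nn as nn

# Bidirectional RNN over sequences of 8-dimensional inputs.
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)

x = torch.randn(1, 10, 8)        # one sequence of length T=10
h, _ = rnn(x)
print(h.shape)                   # (1, 10, 32): forward and backward states concatenated
```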

class: middle

.center.width-80[]


Stacked RNNs

Recurrent networks can be viewed as layers producing sequences $\mathbf{h}_{1:T}^\ell$ of activations.

As for dense layers, recurrent layers can be composed in series to form a .bold[stack] of recurrent networks.


.center.width-100[]
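
With PyTorch's built-in recurrent layers, such a stack is obtained by setting `num_layers` (a sketch; the sizes are placeholders):

```python
import torch
import torch.nn as nn

# Three recurrent layers in series: layer l consumes the state sequence of layer l-1.
rnn = nn.RNN(input_size=8, hidden_size=16, num_layers=3, batch_first=True)

x = torch.randn(1, 10, 8)        # one sequence of length T=10
h, h_T = rnn(x)
print(h.shape)                   # (1, 10, 16): states of the last layer
print(h_T.shape)                 # (3, 1, 16): final state of each of the 3 layers
```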


class: middle

.center.width-80[]


Gating

When unfolded through time, the graph of computation of a recurrent network can grow very deep, and training involves dealing with vanishing gradients.

  • RNN cells should include a pass-through, or additive paths, so that the recurrent state does not go repeatedly through a squashing non-linearity.
  • This is analogous to the skip connections in ResNets.




.center.width-70[]

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]


class: middle

For instance, the recurrent state update can be a per-component weighted average of its previous value $\mathbf{h}_{t-1}$ and a full update $\bar{\mathbf{h}}_t$, with the weighting $\mathbf{z}_t$ depending on the input and the recurrent state, hence acting as a forget gate.

Formally, $$ \begin{aligned} \bar{\mathbf{h}}_t &= \phi(\mathbf{x}_t, \mathbf{h}_{t-1};\theta) \\ \mathbf{z}_t &= f(\mathbf{x}_t, \mathbf{h}_{t-1};\theta) \\ \mathbf{h}_t &= \mathbf{z}_t \odot \mathbf{h}_{t-1} + (1-\mathbf{z}_t) \odot \bar{\mathbf{h}}_t. \end{aligned} $$

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
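
A sketch of this gated update, where both the candidate state and the gate are produced by single linear layers of $[\mathbf{x}_t, \mathbf{h}_{t-1}]$ (an assumption; the slide leaves $\phi$ and $f$ unspecified):

```python
import torch
import torch.nn as nn

class GatedCell(nn.Module):
    def __init__(self, p, q):
        super().__init__()
        self.phi = nn.Linear(p + q, q)   # full update \bar{h}_t
        self.f = nn.Linear(p + q, q)     # forget gate z_t

    def forward(self, x_t, h):
        xh = torch.cat([x_t, h], dim=-1)
        h_bar = torch.tanh(self.phi(xh))
        z = torch.sigmoid(self.f(xh))
        return z * h + (1 - z) * h_bar   # per-component weighted average
```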


class: middle

.center.width-80[]


class: middle

LSTM

The long short-term memory model (LSTM; Hochreiter and Schmidhuber, 1997) is an instance of the previous gated recurrent cell, with the following changes:

  • The recurrent state is split into two parts $\mathbf{c}_t$ and $\mathbf{h}_t$, where
    • $\mathbf{c}_t$ is the cell state and
    • $\mathbf{h}_t$ is the output state.
  • A forget gate $\mathbf{f}_t$ selects the cell state information to erase.
  • An input gate $\mathbf{i}_t$ selects the cell state information to update.
  • An output gate $\mathbf{o}_t$ selects the cell state information to output.



.center.width-80[]

$$ \begin{aligned} \mathbf{f}_t &= \sigma\left( \mathbf{W}_f^T [ \mathbf{h}_{t-1}, \mathbf{x}_t ] + \mathbf{b}_f \right) \end{aligned} $$




.center.width-80[]

$$ \begin{aligned} \mathbf{i}_t &= \sigma\left( \mathbf{W}_i^T [ \mathbf{h}_{t-1}, \mathbf{x}_t ] + \mathbf{b}_i \right) \\\ \bar{\mathbf{c}}_t &= \text{tanh}\left( \mathbf{W}_c^T [ \mathbf{h}_{t-1}, \mathbf{x}_t ] + \mathbf{b}_c \right) \end{aligned} $$




.center.width-80[]

$$ \begin{aligned} \mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \bar{\mathbf{c}}_t \end{aligned} $$




.center.width-80[]

$$ \begin{aligned} \mathbf{o}_t &= \sigma\left( \mathbf{W}_o^T [ \mathbf{h}_{t-1}, \mathbf{x}_t ] + \mathbf{b}_o \right) \\\ \mathbf{h}_t &= \mathbf{o}_t \odot \text{tanh}(\mathbf{c}_{t}) \end{aligned} $$
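
Putting the four equations together, the cell can be sketched as follows (PyTorch's `nn.LSTMCell` implements the same equations, with the weights for $\mathbf{h}_{t-1}$ and $\mathbf{x}_t$ kept separate):

```python
import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, p, q):
        super().__init__()
        self.W_f = nn.Linear(p + q, q)   # forget gate
        self.W_i = nn.Linear(p + q, q)   # input gate
        self.W_c = nn.Linear(p + q, q)   # candidate cell state
        self.W_o = nn.Linear(p + q, q)   # output gate

    def forward(self, x_t, h, c):
        xh = torch.cat([h, x_t], dim=-1)
        f = torch.sigmoid(self.W_f(xh))
        i = torch.sigmoid(self.W_i(xh))
        c_bar = torch.tanh(self.W_c(xh))
        c = f * c + i * c_bar            # erase, then write into the cell state
        o = torch.sigmoid(self.W_o(xh))
        h = o * torch.tanh(c)            # expose part of the cell state as output
        return h, c
```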


class: middle

.center.width-80[]


class: middle

GRU

The gated recurrent unit (GRU; Cho et al, 2014) is another gated recurrent cell. It uses two gates instead of three: an update gate $\mathbf{z}_t$ and a reset gate $\mathbf{r}_t$.

???

  • GRUs perform similarly to LSTMs for language or speech modeling, but with fewer parameters.
  • However, LSTMs remain strictly stronger than GRUs.

class: middle

.center.width-70[]

$$ \begin{aligned} \mathbf{z}_t &= \sigma\left( \mathbf{W}_z^T [ \mathbf{h}_{t-1}, \mathbf{x}_t ] + \mathbf{b}_z \right) \\\ \mathbf{r}_t &= \sigma\left( \mathbf{W}_r^T [ \mathbf{h}_{t-1}, \mathbf{x}_t ] + \mathbf{b}_r \right) \\\ \bar{\mathbf{h}}_t &= \text{tanh}\left( \mathbf{W}_h^T [ \mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t ] + \mathbf{b}_h \right) \\\ \mathbf{h}_t &= (1-\mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \bar{\mathbf{h}}_t \end{aligned} $$
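
A direct transcription of these equations, as a sketch (PyTorch's `nn.GRUCell` implements a close variant with slightly different gating conventions):

```python
import torch
import torch.nn as nn

class GRUCell(nn.Module):
    def __init__(self, p, q):
        super().__init__()
        self.W_z = nn.Linear(p + q, q)   # update gate
        self.W_r = nn.Linear(p + q, q)   # reset gate
        self.W_h = nn.Linear(p + q, q)   # candidate state

    def forward(self, x_t, h):
        xh = torch.cat([h, x_t], dim=-1)
        z = torch.sigmoid(self.W_z(xh))
        r = torch.sigmoid(self.W_r(xh))
        h_bar = torch.tanh(self.W_h(torch.cat([r * h, x_t], dim=-1)))
        return (1 - z) * h + z * h_bar   # gated interpolation, following the slide
```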


class: middle

.center.width-80[]


class: middle

.center.width-80[]

.center[The models do not generalize to sequences longer than those in the training set!]


class: middle, center

(demo)


Exploding gradients

Gated units prevent gradients from vanishing, but not from exploding.


.center.width-95[![](figures/archives-lec-rnn/loss-exploding.png)]

.footnote[Credits: pat-coady.]


class: middle

.center.width-60[]

The standard strategy to solve this issue is gradient norm clipping, which rescales the norm of the gradient to a fixed threshold $\delta$ when it is above: $$\tilde{\nabla} f = \frac{\nabla f}{||\nabla f||} \min(||\nabla f||, \delta).$$

.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
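
In PyTorch, this is typically a single call between the backward pass and the optimizer step, as sketched below (the model, loss and threshold $\delta=1$ are placeholders):

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16)      # any recurrent model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(10, 1, 8)
h, _ = model(x)
loss = h.pow(2).mean()                            # dummy loss, for illustration only

optimizer.zero_grad()
loss.backward()
# Rescale the gradient norm to the threshold delta when it exceeds it.
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```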


Orthogonal initialization

Let us consider a simplified RNN, with no inputs, no bias, an identity activation function $\sigma$ (as in the positive part of a ReLU) and the initial recurrent state $\mathbf{h}_0$ set to the identity matrix.

We have, $$ \begin{aligned} \mathbf{h}_t &= \sigma\left( \mathbf{W}^T_{xh} \mathbf{x}_t + \mathbf{W}^T_{hh} \mathbf{h}_{t-1} + \mathbf{b}_h \right) \\ &= \mathbf{W}^T_{hh} \mathbf{h}_{t-1} \\ &= \mathbf{W}^T \mathbf{h}_{t-1}. \end{aligned} $$

For a sequence of length $n$, we get $$\mathbf{h}_n = \mathbf{W}(\mathbf{W}(\mathbf{W}(...(\mathbf{W}\mathbf{h}_0)...))) = \mathbf{W}^n\mathbf{h}_0 = \mathbf{W}^n I = \mathbf{W}^n.$$

Ideally, we would like $\mathbf{W}^n$ to neither vanish nor explode as $n$ increases.

???

Do on the iPad


class: middle

Fibonacci digression

The Fibonacci sequence is $$0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, ...$$ It grows fast! But how fast?


class: middle

In matrix form, the Fibonacci sequence is equivalently expressed as

$$ \begin{pmatrix} f_{k+2} \\\ f_{k+1} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\\ 1 & 0 \end{pmatrix}\begin{pmatrix} f_{k+1} \\\ f_{k} \end{pmatrix}. $$

With $\mathbf{f}_0 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$, we have $$\mathbf{f}_{k+1} = \mathbf{A} \mathbf{f}_{k} = \mathbf{A}^{k+1} \mathbf{f}_{0}.$$


class: middle

The matrix $\mathbf{A}$ can be diagonalized as $$\mathbf{A} = \mathbf{S} \Lambda \mathbf{S}^{-1},$$ where $$ \begin{aligned} \Lambda &= \begin{pmatrix} \varphi & 0 \\ 0 & -\varphi^{-1} \end{pmatrix}\\ \mathbf{S} &= \begin{pmatrix} \varphi & -\varphi^{-1} \\ 1 & 1 \end{pmatrix} \end{aligned}. $$

In particular, $$\mathbf{A}^n = \mathbf{S} \Lambda^n \mathbf{S}^{-1}.$$

Therefore, the Fibonacci sequence grows exponentially fast, at a rate given by the golden ratio $\varphi \approx 1.618$.
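
This can be checked numerically, e.g. with NumPy:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 0.0]])

print(np.linalg.eigvals(A))       # approx [ 1.618, -0.618 ]: phi and -1/phi

f = np.array([1.0, 0.0])          # f_0 = (f_1, f_0) = (1, 0)
for _ in range(10):
    f = A @ f                     # one step of the recurrence
print(f)                          # [ 89., 55. ]: (f_11, f_10)
```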

???

$\varphi = \frac{1+\sqrt{5}}{2}$


class: middle

Theorem

Let $\rho(\mathbf{A})$ be the spectral radius of the matrix $\mathbf{A}$, defined as $$\rho(\mathbf{A}) = \max\{ |\lambda_1|,...,|\lambda_d| \}.$$

We have:

  • if $\rho(\mathbf{A}) < 1$ then $\lim_{n\to\infty} ||\mathbf{A}^n|| = 0$ (= vanishing activations),
  • if $\rho(\mathbf{A}) > 1$ then $\lim_{n\to\infty} ||\mathbf{A}^n|| = \infty$ (= exploding activations).

class: middle

.center[

$\rho(\mathbf{A}) < 1$: $\mathbf{A}^n$ vanishes. ]

.footnote[Credits: Stephen Merety, Explaining and illustrating orthogonal initialization for recurrent neural networks, 2016.]


class: middle

.center[

$\rho(\mathbf{A}) > 1$: $\mathbf{A}^n$ explodes. ]

.footnote[Credits: Stephen Merety, Explaining and illustrating orthogonal initialization for recurrent neural networks, 2016.]


class: middle

Orthogonal initialization

If $\mathbf{A}$ is orthogonal, then it is diagonalizable (over $\mathbb{C}$) and all its eigenvalues have modulus $1$. In this case, the norm of $$\mathbf{A}^n = \mathbf{S} \Lambda^n \mathbf{S}^{-1}$$ remains bounded.

  • Therefore, initializing $\mathbf{W}$ as a random orthogonal matrix guarantees that the activations neither vanish nor explode.
  • In practice, a random orthogonal matrix can be obtained from the SVD or the QR factorization of a random matrix, as sketched below.
  • This initialization strategy is known as orthogonal initialization.
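
For instance, with PyTorch (a sketch; `nn.init.orthogonal_` fills a tensor with a random orthogonal matrix):

```python
import torch
import torch.nn as nn

q = 16
W = torch.empty(q, q)
nn.init.orthogonal_(W)               # random orthogonal recurrent weight matrix

# Check: W^T W = I, and repeated application preserves the norm of the state.
print(torch.allclose(W.T @ W, torch.eye(q), atol=1e-5))
h = torch.randn(q)
for _ in range(100):
    h = W @ h
print(h.norm())                      # stays equal to the initial norm
```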

class: middle

.center[

$\mathbf{A}$ is orthogonal. ]

.footnote[Credits: Stephen Merety, Explaining and illustrating orthogonal initialization for recurrent neural networks, 2016.]


class: middle

Exploding activations are also the reason why squashing non-linearities (such as $\text{tanh}$ or sigmoids) are preferred in RNNs: they prevent the recurrent state from exploding by upper-bounding $||\mathbf{h}_t||$.

(At least when running the network forward.)

???

https://github.com/nyu-dl/NLP_DL_Lecture_Note/blob/master/lecture_note.pdf


class: middle

Some applications


class: middle

Sentiment analysis

.center[ .width-100[]

Document-level modeling for sentiment analysis (= text classification),
with stacked, bidirectional and gated recurrent networks. ]

.footnote[Credits: Duyu Tang et al, Document Modeling with Gated Recurrent Neural Network for Sentiment Classification, 2015.]


class: middle

Language models

Language models model language as a Markov chain, in which sentences are sequences of words $\mathbf{w}_{1:T}$ drawn repeatedly from $p(\mathbf{w}_t | \mathbf{w}_{1:t-1}).$

This is an instance of sequence synthesis, for which predictions are computed at all time steps $t$. .center[ .width-60[] ]

.footnote[Credits: Alex Graves, Generating Sequences With Recurrent Neural Networks, 2013.]
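
A sketch of the corresponding sampling loop, with a hypothetical recurrent cell (e.g. an `nn.GRUCell`), word embedding, and readout layer producing next-word logits — none of these names come from the reference:

```python
import torch

def sample(cell, embed, readout, bos_token, eos_token, max_len=50):
    # Autoregressive synthesis: feed each sampled word back as the next input.
    h = torch.zeros(1, cell.hidden_size)
    w, words = bos_token, []
    for _ in range(max_len):
        h = cell(embed(torch.tensor([w])), h)         # update the recurrent state
        probs = torch.softmax(readout(h), dim=-1)     # p(w_t | w_{1:t-1})
        w = torch.multinomial(probs, 1).item()        # sample the next word
        if w == eos_token:
            break
        words.append(w)
    return words
```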


class: middle

.center[ .width-80[] ]

.footnote[Credits: Max Woolf, 2018.]


class: middle

Sequence synthesis

The same generative architecture applies to any kind of sequences. E.g., sketch-rnn-demo for sketches defined as sequences of strokes.

.center.width-80[]


exclude: true class: middle

Neural machine translation

.center[ .width-100[] ]

.footnote[Credits: Yonghui Wu et al, Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016.]


exclude: true class: middle

.center[ .width-100[] ]

.footnote[Credits: Yonghui Wu et al, Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016.]


class: middle

Text-to-speech synthesis

.width-80.center[]

.footnote[Image credits: Shen et al, 2017. arXiv:1712.05884.]


class: middle

Lip-reading in the wild

.width-80.center[]

.footnote[Image credits: Chung et al, 2016. arXiv:1611.05358.]


class: middle, black-slide

.center[

<iframe width="640" height="400" src="https://www.youtube.com/embed/5aogzAUPilE?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe> ]

class: middle, black-slide

Learning to control

A recurrent network playing Mario Kart.

.center[

<iframe width="640" height="400" src="https://www.youtube.com/embed/Ipi40cb_RsI?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe> ]

class: middle

Beyond sequences


class: middle

.center.circle.width-30[]

.italic[ An increasingly large number of .bold[people are defining the networks procedurally in a data-dependent way (with loops and conditionals)], allowing them to change dynamically as a function of the input data fed to them. It's really .bold[very much like a regular program, except it's parameterized]. ]

.pull-right[Yann LeCun (Director of AI Research, Facebook, 2018)]


Programs as neural nets

The topology of a recurrent network unrolled through time is not fixed, but dynamic. It depends on:

  • the input sequence and its size,
  • a graph construction algorithm that consumes input tokens in sequence to add layers to the graph of computation.

class: middle

This principle generalizes to:

  • arbitrarily structured data (e.g., sequences, trees, graphs),
  • arbitrary graph construction algorithms that traverse these structures (e.g., including for-loops or recursive calls).

class: middle

Neural message passing

.center[ .width-50[] .width-60[] ]

Even though the graph topology is dynamic, the unrolled computation is fully differentiable. The program is trainable.

.footnote[Credits: Henrion et al, 2017.]


class: middle

Graph neural network for object detection in point clouds

.center.width-100[]

.footnote[Credits: Shi and Rajkumar, Point-GNN, 2020.]


class: middle

Quantum chemistry with graph networks

.center.width-65[]

.footnote[Credits: Schutt et al, 2017.]

???

quantum-mechanical properties of molecular systems


class: middle

Learning to simulate physics with graph networks

.center.width-100[]

.footnote[Credits: Sanchez-Gonzalez et al, 2020.]


class: middle, black-slide

.center[

]

.footnote[Credits: Sanchez-Gonzalez et al, 2020.]


Neural computers

.center.width-55[]

.center[Any Turing machine can be simulated by a recurrent neural network
(Siegelmann and Sontag, 1995)]

???

This implies that usual programs can all be equivalently implemented as a neural network, in theory.


class: middle

.center.width-80[]

Networks can be coupled with memory storage to produce neural computers:

  • The controller processes the input sequence and interacts with the memory to generate the output.
  • The read and write operations attend to all the memory addresses.

???

  • A Turing machine is not very practical to implement useful programs.
  • However, we can simulate the general architecture of a computer in a neural network.

class: middle

.width-100[]

A differentiable neural computer being trained to store and recall dense binary numbers. Upper left: the input (red) and target (blue), as 5-bit words and a 1-bit interrupt signal. Upper right: the model's output.


class: end-slide, center count: false

The end.