
class: middle, center, title-slide

Deep Learning

Lecture 3: Automatic differentiation



Prof. Gilles Louppe
g.louppe@uliege.be


Today

  • Calculus
  • Automatic differentiation
  • Implementation
  • Beyond neural networks

class: middle

.center.circle.width-30[]

.italic[Implementing backpropagation by hand is like programming in assembly language. You will probably never do it, but it is important for having a mental model of how everything works.]

.pull-right[Roger Grosse]

???

Promise for today!


class: middle

.center.width-60[]

Motivation

  • Gradient-based training algorithms are the workhorse of deep learning.
  • Deriving gradients by hand is tedious and error-prone, and quickly becomes impractical for complex models.
  • Changes to the model require rederiving the gradient.

.footnote[Image credits: Visualizing optimization trajectory of neural nets, Logan Yang, 2020.]


class: middle

Programs as differentiable functions

A program is defined as a composition of primitive operations that we know how to differentiate individually.

```python
import jax.numpy as jnp
from jax import grad

def predict(params, inputs):
    # Forward pass through a stack of dense layers with tanh activations.
    for W, b in params:
        outputs = jnp.dot(inputs, W) + b
        inputs = jnp.tanh(outputs)
    return outputs

def loss_fun(params, inputs, targets):
    # Mean squared error between predictions and targets.
    preds = predict(params, inputs)
    return jnp.mean((preds - targets)**2)

grad_fun = grad(loss_fun)  # gradient of the loss with respect to params
```
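
For concreteness, a minimal usage sketch (the layer sizes, random data, and variable names below are made up for illustration): `grad` differentiates `loss_fun` with respect to its first argument, so the returned gradients mirror the nested structure of `params`.

```python
import numpy as np

# Hypothetical 3-4-2 network evaluated on a batch of 8 random examples.
rng = np.random.default_rng(0)
params = [(jnp.asarray(rng.normal(size=(3, 4))), jnp.zeros(4)),
          (jnp.asarray(rng.normal(size=(4, 2))), jnp.zeros(2))]
inputs = jnp.asarray(rng.normal(size=(8, 3)))
targets = jnp.asarray(rng.normal(size=(8, 2)))

grads = grad_fun(params, inputs, targets)  # same list-of-(W, b) structure as params
```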

class: middle

.center.width-60[]

Modern frameworks support higher-order derivatives.

```python
def tanh(x):
    y = jnp.exp(-2.0 * x)
    return (1.0 - y) / (1.0 + y)

fp = grad(tanh)
fpp = grad(grad(tanh))  # what sorcery is this?!
...
```

???

Will show a demo later on.


class: middle

Automatic differentiation

Automatic differentiation (AD) provides a family of techniques for evaluating the derivatives of a function specified by a computer program.

  • $\neq$ symbolic differentiation, which aims at identifying some human-readable expression of the derivative.
  • $\neq$ numerical differentiation (finite differences), which is subject to truncation and round-off errors; the sketch below compares the three.
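
As an illustration (a sketch, not part of the original slides), the three approaches can be compared on $\tanh$: the hand-derived expression $1 - \tanh^2(x)$ plays the role of symbolic differentiation, a central finite difference stands for numerical differentiation, and `jax.grad` for AD.

```python
import jax.numpy as jnp
from jax import grad

def f(x):
    return jnp.tanh(x)

x, h = 1.0, 1e-4
fd = (f(x + h) - f(x - h)) / (2 * h)  # numerical: central finite difference
ad = grad(f)(x)                       # automatic differentiation
exact = 1.0 - jnp.tanh(x) ** 2        # symbolic: hand-derived derivative of tanh

print(fd, ad, exact)  # all three are approximately 0.42 (exactly 1 - tanh(1)^2)
```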

class: middle

Calculus


Derivative

Let $f: \mathbb{R} \to \mathbb{R}$.

The derivative of $f$ is $$f'(x) = \frac{\partial f}{\partial x}(x) \triangleq \lim_{h \to 0} \frac{f(x + h) - f(x)}{h},$$ where

  • $f'(x)$ is the Lagrange notation,
  • $\frac{\partial f}{\partial x}(x)$ is the Leibniz notation.

class: middle, center

.width-80[]

The derivative $\frac{\partial f(x)}{\partial x}$ of $f$ represents its instantaneous rate of change at $x$.


Gradient

The gradient of $f : \mathbb{R}^n \to \mathbb{R}$ is $$\nabla f(\mathbf{x}) \triangleq \begin{bmatrix} \frac{\partial f}{\partial x_1}(\mathbf{x}) \\ \\ \vdots \\ \\ \frac{\partial f}{\partial x_n}(\mathbf{x}) \end{bmatrix} \in \mathbb{R}^n,$$ i.e., a vector that gathers the partial derivatives of $f$.

Applying the definition of the derivative coordinate-wise, we have $$\left[ \nabla f(\mathbf{x}) \right]_j = \frac{\partial f}{\partial x_j}(\mathbf{x}) = \lim_{h\to 0} \frac{f(\mathbf{x} + h\mathbf{e}_j) - f(\mathbf{x})}{h},$$ where $\mathbf{e}_j$ is the $j$-th basis vector.
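
As a small illustration (a hypothetical function, not from the slides), `jax.grad` returns exactly this vector of partial derivatives:

```python
import jax.numpy as jnp
from jax import grad

# f(x) = x_1^2 + 3 x_2, whose gradient is (2 x_1, 3).
def f(x):
    return x[0] ** 2 + 3.0 * x[1]

print(grad(f)(jnp.array([1.0, 2.0])))  # [2. 3.]
```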

???

Note how each coordinate-wise derivative is a directional derivative in the direction $\mathbf{e}_j$.


Jacobian

The Jacobian of $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ is $$\begin{aligned} J_\mathbf{f}(\mathbf{x}) = \frac{\partial \mathbf{f}}{\partial \mathbf{x}}(\mathbf{x}) &\triangleq \begin{bmatrix} \frac{\partial f_1}{\partial x_1}(\mathbf{x}) & \ldots & \frac{\partial f_1}{\partial x_n}(\mathbf{x})\\ \\ \vdots & & \vdots\\ \\ \frac{\partial f_m}{\partial x_1}(\mathbf{x}) & \ldots & \frac{\partial f_m}{\partial x_n}(\mathbf{x}) \end{bmatrix} \in \mathbb{R}^{m \times n} \\ &= \begin{bmatrix} \frac{\partial \mathbf{f}}{\partial x_1}(\mathbf{x}) & \ldots & \frac{\partial \mathbf{f}}{\partial x_n}(\mathbf{x}) \end{bmatrix} \\ &= \begin{bmatrix} \nabla f_1(\mathbf{x})^T \\ \\ \vdots \\ \\ \nabla f_m(\mathbf{x})^T \end{bmatrix} \\ \end{aligned}$$

For $m=1$, the Jacobian reduces to a single (wide) row, i.e., the transpose of the gradient $\nabla f(\mathbf{x})^T$.
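
For illustration (a sketch with a made-up function $\mathbf{f} : \mathbb{R}^3 \to \mathbb{R}^2$), JAX exposes both `jacfwd` and `jacrev`, which build the same $m \times n$ Jacobian in different ways:

```python
import jax.numpy as jnp
from jax import jacfwd, jacrev

def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
print(jacfwd(f)(x))  # 2x3 Jacobian, assembled column by column
print(jacrev(f)(x))  # same 2x3 Jacobian, assembled row by row
```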


class: middle

Automatic differentiation


Chain composition


.center.width-100[]


Let us assume a function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ that decomposes as a chain composition $$\mathbf{f} = \mathbf{f}_t \circ \mathbf{f}_{t-1} \circ \ldots \circ \mathbf{f}_1,$$ for functions $\mathbf{f}_k : \mathbb{R}^{n_{k-1}} \to \mathbb{R}^{n_k}$, for $k=1, \ldots, t$, with $n_0 = n$ and $n_t = m$. Writing $\mathbf{x}_0 \in \mathbb{R}^{n_0}$ for the input, the intermediate values are $\mathbf{x}_k = \mathbf{f}_k(\mathbf{x}_{k-1})$ and the output is $\mathbf{x}_t = \mathbf{f}(\mathbf{x}_0)$.


class: middle

By the chain rule, $$ \begin{aligned} \frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_0} &= \frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_{t-1}} \underbrace{\frac{\partial \mathbf{x}_{t-1}}{\partial \mathbf{x}_{0}}}_{\text{recursive case}} \\ \\ &= \frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_{t-1}} \frac{\partial \mathbf{x}_{t-1}}{\partial \mathbf{x}_{t-2}} \ldots \frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1} \frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0} \end{aligned} $$


class: middle

Forward accumulation

$$\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_0} = \frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_{t-1}} \left( \frac{\partial \mathbf{x}_{t-1}}{\partial \mathbf{x}_{t-2}} \left( \ldots \left( \frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1} \frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}\right) \ldots \right)\right)$$

Reverse accumulation

$$\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_0} = \left(\left( \ldots \left( \frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_{t-1}} \frac{\partial \mathbf{x}_{t-1}}{\partial \mathbf{x}_{t-2}} \right) \ldots \right) \frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1} \right) \frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0}$$
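
The two orderings give the same result but with very different intermediate shapes. A small NumPy sketch (sizes chosen arbitrarily to mimic a deep-learning setting with many inputs and a single output):

```python
import numpy as np

rng = np.random.default_rng(0)
J1 = rng.normal(size=(100, 1000))  # dx1/dx0
J2 = rng.normal(size=(100, 100))   # dx2/dx1
J3 = rng.normal(size=(1, 100))     # dx3/dx2

forward = J3 @ (J2 @ J1)  # right to left: intermediates of shape (100, 1000)
reverse = (J3 @ J2) @ J1  # left to right: intermediates of shape (1, 100)

print(np.allclose(forward, reverse))  # True: same Jacobian, very different cost
```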


class: middle

Complexity

The time complexity of the forward and reverse accumulations are $$\mathcal{O}\left( n_0 \sum_{k=1}^{t-1} n_k n_{k+1} \right) \quad \text{and} \quad \mathcal{O}\left( n_t \sum_{k=0}^{t-2} n_k n_{k+1} \right).$$

(Prove it!)


.success[If $n\_t \ll n\_0$ (which is typical in deep learning), then .bold[backward accumulation is cheaper]. And vice-versa.]

???

Prove it.


Multi-layer perceptron

Chain compositions can be generalized to feedforward neural networks of the form $$\mathbf{x}_k = \mathbf{f}_k(\mathbf{x}_{k-1}, \theta_{k})$$ for $k=1, \ldots, t$, and where $\theta_{k}$ are vectors of parameters and $\mathbf{x}_0 \in \mathbb{R}^{n_0}$ is given. In supervised learning, $\mathbf{f}_t$ usually corresponds to a scalar loss $\ell$, hence $n_t = 1$.



.center.width-100[]


class: middle, center

(whiteboard example)


AD on computer programs

Let $\mathbf{f}(\mathbf{x}_1, \ldots, \mathbf{x}_s)$ denote a generic function where

  • $\mathbf{x}_1, \ldots, \mathbf{x}_s$ are the input variables,
  • $\mathbf{f}(\mathbf{x}_1, \ldots, \mathbf{x}_s)$ is implemented by a computer program producing intermediate variables $(\mathbf{x}_{s+1}, \ldots, \mathbf{x}_t)$,
  • $t$ is the total number of variables, with $\mathbf{x}_t$ denoting the output variable,
  • $\mathbf{x}_k \in \mathbb{R}^{n_k}$, for $k=1, \ldots, t$.

The goal is to compute the Jacobians $\frac{\partial \mathbf{f}}{\partial \mathbf{x}_k} \in \mathbb{R}^{n_t \times n_k}$, for $k=1, \ldots, s$.


class: middle

Computer programs as computational graphs

A numerical algorithm is a succession of instructions of the form $$\forall k = s+1, \ldots, t, \quad \mathbf{x}_k = \mathbf{f}_k(\mathbf{x}_1, \ldots, \mathbf{x}_{k-1})$$ where $\mathbf{f}_k$ is a function which only depends on the previous variables.
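
As an illustrative example (not from the slides), the program below computes $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ with $s = 2$ input variables and $t = 5$ variables in total:

```python
import jax.numpy as jnp

def f(x1, x2):
    x3 = x1 * x2      # x3 = f3(x1, x2)
    x4 = jnp.sin(x1)  # x4 = f4(x1)
    x5 = x3 + x4      # x5 = f5(x3, x4), the output xt
    return x5
```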


class: middle

.center.width-100[]

This computation can be represented by a directed acyclic graph where

  • the nodes are the variables $\mathbf{x}_k$,
  • an edge connects $\mathbf{x}_i$ to $\mathbf{x}_k$ if $\mathbf{x}_i$ is an argument of $\mathbf{f}_k$.

The evaluation of $\mathbf{x}_t = \mathbf{f}(\mathbf{x}_1, \ldots, \mathbf{x}_s)$ thus corresponds to a forward traversal of this graph.


Forward mode

The forward mode of automatic differentiation consists in computing $$\frac{\partial \mathbf{x}_k}{\partial \mathbf{x}_1} \in \mathbb{R}^{n_k \times n_1}$$ for all variables $\mathbf{x}_k$, iteratively from $k=s+1$ to $k=t$.

Initialization

Set the Jacobians of the input nodes with $$ \begin{aligned} \frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_1} &= 1_{n_1 \times n_1} \\ \frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1} &= 0_{n_2 \times n_1} \\ \ldots \\ \frac{\partial \mathbf{x}_s}{\partial \mathbf{x}_1} &= 0_{n_s \times n_1} \end{aligned} $$


class: middle

Forward recursive update

For all $k=s+1, \ldots, t$, $$\frac{\partial \mathbf{x}_k}{\partial \mathbf{x}_1} = \sum_{l \in \text{parents}(k)}\left[ \frac{\partial \mathbf{x}_k}{\partial \mathbf{x}_l} \right] \times \frac{\partial \mathbf{x}_l}{\partial \mathbf{x}_1},$$ .grid[ .kol-1-2[ where

  • $\left[ \frac{\partial \mathbf{x}_k}{\partial \mathbf{x}_l} \right]$ denotes the on-the-fly computation of the Jacobian locally associated to the primitive $\mathbf{f}_k$,
  • $\frac{\partial \mathbf{x}_l}{\partial \mathbf{x}_1}$ is obtained from the previous iterations (in topological order). ] .kol-1-2[
    .width-100[]] ]
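
A hand-rolled sketch of this update on the example program $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ from before, differentiating with respect to $x_1$ (each `d` variable holds the corresponding $\frac{\partial \mathbf{x}_k}{\partial \mathbf{x}_1}$):

```python
import jax.numpy as jnp

# Inputs and their derivatives with respect to x1.
x1, x2 = 2.0, 3.0
dx1, dx2 = 1.0, 0.0   # initialization: dx1/dx1 = 1, dx2/dx1 = 0

x3 = x1 * x2
dx3 = x2 * dx1 + x1 * dx2   # local Jacobians of the product

x4 = jnp.sin(x1)
dx4 = jnp.cos(x1) * dx1     # local Jacobian of sin

x5 = x3 + x4
dx5 = dx3 + dx4             # local Jacobians of the sum

print(x5, dx5)  # f(2, 3) = 6 + sin(2), df/dx1 = x2 + cos(x1)
```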

class: middle, center

(whiteboard example)


class: middle

.alert[Forward mode automatic differentiation must be repeated for each input variable $\mathbf{x}\_1, \ldots, \mathbf{x}\_s$. When the number of inputs $s$ is large, this is prohibitive.]

.success[However, the cost in terms of memory is limited since temporary variables can be freed as soon as their child nodes have all been computed.]


Backward mode

Instead of evaluating the Jacobians $\frac{\partial \mathbf{x}_k}{\partial \mathbf{x}_1} \in \mathbb{R}^{n_k \times n_1}$ for $k=s+1, \ldots, t$, the reverse mode of automatic differentiation consists in computing $$\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k} \in \mathbb{R}^{n_t \times n_k}$$ recursively from $k=t$ down to $k=1$.

Initialization

Set the Jacobian of the output node to $$\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_t} = 1_{n_t \times n_t}.$$


class: middle

Backward recursive update

For all $k=t-1, \ldots, 1$, $$\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k} = \sum_{m \in \text{children}(k)} \frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_m} \times \left[ \frac{\partial \mathbf{x}_m}{\partial \mathbf{x}_k} \right]$$ .grid[ .kol-1-2[ where

  • $\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_m}$ is obtained from previous iterations (in reverse topological order) and is known as the adjoint,
  • $\left[ \frac{\partial \mathbf{x}_m}{\partial \mathbf{x}_k} \right]$ denotes the on-the-fly computation of the Jacobian locally associated to the primitive $\mathbf{f}_m$. ] .kol-1-2[
    .center.width-100[]] ]
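
A matching hand-rolled sketch of reverse mode on the same example, where each `b` variable holds the adjoint $\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}$:

```python
import jax.numpy as jnp

# Forward pass: evaluate and store the intermediate variables.
x1, x2 = 2.0, 3.0
x3 = x1 * x2
x4 = jnp.sin(x1)
x5 = x3 + x4

# Backward pass, in reverse topological order.
bx5 = 1.0                           # initialization: dx5/dx5 = 1
bx3 = bx5 * 1.0                     # x5 = x3 + x4
bx4 = bx5 * 1.0
bx1 = bx3 * x2 + bx4 * jnp.cos(x1)  # x1 has two children, x3 and x4
bx2 = bx3 * x1

print(bx1, bx2)  # df/dx1 = x2 + cos(x1), df/dx2 = x1
```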

class: middle, center

(whiteboard example)


class: middle

.success[The advantage of backward mode automatic differentiation is that a single traversal of the graph is enough to compute all the Jacobians $\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}$.]

.alert[However, the cost in terms of memory is significant since all the temporary variables computed during the forward pass must be kept in memory.]


class: middle

Implementations


class: middle

.center[ .width-50[]

.width-50[]

.width-30[] ]


class: middle

Primitives

Most automatic differentiation frameworks are built around a collection of composable primitive operations.

.center.width-80[]


class: middle

Composing primitives

Primitive functions are composed into a graph that describes the computation. The computational graph is either built

  • ahead of time, from the abstract syntax tree of the program or using a dedicated API (e.g., TensorFlow 1), or
  • just in time, by tracing the program execution (e.g., TensorFlow Eager, JAX, PyTorch), as sketched below.
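
For instance (a sketch; the exact printed form depends on the JAX version), `jax.make_jaxpr` exposes the graph obtained by tracing a Python function:

```python
import jax
import jax.numpy as jnp

def f(x1, x2):
    return x1 * x2 + jnp.sin(x1)

# Tracing records the sequence of primitives (mul, sin, add) applied
# to abstract values, yielding the computational graph as a jaxpr.
print(jax.make_jaxpr(f)(2.0, 3.0))
```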

class: middle

.center.width-35[]

VJPs

In the situation above, when $\mathbf{x}_t \in \mathbb{R}$, the backward recursive update reduces to $$\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k} = \underbrace{\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_m}}_{1 \times n_m} \underbrace{\left[ \frac{\partial \mathbf{x}_m}{\partial \mathbf{x}_k} \right]}_{n_m \times n_k}$$

  • Therefore, each primitive only needs to define its vector-Jacobian product (VJP). The Jacobian $\left[ \frac{\partial \mathbf{x}_m}{\partial \mathbf{x}_k} \right]$ is never explicitly built: it is usually simpler, faster, and more memory-efficient to compute the VJP directly.
  • Most reverse mode AD systems compose VJPs backward to compute $\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_1}$, as in the sketch below.
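
In JAX, this is exposed as `jax.vjp`, illustrated on a made-up $\mathbf{f} : \mathbb{R}^3 \to \mathbb{R}^2$ (a sketch, not part of the original slides):

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
y, vjp_fn = jax.vjp(f, x)          # forward pass + a closure for the backward pass
cotangent = jnp.array([1.0, 0.0])  # selects the first output, i.e. df1/dx
print(vjp_fn(cotangent))           # gradient of f1: (2., 1., 0.)
```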

class: middle

JVPs

Similarly, when $n_1 = 1$, the forward recursive update $$\frac{\partial \mathbf{x}_k}{\partial \mathbf{x}_1} = \underbrace{\left[ \frac{\partial \mathbf{x}_k}{\partial \mathbf{x}_l} \right]}_{n_k \times n_l} \underbrace{\frac{\partial \mathbf{x}_l}{\partial \mathbf{x}_1}}_{n_l \times 1}$$ is usually implemented in terms of Jacobian-vector products (JVP) locally defined at each primitive.
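
The forward-mode counterpart in JAX is `jax.jvp`, which evaluates the function and its directional derivative along a tangent vector in a single sweep (same made-up $\mathbf{f}$ as in the previous sketch):

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
v = jnp.array([1.0, 0.0, 0.0])  # tangent: the direction e_1
y, dy = jax.jvp(f, (x,), (v,))  # primal output and df/dx1 in one sweep
print(dy)                       # [2., 0.]
```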


class: middle

Higher-order derivatives

```python
def tanh(x):
    y = jnp.exp(-2.0 * x)
    return (1.0 - y) / (1.0 + y)

fp = grad(tanh)
fpp = grad(grad(tanh))    # what sorcery is this?!
...
```

.alert[The backward pass is itself a composition of primitives. Its execution can be traced, and reverse mode AD can run on its computational graph!]
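
As a quick sanity check (a sketch using the standard identities $\tanh' = 1 - \tanh^2$ and $\tanh'' = -2\tanh\,(1 - \tanh^2)$):

```python
import jax.numpy as jnp
from jax import grad

def tanh(x):
    y = jnp.exp(-2.0 * x)
    return (1.0 - y) / (1.0 + y)

fp = grad(tanh)
fpp = grad(grad(tanh))

x = 1.0
t = jnp.tanh(x)
print(fp(x), 1.0 - t ** 2)                # both ~0.4200
print(fpp(x), -2.0 * t * (1.0 - t ** 2))  # both ~-0.6397
```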


class: middle, center

(demo)


class: middle

AD beyond neural networks


class: middle, center, black-slide

<iframe width="600" height="450" src="https://www.youtube.com/embed/sq2gPzlrM0g?start=1240" frameborder="0" allowfullscreen></iframe>

You should be using automatic differentiation (Ryan Adams, 2016)


class: middle, center, black-slide

<iframe width="600" height="450" src="https://www.youtube.com/embed/YuVdk1b0TVw" frameborder="0" allowfullscreen></iframe>

Differentiable simulation for system identification and visuomotor control
(Murthy Jatavallabhula et al, 2021)


class: middle

.center[

.width-100[]

Optimizing a wing (Sam Greydanus, 2020)

[Run in browser] ]


class: middle, center

.width-75[]

... and plenty of other applications! (See this thread)


Summary

  • Automatic differentiation is one of the keys that enabled the deep learning revolution.
  • Backward mode automatic differentiation is more efficient when the function has more inputs than outputs.
  • Applications of AD go beyond deep learning.

class: end-slide, center
count: false

The end.


count: false

References

Slides from this lecture have been largely adapted from: