
class: middle, center, title-slide

Deep Learning

Lecture 10: Uncertainty



Prof. Gilles Louppe
g.louppe@uliege.be

???

R: Code the GMM example
R: Code the NF with coupling layers and visualize the transformations


class: middle

.center.width-60[]


class: middle

.center.circle.width-30[]

.italic["Every time a scientific paper presents a bit of data, it's accompanied by an .bold[error bar] – a quiet but insistent reminder that no knowledge is complete or perfect. It's a .bold[calibration of how much we trust what we think we know]."]

.pull-right[Carl Sagan]

???

Knowledge is an artefact. It is a mental construct.

Uncertainty is how much we trust this construct.


Today

How to estimate uncertainty with and of neural networks?

  • Uncertainty
  • Aleatoric uncertainty
  • Epistemic uncertainty

class: middle

Uncertainty


class: middle

Uncertainty refers to situations where there is .bold[imperfect or unknown information]. It can arise in predictions of future events, in physical measurements, or in situations where information is unknown.

Accounting for uncertainty is necessary for making optimal decisions. Not accounting for uncertainty can lead to suboptimal, wrong, or even catastrophic decisions.


class: middle

.italic[Case 1]. The first assisted-driving fatality, in May 2016: the perception system mistook the trailer's white side for the bright sky.

.grid[ .kol-2-3[.center.width-100[]] .kol-1-3[.center.width-100[]] ]

.footnote[Credits: Kendall and Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, 2017.]


class: middle, center

.center[

]

class: middle

.center.width-60[]

.italic[Case 2]. An image classification system erroneously identifies two African Americans as gorillas, raising concerns of racial discrimination.

.footnote[Credits: Kendall and Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, 2017.]


class: middle

.alert[The systems that made these errors were likely confident in their predictions. They did not account for uncertainty.]


class: middle

Aleatoric uncertainty


class: middle

Aleatoric uncertainty refers to the uncertainty arising from the inherent stochasticity of the true data generating process. This uncertainty .bold[cannot be reduced] with more data.

A common example is observational noise due to the limitations of the measurement devices. Collecting more data will not reduce the noise.


class: middle

Assumptions about the data generating process can help in distinguishing between different types of aleatoric uncertainty:

  • Homoscedastic uncertainty, which is constant across the input space.
  • Heteroscedastic uncertainty, which varies across the input space.

.center.width-90[![](figures/lec10/homo-vs-hetero.png)]

Neural density estimation

Consider training data $(\mathbf{x}, y) \sim p(\mathbf{x}, y)$, with

  • $\mathbf{x} \in \mathbb{R}^p$,
  • $y \in \mathbb{R}$.

We do not wish to learn a function $\hat{y} = f(\mathbf{x})$, which would only produce point estimates.

Instead we want to learn the full conditional density $$p(y|\mathbf{x}).$$


class: middle

NN with Gaussian output layer

We can model aleatoric uncertainty in the output by modelling the conditional distribution as a Gaussian distribution, $$p(y|\mathbf{x}) = \mathcal{N}(y; \mu(\mathbf{x}), \sigma^2(\mathbf{x})),$$ where $\mu(\mathbf{x})$ and $\sigma^2(\mathbf{x})$ are parametric functions to be learned, such as neural networks.

Note: The Gaussian distribution is a modelling choice. Other parametric distributions can be used.


class: middle

.center.width-80[]

.center[Case 1: Homoscedastic aleatoric uncertainty]


class: middle

We have, $$\begin{aligned} &\arg \max_{\theta,\sigma^2} p(\mathbf{d}|\theta,\sigma^2) \\ &= \arg \max_{\theta,\sigma^2} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} p(y_i|\mathbf{x}_i, \theta,\sigma^2) \\ &= \arg \max_{\theta,\sigma^2} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} \frac{1}{\sqrt{2\pi} \sigma} \exp\left(-\frac{(y_i-\mu(\mathbf{x}_i))^2}{2\sigma^2}\right) \\ &= \arg \min_{\theta,\sigma^2} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \frac{(y_i-\mu(\mathbf{x}_i))^2}{2\sigma^2} + \log(\sigma) + C \end{aligned}$$

.question[What if $\sigma^2$ were fixed?]


class: middle

.center.width-80[]

.center[Case 2: Heteroscedastic aleatoric uncertainty]


class: middle

Same as for the homoscedastic case, except that $\sigma^2$ is now a function of $\mathbf{x}_i$: $$\begin{aligned} &\arg \max_{\theta} p(\mathbf{d}|\theta) \\ &= \arg \max_{\theta} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} p(y_i|\mathbf{x}_i, \theta) \\ &= \arg \max_{\theta} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} \frac{1}{\sqrt{2\pi} \sigma(\mathbf{x}_i)} \exp\left(-\frac{(y_i-\mu(\mathbf{x}_i))^2}{2\sigma^2(\mathbf{x}_i)}\right) \\ &= \arg \min_{\theta} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \frac{(y_i-\mu(\mathbf{x}_i))^2}{2\sigma^2(\mathbf{x}_i)} + \log(\sigma(\mathbf{x}_i)) + C \end{aligned}$$

.question[What is the purpose of $2\sigma^2(\mathbf{x}_i)$? What about $\log(\sigma(\mathbf{x}_i))$?]

???

Take care of properly parametrizing $\sigma^2(\mathbf{x}_i)$ to ensure that it is positive.
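
Below is a minimal PyTorch sketch of this heteroscedastic setup, written for these notes rather than taken from the course materials: a small network outputs both $\mu(\mathbf{x})$ and $\sigma^2(\mathbf{x})$ (exponentiating a log-variance head keeps the variance positive), and training minimizes the negative log-likelihood derived above. The architecture and constants are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HeteroscedasticNet(nn.Module):
    """Predicts both the mean and the variance of p(y|x)."""
    def __init__(self, in_features=1, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, 1)
        self.log_var_head = nn.Linear(hidden, 1)   # unconstrained output

    def forward(self, x):
        h = self.body(x)
        mu = self.mu_head(h)
        var = torch.exp(self.log_var_head(h))      # exp(.) ensures sigma^2 > 0
        return mu, var

def gaussian_nll(mu, var, y):
    # (y - mu)^2 / (2 sigma^2) + log(sigma), up to an additive constant
    return ((y - mu) ** 2 / (2 * var) + 0.5 * torch.log(var)).mean()

# One gradient step on toy data.
net = HeteroscedasticNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
x = torch.rand(128, 1)
y = x + 0.3 * torch.sin(4 * torch.pi * x) + 0.05 * torch.randn(128, 1)
mu, var = net(x)
loss = gaussian_nll(mu, var, y)
opt.zero_grad()
loss.backward()
opt.step()
```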


class: middle

Modelling $p(y|\mathbf{x})$ as a unimodal (Gaussian) distribution can be inadequate since the conditional distribution may be .bold[multimodal].

???

Illustrate on the blackboard.


class: middle

Gaussian mixture model

A Gaussian mixture model (GMM) defines instead $p(y|\mathbf{x})$ as a mixture of $K$ Gaussian components, $$p(y|\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(y;\mu_k, \sigma_k^2),$$ where $0 \leq \pi_k \leq 1$ for all $k$ and $\sum_{k=1}^K \pi_k = 1$.

.center.width-60[]


class: middle

A .bold[mixture density network] (MDN) is a neural network implementation of the Gaussian mixture model.

.center.width-100[]
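
As a concrete sketch (not taken from the course demos), the MDN head below maps an input to the logits, means, and standard deviations of a $K$-component Gaussian mixture over $y$; training minimizes $-\log p(y|\mathbf{x})$ using a log-sum-exp over components. Layer sizes and $K$ are arbitrary choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDN(nn.Module):
    """Mixture density network: maps x to the parameters of a K-component GMM over y."""
    def __init__(self, in_features=1, hidden=64, K=5):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_features, hidden), nn.Tanh())
        self.pi_head = nn.Linear(hidden, K)         # mixture logits
        self.mu_head = nn.Linear(hidden, K)         # component means
        self.log_sigma_head = nn.Linear(hidden, K)  # log standard deviations

    def forward(self, x):
        h = self.body(x)
        log_pi = F.log_softmax(self.pi_head(h), dim=-1)   # enforces sum_k pi_k = 1
        mu = self.mu_head(h)
        sigma = torch.exp(self.log_sigma_head(h))         # enforces sigma_k > 0
        return log_pi, mu, sigma

def mdn_nll(log_pi, mu, sigma, y):
    # -log p(y|x) = -logsumexp_k [ log pi_k + log N(y; mu_k, sigma_k^2) ]
    log_prob = torch.distributions.Normal(mu, sigma).log_prob(y)   # y broadcasts over K
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()

# y of shape (batch, 1) broadcasts against the (batch, K) component parameters.
model = MDN()
x, y = torch.rand(64, 1), torch.rand(64, 1)
loss = mdn_nll(*model(x), y)
```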


class: middle

Illustration

Let us consider training data generated randomly as $$y_i = \mathbf{x}_i + 0.3\sin(4\pi \mathbf{x}_i) + \epsilon_i$$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ for some small noise level $\sigma$.


class: middle

.center[

.width-55[]

The data can be fit with a 2-layer network producing point estimates for $y$ (demo).

]

.footnote[Credits: David Ha, Mixture Density Networks, 2015.]


class: middle

.center[

.width-55[]

If we flip $\mathbf{x}_i$ and $y_i$, the network struggles: for each input there are now multiple valid outputs, and it predicts an average of them (demo).

]

.footnote[Credits: David Ha, Mixture Density Networks, 2015.]


class: middle

.center[

.width-55[]

A mixture density network models the data correctly, as it predicts for each input a distribution for the output, rather than a point estimate (demo).

]

.footnote[Credits: David Ha, Mixture Density Networks, 2015.]


Normalizing flows

Gaussian mixture models are a flexible way to model multimodal distributions, but their expressiveness is limited by the number of components $K$, which must be large to capture complex densities.

Normalizing flows are a more flexible way to model complex distributions.


class: middle

Change of variables

.center.width-80[]

Assume $p(\mathbf{z})$ is the uniform distribution on the unit cube in $\mathbb{R}^3$ and $\mathbf{x} = f(\mathbf{z}) = 2\mathbf{z}$. Since the total probability mass must be conserved, $$p(\mathbf{x})=p(\mathbf{x}=f(\mathbf{z})) = p(\mathbf{z})\frac{V_\mathbf{z}}{V_\mathbf{x}}=p(\mathbf{z}) \frac{1}{8},$$ where $\frac{1}{8} = \left| \det \left( \begin{matrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{matrix} \right)\right|^{-1}$ is the inverse of the absolute determinant of the Jacobian of the linear transformation $f$.
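
A quick numerical sanity check of this example (an illustration added to these notes, not part of the slides): sampling $\mathbf{z}$ uniformly on the unit cube and applying $f(\mathbf{z}) = 2\mathbf{z}$ should give a density of $1/8$ on $[0, 2]^3$.

```python
import torch

# z uniform on the unit cube in R^3, x = f(z) = 2z, so x is uniform on [0, 2]^3.
z = torch.rand(100_000, 3)
x = 2.0 * z

# |det J_f|^{-1} = |det(2 I_3)|^{-1} = 1/8, hence p(x) = p(z) / 8 = 1/8.
J = 2.0 * torch.eye(3)
print(1.0 / torch.det(J).abs())          # tensor(0.1250)

# Empirical check: fraction of samples in a small box, divided by the box volume.
in_box = (x < 0.5).all(dim=-1)
print(in_box.float().mean() / 0.5 ** 3)  # approximately 0.125
```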

???

Motivate that picking a parametric family of distributions is not always easy. We want something more flexible.


class: middle

What if $f$ is non-linear?

.center.width-70[]

.footnote[Image credits: Simon J.D. Prince, Understanding Deep Learning, 2023.]


class: middle

Change of variables theorem

If $f$ is non-linear,

  • the Jacobian $J_f(\mathbf{z})$ of $\mathbf{x} = f(\mathbf{z})$ represents the infinitesimal linear transformation in the neighborhood of $\mathbf{z}$;
  • if the function is a bijective map, then the mass must be conserved locally.

Therefore, the local change of density yields $$p(\mathbf{x}=f(\mathbf{z})) = p(\mathbf{z})\left| \det J_f(\mathbf{z}) \right|^{-1}.$$

Similarly, for $g = f^{-1}$, we have $$p(\mathbf{x})=p(\mathbf{z}=g(\mathbf{x}))\left| \det J_g(\mathbf{x}) \right|.$$

???

The Jacobian matrix of a function f: R^n -> R^m at a point z in R^n is an m x n matrix that represents the linear transformation induced by the function at that point. Geometrically, the Jacobian matrix can be thought of as a matrix of partial derivatives that describes how the function locally stretches or shrinks areas and volumes in the vicinity of the point z.

The determinant of the Jacobian matrix of f at z has a geometric interpretation as the factor by which the function locally scales areas or volumes. Specifically, if the determinant is positive, then the function locally expands areas and volumes, while if it is negative, the function locally contracts areas and volumes. The absolute value of the determinant gives the factor by which the function scales the areas or volumes.


class: middle

Example: coupling layers

Assume $\mathbf{z} = (\mathbf{z}_a, \mathbf{z}_b)$ and $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$. Then,

  • Forward mapping $\mathbf{x} = f(\mathbf{z})$: $$\mathbf{x}_a = \mathbf{z}_a, \quad \mathbf{x}_b = \mathbf{z}_b \odot \exp(s(\mathbf{z}_a)) + t(\mathbf{z}_a),$$
  • Inverse mapping $\mathbf{z} = g(\mathbf{x})$: $$\mathbf{z}_a = \mathbf{x}_a, \quad \mathbf{z}_b = (\mathbf{x}_b - t(\mathbf{x}_a)) \odot \exp(-s(\mathbf{x}_a)),$$

where $s$ and $t$ are arbitrary neural networks.

???

Draw the coupling layer on the blackboard.


class: middle

For $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$, the log-likelihood is $$\begin{aligned}\log p(\mathbf{x}) &= \log p(\mathbf{z}) \left| \det J_f(\mathbf{z}) \right|^{-1}\end{aligned}$$ where the Jacobian $J_f(\mathbf{z}) = \frac{\partial \mathbf{x}}{\partial \mathbf{z}}$ is a lower triangular matrix $$\left( \begin{matrix} \mathbf{I} & 0 \\ \frac{\partial \mathbf{x}_b}{\partial \mathbf{z}_a} & \text{diag}(\exp(s(\mathbf{z}_a))) \end{matrix} \right),$$ such that $\left| \det J_f(\mathbf{z}) \right| = \prod_i \exp(s(\mathbf{z}_a))_i = \exp(\sum_i s(\mathbf{z}_a)_i)$.

Therefore, the log-likelihood is $$\begin{aligned}\log p(\mathbf{x}) &= \log p(\mathbf{z}) - \sum_i s(\mathbf{z}_a)_i\end{aligned}$$
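
The speaker notes suggest coding this layer; here is a minimal PyTorch sketch (an illustration for these notes, not the course code) of an affine coupling layer implementing the forward mapping, its inverse, and the log-determinant $\sum_i s(\mathbf{z}_a)_i$ derived above. The two-layer MLPs for $s$ and $t$ are arbitrary choices.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Affine coupling layer: x_a = z_a, x_b = z_b * exp(s(z_a)) + t(z_a)."""
    def __init__(self, dim_a, dim_b, hidden=64):
        super().__init__()
        self.s = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU(), nn.Linear(hidden, dim_b))
        self.t = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU(), nn.Linear(hidden, dim_b))

    def forward(self, z_a, z_b):
        s, t = self.s(z_a), self.t(z_a)
        x_a, x_b = z_a, z_b * torch.exp(s) + t
        log_det = s.sum(dim=-1)                 # log|det J_f(z)| = sum_i s(z_a)_i
        return x_a, x_b, log_det

    def inverse(self, x_a, x_b):
        s, t = self.s(x_a), self.t(x_a)
        return x_a, (x_b - t) * torch.exp(-s)   # z_a = x_a, z_b recovered exactly
```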


class: middle

Normalizing flows

A normalizing flow is a change of variable $f$ that transforms a base distribution $p(\mathbf{z})$ into $p(\mathbf{x})$ through a discrete sequence of invertible transformations.


.center.width-100[![](figures/lec10/FlowTransformLayers.svg)]

.footnote[Image credits: Simon J.D. Prince, Understanding Deep Learning, 2023.]


class: middle

Formally, $$\begin{aligned} &\mathbf{z}_0 \sim p(\mathbf{z}) \\ &\mathbf{z}_k = f_k(\mathbf{z}_{k-1}), \quad k=1,...,K \\ &\mathbf{x} = \mathbf{z}_K = f_K \circ ... \circ f_1(\mathbf{z}_0). \end{aligned}$$

The change of variable theorem yields $$\log p(\mathbf{x}) = \log p(\mathbf{z}_0) - \sum_{k=1}^K \log \left| \det J_{f_k}(\mathbf{z}_{k-1}) \right|.$$
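
Reusing the `AffineCoupling` sketch above (again an illustration, not the course code), the log-likelihood of a stack of coupling layers can be evaluated by running the layers in reverse and accumulating the log-determinants. Practical flows also permute or swap the two halves between layers, which is omitted here for brevity.

```python
import torch

def flow_log_prob(x, layers, base):
    """log p(x) = log p(z_0) - sum_k log|det J_{f_k}(z_{k-1})|."""
    d = x.shape[-1] // 2
    log_det = torch.zeros(x.shape[0])
    z_a, z_b = x[:, :d], x[:, d:]
    for layer in reversed(layers):
        # For a coupling layer, z_a = x_a, so s can be evaluated before inverting.
        log_det = log_det + layer.s(z_a).sum(dim=-1)
        z_a, z_b = layer.inverse(z_a, z_b)
    z = torch.cat([z_a, z_b], dim=-1)
    return base.log_prob(z).sum(dim=-1) - log_det

# Usage sketch: fit the flow by minimizing the negative log-likelihood.
layers = [AffineCoupling(2, 2) for _ in range(4)]
base = torch.distributions.Normal(0.0, 1.0)     # factorized standard Gaussian base
x = torch.randn(16, 4)
nll = -flow_log_prob(x, layers, base).mean()
```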


class: middle

.center.width-90[]

.center[Normalizing flows can fit complex multimodal discontinuous densities.]

.footnote[Image credits: Wehenkel and Louppe, 2019.]


class: middle

Conditional normalizing flows

Normalizing flows can also estimate densities $p(\mathbf{x} | c)$ conditioned on a context $c$.

  • Transformations are made conditional by taking $c$ as an additional input. For example, in a coupling layer, the networks can be upgraded to $s(\mathbf{z}, c)$ and $t(\mathbf{z}, c)$.
  • Optionally, the base distribution $p(\mathbf{z})$ can also be made conditional on $c$.

(Accordingly, aleatoric uncertainty of some output $y$ conditioned on an input $\mathbf{x}$ can be modelled by a conditional normalizing flow $p(y|\mathbf{x})$ where the context $c$ is the input $\mathbf{x}$.)


class: middle

.center.width-100[]

.footnote[Image credits: Winkler et al, 2019.]


class: middle

Continuous-time normalizing flows

.grid[ .kol-1-2[ Replace the discrete sequence of transformations with a neural ODE with reversible dynamics such that $$\begin{aligned} &\mathbf{z}_0 \sim p(\mathbf{z})\\ &\frac{d\mathbf{z}(t)}{dt} = f(\mathbf{z}(t), t, \theta)\\ &\mathbf{x} = \mathbf{z}(1) = \mathbf{z}_0 + \int_0^1 f(\mathbf{z}(t), t, \theta) dt. \end{aligned}$$ ] .kol-1-2.center[ ] ]

The instantaneous change of variable yields $$\log p(\mathbf{x}) = \log p(\mathbf{z}(0)) - \int_0^1 \text{Tr} \left( \frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \mathbf{z}(t)} \right) dt.$$

.footnote[Image credits: Grathwohl et al, 2018.]


class: middle

Epistemic uncertainty


class: middle

Epistemic uncertainty accounts for uncertainty in the model or in its parameters. It captures our ignorance about which model can best explain the collected data. It .bold[can be explained away] given enough data.

.footnote[Credits: Kendall and Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, 2017.]

???

Once we have decided on a model of the true data generating process, we face uncertainty in how much we can trust the model or its parameters.


Bayesian neural networks

To capture epistemic uncertainty in a neural network, we model our ignorance with a prior distribution $p(\mathbf{\omega})$ over its weights and estimate the posterior distribution $p(\mathbf{\omega}|\mathbf{d})$ given the training set $\mathbf{d}$.



.center[ .width-60[]      .circle.width-30[] ]


class: middle

The prior predictive distribution at $\mathbf{x}$ is given by integrating over all possible weight configurations, $$p(y|\mathbf{x}) = \int p(y|\mathbf{x}, \mathbf{\omega}) p(\mathbf{\omega}) d\mathbf{\omega}.$$

Given training data $\mathbf{d}=\{(\mathbf{x}_1, y_1), ..., (\mathbf{x}_N, y_N)\}$, a Bayesian update results in the posterior $$p(\mathbf{\omega}|\mathbf{d}) = \frac{p(\mathbf{d}|\mathbf{\omega})p(\mathbf{\omega})}{p(\mathbf{d})},$$ where the likelihood is $p(\mathbf{d}|\omega) = \prod_i p(y_i | \mathbf{x}_i, \omega).$

The posterior predictive distribution is then given by $$p(y|\mathbf{x},\mathbf{d}) = \int p(y|\mathbf{x}, \mathbf{\omega}) p(\mathbf{\omega}|\mathbf{d}) d\mathbf{\omega}.$$


class: middle

Bayesian neural networks are easy to formulate, but notoriously .bold[difficult] to perform inference in.

$p(\mathbf{d})$ is intractable to evaluate, which results in the posterior $p(\mathbf{\omega}|\mathbf{d})$ not being tractable either.

Therefore, we must rely on approximations.


Variational inference

Variational inference can be used to build an approximation $q(\mathbf{\omega};\nu)$ of the posterior $p(\mathbf{\omega}|\mathbf{d})$.

We can show that minimizing $$\text{KL}(q(\mathbf{\omega};\nu) || p(\mathbf{\omega}|\mathbf{d}))$$ with respect to the variational parameters $\nu$ is equivalent to maximizing the evidence lower bound objective (ELBO) $$\text{ELBO}(\nu) = \mathbb{E}_{q(\mathbf{\omega};\nu)} \left[\log p(\mathbf{d}| \mathbf{\omega})\right] - \text{KL}(q(\mathbf{\omega};\nu) || p(\mathbf{\omega})).$$

???

Do it on the blackboard.


class: middle

The integral in the ELBO is not tractable for almost all $q$, but it can be maximized with stochastic gradient ascent:

  1. Sample $\hat{\omega} \sim q(\mathbf{\omega};\nu)$.
  2. Do one step of maximization with respect to $\nu$ on $$\hat{L}(\nu) = \log p(\mathbf{d}|\hat{\omega}) - \log\frac{q(\hat{\omega};\nu)}{p(\hat{\omega})} $$

In the context of Bayesian neural networks, this procedure is also known as Bayes by backprop (Blundell et al, 2015).
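
A minimal sketch of this procedure for a single weight vector, with a factorized Gaussian variational posterior and a standard Gaussian prior; the data log-likelihood below is a placeholder to be replaced by the model's own. This is an illustration written for these notes, not the Blundell et al. implementation.

```python
import torch
import torch.nn.functional as F

# Variational parameters nu = (mu, rho), with q(w; nu) = N(mu, softplus(rho)^2).
mu = torch.zeros(10, requires_grad=True)
rho = torch.full((10,), -3.0, requires_grad=True)
prior = torch.distributions.Normal(0.0, 1.0)
opt = torch.optim.Adam([mu, rho], lr=1e-2)

def log_likelihood(w):
    # Placeholder for log p(d | w); plug in the network's data log-likelihood here.
    return -0.5 * (w ** 2).sum()

# One step of stochastic ELBO maximization ("Bayes by backprop").
q = torch.distributions.Normal(mu, F.softplus(rho))
w_hat = q.rsample()                                   # reparameterized sample of omega
elbo_hat = log_likelihood(w_hat) - (q.log_prob(w_hat) - prior.log_prob(w_hat)).sum()
opt.zero_grad()
(-elbo_hat).backward()                                # maximize the ELBO
opt.step()
```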


Dropout

Dropout is an empirical technique that was first proposed to avoid overfitting in neural networks.

At each training step:

  • Remove each node in the network with a probability $p$.
  • Update the weights of the remaining nodes with backpropagation.

.center.width-70[]

???

Remind the students we used Dropout in Lec 8 when implementing a Transformer.


class: middle

At test time, either:

  • Make predictions using the trained network without dropout, rescaling the weights by the keep probability $1-p$ (fast and standard).
  • Sample $T$ neural networks using dropout and average their predictions (slower but more principled).

class: middle, center

.width-100[]


class: middle

Why does dropout work?

  • It makes the learned weights of a node less sensitive to the weights of the other nodes.
  • This forces the network to learn several independent representations of the patterns and thus decreases overfitting.
  • It approximates Bayesian model averaging.

class: middle

Dropout does variational inference

What variational family $q$ would correspond to dropout?

  • Let us split the weights $\omega$ per layer, $\omega = \{ \mathbf{W}_1, ..., \mathbf{W}_L \},$ where $\mathbf{W}_i$ is further split per unit $\mathbf{W}_i = \{ \mathbf{w}_{i,1}, ..., \mathbf{w}_{i,q_i} \}.$
  • Variational parameters $\nu$ are split similarly into $\nu = \{ \mathbf{M}_1, ..., \mathbf{M}_L \}$, with $\mathbf{M}_i = \{ \mathbf{m}_{i,1}, ..., \mathbf{m}_{i,q_i} \}$.
  • Then, the proposed $q(\omega;\nu)$ is defined as follows: $$ \begin{aligned} q(\omega;\nu) &= \prod_{i=1}^L q(\mathbf{W}_i; \mathbf{M}_i) \\ q(\mathbf{W}_i; \mathbf{M}_i) &= \prod_{k=1}^{q_i} q(\mathbf{w}_{i,k}; \mathbf{m}_{i,k}) \\ q(\mathbf{w}_{i,k}; \mathbf{m}_{i,k}) &= p\delta_0(\mathbf{w}_{i,k}) + (1-p)\delta_{\mathbf{m}_{i,k}}(\mathbf{w}_{i,k}) \end{aligned} $$ where $\delta_a(x)$ denotes a (multivariate) Dirac distribution centered at $a$.

???

Note that this assumes the parameterization $\mathbf{h} = \mathbf{W}\mathbf{x}$, without the transpose on $\mathbf{W}$.


class: middle

Given the previous definition for $q$, sampling parameters $\hat{\omega} = \{ \hat{\mathbf{W}}_1, ..., \hat{\mathbf{W}}_L \}$ is done as follows:

  • Draw binary $z_{i,k} \sim \text{Bernoulli}(1-p)$ for each layer $i$ and unit $k$.
  • Compute $\hat{\mathbf{W}}_i = \mathbf{M}_i \text{diag}([z_{i,k}]_{k=1}^{q_{i-1}})$, where $\mathbf{M}_i$ denotes a matrix composed of the columns $\mathbf{m}_{i,k}$.

.grid[ .kol-3-5[ That is, $\hat{\mathbf{W}}_i$ are obtained by setting columns of $\mathbf{M}_i$ to zero with probability $p$.

This is strictly equivalent to dropout, i.e. removing units from the network with probability $p$.

] .kol-2-5[.center.width-100[]] ]
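
A tiny numerical illustration of this equivalence, with arbitrary shapes chosen for the example: masking columns of $\mathbf{M}_i$ with Bernoulli variables is the same computation as applying dropout to the layer's inputs.

```python
import torch

p = 0.5
M = torch.randn(4, 3)                           # M_i: 4 output units, 3 input units
z = torch.bernoulli(torch.full((3,), 1.0 - p))  # z_k ~ Bernoulli(1 - p)
h = torch.randn(3)                              # activations of the previous layer

W_hat = M @ torch.diag(z)                       # zero out columns of M with probability p

out_weights = W_hat @ h                         # view 1: sampled weights W_hat
out_dropout = M @ (z * h)                       # view 2: dropout applied to the inputs
print(torch.allclose(out_weights, out_dropout)) # True
```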


class: middle

Therefore, one step of stochastic gradient ascent on the ELBO becomes:

  1. Sample $\hat{\omega} \sim q(\mathbf{\omega};\nu)$ $\Leftrightarrow$ Randomly set units of the network to zero $\Leftrightarrow$ Dropout.
  2. Do one step of maximization with respect to $\nu = \{ \mathbf{M}_i \}$ on $$\hat{L}(\nu) = \log p(\mathbf{d}|\hat{\omega}) - \text{KL}(q(\mathbf{\omega};\nu) || p(\mathbf{\omega})).$$

class: middle

Maximizing $\hat{L}(\nu)$ is equivalent to minimizing $$-\hat{L}(\nu) = -\log p(\mathbf{d}|\hat{\omega}) + \text{KL}(q(\mathbf{\omega};\nu) || p(\mathbf{\omega})) $$

This is also equivalent to one minimization step of a standard classification or regression objective:

  • The first term is the typical objective (such as the cross-entropy).
  • The second term forces $q$ to remain close to the prior $p(\omega)$.
    • If $p(\omega)$ is Gaussian, minimizing the $\text{KL}$ is equivalent to $\ell_2$ regularization.
    • If $p(\omega)$ is Laplacian, minimizing the $\text{KL}$ is equivalent to $\ell_1$ regularization.

class: middle

Conversely, this shows that when training a network with dropout on a standard classification or regression objective, one is implicitly doing variational inference to approximate the posterior distribution of the weights.


class: middle

Uncertainty estimates from dropout

Proper uncertainty estimates at $\mathbf{x}$, accounting for both the aleatoric and epistemic uncertainties, can be obtained in a principled way using Monte-Carlo integration:

  • Draw $T$ sets of network parameters $\hat{\omega}_t$ from $q(\omega;\nu)$.
  • Compute the predictions for the $T$ networks, $\{ f(\mathbf{x};\hat{\omega}_t) \}_{t=1}^T$.
  • Approximate the predictive mean and variance as $$ \begin{aligned} \mathbb{E}_{p(y|\mathbf{x},\mathbf{d})}\left[y\right] &\approx \frac{1}{T} \sum_{t=1}^T f(\mathbf{x};\hat{\omega}_t) \\ \mathbb{V}_{p(y|\mathbf{x},\mathbf{d})}\left[y\right] &\approx \sigma^2 + \frac{1}{T} \sum_{t=1}^T f(\mathbf{x};\hat{\omega}_t)^2 - \hat{\mathbb{E}}\left[y\right]^2, \end{aligned} $$ where $\sigma^2$ is the assumed level of noise in the observational model.
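
A short MC-dropout sketch following these steps (illustrative architecture and constants; `noise_var` stands for the assumed observation noise $\sigma^2$): dropout is kept active at test time so that each forward pass samples a different set of weights.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(64, 1))

def mc_dropout_predict(net, x, T=100, noise_var=0.01):
    net.train()                                          # keep dropout stochastic at test time
    with torch.no_grad():
        preds = torch.stack([net(x) for _ in range(T)])  # shape (T, batch, 1)
    mean = preds.mean(dim=0)
    var = noise_var + preds.pow(2).mean(dim=0) - mean.pow(2)   # sigma^2 + E[f^2] - E[f]^2
    return mean, var

x = torch.linspace(-1, 1, 50).unsqueeze(-1)
mean, var = mc_dropout_predict(net, x)
```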

class: middle, center

.center.width-80[]

(demo)


class: middle

Pixel-wise depth regression

.center.width-80[]

.footnote[Credits: Kendall and Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, 2017.]


exclude: true

Bayesian Infinite Networks

Consider the 1-layer MLP with a hidden layer of size $q$ and a bounded activation function $\sigma$:

$$\begin{aligned} f(x) &= b + \sum_{j=1}^q v_j h_j(x)\\ h_j(x) &= \sigma\left(a_j + \sum_{i=1}^p u_{i,j}x_i\right) \end{aligned}$$

Assume Gaussian priors $v_j \sim \mathcal{N}(0, \sigma_v^2)$, $b \sim \mathcal{N}(0, \sigma_b^2)$, $u_{i,j} \sim \mathcal{N}(0, \sigma_u^2)$ and $a_j \sim \mathcal{N}(0, \sigma_a^2)$.


exclude: true
class: middle

For a fixed value $x^{(1)}$, let us consider the prior distribution of $f(x^{(1)})$ implied by the prior distributions for the weights and biases.

We have $$\mathbb{E}[v_j h_j(x^{(1)})] = \mathbb{E}[v_j] \mathbb{E}[h_j(x^{(1)})] = 0,$$ since $v_j$ and $h_j(x^{(1)})$ are statistically independent and $v_j$ has zero mean by hypothesis.

The variance of the contribution of each hidden unit $h_j$ is $$\begin{aligned} \mathbb{V}[v_j h_j(x^{(1)})] &= \mathbb{E}[(v_j h_j(x^{(1)}))^2] - \mathbb{E}[v_j h_j(x^{(1)})]^2 \\ &= \mathbb{E}[v_j^2] \mathbb{E}[h_j(x^{(1)})^2] \\ &= \sigma_v^2 \mathbb{E}[h_j(x^{(1)})^2], \end{aligned}$$ which must be finite since $h_j$ is bounded by its activation function.

We define $V(x^{(1)}) = \mathbb{E}[h_j(x^{(1)})^2]$, which is the same for all $j$.


exclude: true
class: middle

What if $q \to \infty$?

By the Central Limit Theorem, as $q \to \infty$, the total contribution of the hidden units, $\sum_{j=1}^q v_j h_j(x)$, to the value of $f(x^{(1)})$ becomes a Gaussian with variance $q \sigma_v^2 V(x^{(1)})$.

The bias $b$ is also Gaussian, of variance $\sigma_b^2$, so for large $q$, the prior distribution of $f(x^{(1)})$ is a Gaussian of variance $\sigma_b^2 + q \sigma_v^2 V(x^{(1)})$.


exclude: true
class: middle

Accordingly, for $\sigma_v = \omega_v q^{-\frac{1}{2}}$, for some fixed $\omega_v$, the prior distribution of $f(x^{(1)})$ converges to a Gaussian of mean zero and variance $\sigma_b^2 + \omega_v^2 V(x^{(1)})$ as $q \to \infty$.

For two or more fixed values $x^{(1)}, x^{(2)}, ...$, a similar argument shows that, as $q \to \infty$, the joint distribution of the outputs converges to a multivariate Gaussian with means of zero and covariances of $$\begin{aligned} \mathbb{E}[f(x^{(1)})f(x^{(2)})] &= \sigma_b^2 + \sum_{j=1}^q \sigma_v^2 \mathbb{E}[h_j(x^{(1)}) h_j(x^{(2)})] \\ &= \sigma_b^2 + \omega_v^2 C(x^{(1)}, x^{(2)}) \end{aligned}$$ where $C(x^{(1)}, x^{(2)}) = \mathbb{E}[h_j(x^{(1)}) h_j(x^{(2)})]$ and is the same for all $j$.


exclude: true
class: middle

This result states that for any set of fixed points $x^{(1)}, x^{(2)}, ...$, the joint distribution of $f(x^{(1)}), f(x^{(2)}), ...$ is a multivariate Gaussian.

In other words, the infinitely wide 1-layer MLP converges towards a Gaussian process.


.center.width-80[]

.center[(Neal, 1995)]


class: end-slide, center
count: false

The end.