
Transformer

The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution.

Transformer blocks are characterized by a multi-head self-attention mechanism, a position-wise feed-forward network, layer normalization modules, and residual connections.

Attention Mechanism

Attention mechanisms in neural networks, otherwise known as neural attention or just attention, have recently attracted a lot of attention (pun intended).

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.


The attention distribution is a probability distribution that describes how much attention we pay to each element in a sequence for some specific task.

For example, suppose we have a query vector $\mathbf{q}$ associated with the task and a list of input vectors $\mathbf{X}=[\mathbf{x}_1, \mathbf{x}_2,\cdots, \mathbf{x}_N]$. We can select the input vectors smoothly with the weights $$\alpha_i=p(z=i\mid \mathbf{X}, \mathbf{q})=\frac{\exp(s(\mathbf{x}_i, \mathbf{q}))}{\sum_{j=1}^N\exp(s(\mathbf{x}_j, \mathbf{q}))}$$ where $\alpha_i$ is the attention distribution, $s(\cdot,\cdot)$ is the attention scoring function, and $z$ is the index of a vector in $\mathbf{X}$. "Scaled Dot-Product Attention" uses a scaled dot product as the scoring function.


The attention function between different input vectors is calculated as follows (a minimal code sketch appears after the list):

  1. Step 1: Compute scores between different input vectors and the query vector, $S_N$;
  2. Step 2: Translate the scores into probabilities, such as $P = \operatorname{softmax}(S_N)$;
  3. Step 3: Obtain the output as an aggregation, such as the weighted sum $Z = \mathbb{E}_{z\sim p(z\mid \mathbf{X}, \mathbf{q})}[\mathbf{x}]$.
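
As a concrete illustration of these three steps, here is a minimal NumPy sketch using scaled dot-product scoring; the function names and shapes are illustrative assumptions, not taken from any library:

```python
import numpy as np

def softmax(s):
    """Step 2: translate scores into a probability distribution."""
    e = np.exp(s - np.max(s))  # subtract the max for numerical stability
    return e / e.sum()

def attend(X, q):
    """X: (N, d) input vectors; q: (d,) query vector."""
    d = q.shape[0]
    scores = X @ q / np.sqrt(d)  # Step 1: scaled dot-product scores S_N
    alpha = softmax(scores)      # Step 2: attention distribution alpha_i
    return alpha @ X             # Step 3: weighted sum E_{z~p}[x]

X = np.random.randn(5, 8)  # N = 5 input vectors of dimension d = 8
q = np.random.randn(8)
print(attend(X, q).shape)  # (8,)
```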

There are diverse scoring functions and probability translation functions, which compute the attention distribution in different ways.

Efficient Attention and Linear Attention apply more efficient methods to generate the attention weights.

Key-value Attention Mechanism and Self-Attention use different input sequences, as follows: $$\operatorname{att}(\mathbf{K}, \mathbf{V}, \mathbf{q}) =\sum_{j=1}^N\frac{s(\mathbf{K}_j, \mathbf{q})\mathbf{V}_j}{\sum_{i=1}^N s(\mathbf{K}_i, \mathbf{q})}$$ where $\mathbf{K}$ is the key matrix, $\mathbf{V}$ is the value matrix, and $s(\cdot, \cdot)$ is a positive similarity function.
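
A short sketch of this key-value form, assuming $s(\mathbf{k}, \mathbf{q}) = \exp(\mathbf{k}^\top\mathbf{q})$ as the positive similarity (which recovers softmax weights); the names and shapes are illustrative:

```python
import numpy as np

def kv_attention(K, V, q):
    """K: (N, d_k) keys; V: (N, d_v) values; q: (d_k,) query."""
    logits = K @ q
    s = np.exp(logits - logits.max())  # positive similarity s(K_j, q)
    return (s @ V) / s.sum()           # sum_j s(K_j, q) V_j / sum_i s(K_i, q)
```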

Each input token in self-attention receives three representations corresponding to the roles it can play:

  • query - asking for information;
  • key - saying that it has some information;
  • value - giving the information.

We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$ (the square root of the key dimension), and apply a softmax function to obtain the weights on the values: $$\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$


Soft Attention: the alignment weights are learned and placed "softly" over all patches in the source image; essentially the same type of attention as in Bahdanau et al., 2015. Each output is derived from an attention-averaged input.

  • Pro: the model is smooth and differentiable.
  • Con: expensive when the source input is large.

Hard Attention: only selects one patch of the image to attend to at a time, attending to exactly one input state per output.

  • Pro: less computation at inference time.
  • Con: the model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train. (Luong et al., 2015)

Soft Attention Mechanism

The soft attention mechanism outputs the weighted sum of input vectors using a differentiable scoring function:

$$\operatorname{att}(\mathbf{X}, \mathbf{q}) = \mathbb{E}_{z\sim p(z\mid \mathbf{X}, \mathbf{q})}[\mathbf{x}]$$

where $p(z\mid \mathbf{X}, \mathbf{q} )$ is the attention distribution.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as: $$\operatorname{Attention}(Q, K, V)= [\mathbb{E}_{z\sim p(z\mid \mathbf{K}, \mathbf{Q}_1)}[\mathbf{V}],\cdots,\mathbb{E}_{z\sim p(z\mid \mathbf{K}, \mathbf{Q}_i)}[\mathbf{V}],\cdots, \mathbb{E}_{z\sim p(z\mid \mathbf{K}, \mathbf{Q}_N)}[\mathbf{V}]].$$
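
This matrix form can be sketched directly in NumPy; the following is a hedged illustration with a row-wise softmax, not a library implementation:

```python
import numpy as np

def attention(Q, K, V):
    """Q: (N, d_k) queries; K: (M, d_k) keys; V: (M, d_v) values -> (N, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (N, M) compatibility scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)            # row-wise softmax: p(z | K, Q_i)
    return P @ V                                  # row i is E_{z~p(z|K,Q_i)}[V]
```

Each row of the result is the expectation of the value vectors under the attention distribution induced by the corresponding query.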


Hard Attention Mechanism

The hard attention mechanism selects the most likely vector as the output: $$\operatorname{att}(\mathbf{X}, \mathbf{q}) = \mathbf{x}_j$$ where $j=\arg\max_{i}\alpha_i$.

It is trained using sampling methods or reinforcement learning.
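
In code, the forward pass reduces to an argmax over the attention distribution; the argmax is what breaks differentiability and forces the sampling or reinforcement-learning training mentioned above (a minimal sketch with illustrative names):

```python
import numpy as np

def hard_attend(X, alpha):
    """X: (N, d) input vectors; alpha: (N,) attention distribution."""
    j = np.argmax(alpha)  # j = argmax_i alpha_i (non-differentiable)
    return X[j]           # select exactly one input vector
```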

Sparse Attention Mechanism

The softmax mapping is elementwise proportional to $\exp$, therefore it can never assign a weight of exactly zero. Thus, unnecessary items are still taken into consideration to some extent. Since its output sums to one, this invariably means less weight is assigned to the relevant items, potentially harming performance and interpretability.

The sparse attention mechanism aims to generate a sparse attention distribution as a trade-off between soft attention and hard attention.
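
One concrete instance is sparsemax (Martins & Astudillo, 2016), which projects the scores onto the probability simplex and can assign exactly zero weight; the sketch below is a minimal NumPy version of that projection, offered as one example rather than the only sparse attention scheme:

```python
import numpy as np

def sparsemax(z):
    """Map scores z to a (possibly sparse) probability distribution."""
    z_sorted = np.sort(z)[::-1]           # scores in decreasing order
    cssv = np.cumsum(z_sorted)            # cumulative sums of sorted scores
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv     # candidate support sizes
    k_z = k[support][-1]                  # largest valid support size
    tau = (cssv[k_z - 1] - 1) / k_z       # threshold
    return np.maximum(z - tau, 0.0)       # entries below tau become exactly 0

print(sparsemax(np.array([3.0, 1.0, 0.1])))  # -> [1. 0. 0.]: exact zeros
```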

Graph Attention Networks

GAT introduces the attention mechanism as a substitute for the statically normalized convolution operation.
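
To make this concrete, here is a hedged NumPy sketch of how attention coefficients over a node's neighborhood can be computed in the style of GAT (Veličković et al., 2018), with learned parameters $W$ and $\mathbf{a}$; the function shape and names are illustrative assumptions:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_coefficients(h, i, neighbors, W, a):
    """h: (N, F) node features; W: (F2, F) weight matrix; a: (2*F2,) attention vector."""
    Wh = h @ W.T  # transform every node's features
    e = np.array([
        leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))  # e_ij = LeakyReLU(a^T [Wh_i || Wh_j])
        for j in neighbors
    ])
    e -= e.max()
    alpha = np.exp(e) / np.exp(e).sum()  # softmax over node i's neighborhood
    return alpha  # dynamic, input-dependent weights replacing static normalization
```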