The Transformer model implemented from scratch using PyTorch. The model uses weight sharing between the embedding layers and the pre-softmax linear layer. Training on the Multi30k machine translation task is shown.

TRANSFORMER

The transformer model was first introduced in the paper:

  • "Attention is all you need" by Vaswani et. al., (here)

For a thorough, in-depth discussion of the implementation of the Transformer you can check out a blog post that I wrote about it.

The transformer is used for modelling sequence-to-sequence tasks (like machine translation), where the input sequence is encoded using an encoder and then the output sequence is produced using a decoder.

TOKEN EMBEDDING

Before feeding the sequence to the transformer we pass the tokens through a standard embedding layer. We also need to encode the order of the sequence, since the model has no built-in notion of position. For this we use a position embedding layer that maps each sequence index to a vector. The word embedding and the position embedding are added together, and dropout is applied to produce the final token embedding.

$$\begin{aligned} TokenEmbed(elem_i): \\ & emb_i = WordEmbed(elem_i) + PosEmbed(i) \\ & x_i = Dropout(emb_i) \end{aligned}$$

"Embedding"

ATTENTION

The query, key and value linear layers ($Q$, $K$, $V$) are used to encode the input. Dot-products between the query and key vectors produce the attention scores, which are used to compute a weighted sum of the value vectors. Dropout is applied to the attention weights directly before the summation.

"Attention"

Each self-attention block has several sets of $Q$, $K$ and $V$ layers. Each set is called an attention head.
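
A minimal multi-head attention sketch along these lines is shown below. The projection names, the mask convention, and the exact tensor layout are assumptions chosen to match the description above, not the repo's exact code; dropout is applied to the attention weights just before the weighted sum, as described.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Multi-head scaled dot-product attention with dropout on the
    attention weights. Illustrative sketch, not the repo's exact layout."""

    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, mask=None):
        B, T, _ = queries.shape
        S = keys.shape[1]
        # Project and split into heads: (B, n_heads, seq, d_head).
        q = self.q_proj(queries).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(keys).view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(values).view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product scores, optionally masked (mask assumed broadcastable).
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5    # (B, h, T, S)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        # Dropout on the attention weights, directly before the summation.
        weights = self.dropout(F.softmax(scores, dim=-1))
        # Weighted sum of the values, then merge the heads back together.
        out = (weights @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out)
```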

ENCODER

The encoder consists of $N$ identical blocks applied one after another. Each encoder block has two sub-layers: a self-attention layer followed by a position-wise fully-connected network. Layer normalization is applied before each sub-layer and dropout is applied after it. Finally, a residual connection wraps both the self-attention and the fully-connected sub-layers.

$$\begin{aligned} Encoder(x): \\ & z = x + Dropout(Attn(Norm(x))) \\ & r = z + Dropout(FFCN(Norm(z))) \end{aligned}$$

"Encoder"

DECODER

The decoder is also composed of $N$ identical blocks. The decoder block is very similar to the encoder block, but with two differences:

  1. The decoder uses masked self-attention, meaning that current sequence elements cannot attend to future elements.
  2. In addition to the two sub-layers, the decoder uses a third sub-layer, which performs cross-attention between the decoded sequence and the outputs of the encoder.

$$\begin{aligned} Decoder(x): \\ & z = x + Dropout(MaskAttn(Norm(x))) \\ & c = z + Dropout(CrossAttn(Norm(z))) \\ & r = c + Dropout(FFCN(Norm(c))) \end{aligned}$$

"Decoder"

TRANSFORMER

The full transformer model is constructed by feeding the outputs of the final encoder block into the cross-attention layer of every decoder block. The outputs of the last decoder block are then passed through a linear layer to produce scores over the target vocabulary.

"Transformer"
