
# Positional Encoding in Self-Attention

Take a look at different positional encoding schemes in self-attention:

## Copy-Task

The toy model solves a copy task: the goal is to copy the sequence that appears before the `<copy>` token into the positions after it.

e.g.:

```
1 7 2 <copy> _ _ _ _ _ _ → 1 7 2 <copy> 1 7 2 _ _ _
9 <copy> _ _ _ _ _ _ _ _ → 9 <copy> 9 _ _ _ _ _ _ _
2 2 4 3 <copy> _ _ _ _ _ → 2 2 4 3 <copy> 2 2 4 3 _
1 2 3 4 5 6 7 <copy> _ _ → 1 2 3 4 5 6 7 <copy> 1 2
```
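A minimal sketch of how such copy-task examples could be generated; the special-token ids, vocabulary, and function name below are illustrative assumptions rather than the repository's actual code:

```python
import random

PAD, COPY = 0, 1             # assumed ids for the padding and <copy> tokens
VOCAB = list(range(2, 12))   # assumed ids for the digit tokens

def make_example(max_len=10):
    """Build one copy-task pair: the input holds a sequence, <copy>, then padding;
    the target repeats the sequence right after <copy> (truncated to max_len)."""
    seq_len = random.randint(1, max_len - 1)          # leave room for <copy>
    seq = [random.choice(VOCAB) for _ in range(seq_len)]
    prefix = seq + [COPY]
    x = (prefix + [PAD] * max_len)[:max_len]          # e.g. "1 7 2 <copy> _ _ ..."
    y = (prefix + seq + [PAD] * max_len)[:max_len]    # e.g. "1 7 2 <copy> 1 7 2 ..."
    return x, y

x, y = make_example()
print(x, "→", y)
```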

## Results

The models are trained for 2000 epochs with single-headed attention, 2 layers, and an embedding size of 20. Each positional scheme is evaluated 5 times, and we plot the accuracy on the test set.

*(Figure: test-set accuracy for each positional encoding scheme.)*
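For reference, a minimal sketch of a model with that shape, assuming a standard PyTorch `nn.TransformerEncoder` and a learned positional encoding as one of the schemes; the class and constant names are assumptions, not the repository's code:

```python
import torch
import torch.nn as nn

EMBED_SIZE, N_LAYERS, N_HEADS = 20, 2, 1   # settings described above
SEQ_LEN, VOCAB_SIZE = 10, 12               # assumed task dimensions

class CopyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_SIZE)
        # learned positional encoding: one trainable vector per position
        self.pos = nn.Parameter(torch.randn(1, SEQ_LEN, EMBED_SIZE))
        layer = nn.TransformerEncoderLayer(d_model=EMBED_SIZE, nhead=N_HEADS,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.out = nn.Linear(EMBED_SIZE, VOCAB_SIZE)

    def forward(self, x):                    # x: (batch, seq_len) of token ids
        h = self.embed(x) + self.pos[:, :x.size(1)]
        return self.out(self.encoder(h))     # (batch, seq_len, vocab) logits
```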

## Compare attention activations

Running on: `7 1 8 2 <copy> _ _ _ _ _ → 7 1 8 2 <copy> 7 1 8 2 _`

*(Figures: attention_activation_0, attention_activation_1.)*
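One way to obtain attention maps like these is to call an attention module directly with `need_weights=True`; a sketch assuming PyTorch's `nn.MultiheadAttention`, with random tensors standing in for the real activations:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=20, num_heads=1, batch_first=True)
h = torch.randn(1, 10, 20)   # stand-in for embedded + positionally encoded tokens
_, weights = attn(h, h, h, need_weights=True)
print(weights.shape)         # (batch, tgt_len, src_len); each row sums to 1
```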

## Learned positional encoding over time

Positional encodings (PCA to 2D) over time:

*(Figure: learned positional encodings projected to 2D over the course of training.)*
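A sketch of how such a projection can be produced with scikit-learn, assuming the learned positional encodings are saved as an array of shape `(seq_len, embed_size)` (the random data below is only a stand-in):

```python
import numpy as np
from sklearn.decomposition import PCA

pos_enc = np.random.randn(10, 20)   # stand-in for one snapshot of learned encodings

coords = PCA(n_components=2).fit_transform(pos_enc)   # (seq_len, 2)
for i, (px, py) in enumerate(coords):
    print(f"position {i}: ({px:+.2f}, {py:+.2f})")
```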

## Dot product / cosine similarity

*(Figures: attention_random, attention_learned, attention_sinusoidal.)*
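These similarity matrices can be reproduced by comparing every pair of positional encoding vectors; a sketch using a standard sinusoidal encoding as the example scheme (the random and learned encodings plug into the same comparison):

```python
import math
import torch
import torch.nn.functional as F

def sinusoidal(seq_len=10, d=20):
    """Standard sinusoidal positional encoding, shape (seq_len, d)."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d, 2).float() * (-math.log(10000.0) / d))
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

pe = sinusoidal()
dot = pe @ pe.T                                                      # raw dot products
cos = F.cosine_similarity(pe.unsqueeze(1), pe.unsqueeze(0), dim=-1)  # (seq_len, seq_len)
print(cos[0])   # similarity of position 0 to every other position
```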