Multi-head attention (created with DALL·E 3)
A step-by-step derivation and implementation of the GPT architecture from scratch, following the original GPT paper, Improving Language Understanding by Generative Pre-Training (Radford et al. 2018), and the transformer paper, Attention is All You Need (Vaswani et al. 2017). This is mostly a personal exercise to deepen my understanding of multi-head self-attention, the transformer, causal language modelling and unsupervised pre-training, but it can also serve as a guide for anyone interested in deriving the GPT architecture from first principles.
- PyTorch>=2.1.0
The complete derivation walkthrough is in the Jupyter notebook derive-gpt-from-scratch.ipynb.
By the end of the walkthrough, we will have a GPT model that can write Shakespeare-style plays (or gibberish).
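As a taste of what the walkthrough builds up to, here is a minimal sketch of causal multi-head self-attention in PyTorch, the core building block of the GPT decoder. It is illustrative only: the class and parameter names (`CausalSelfAttention`, `d_model`, `n_heads`, `max_len`) are my own and do not necessarily match the code in the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask, as used in a GPT block."""

    def __init__(self, d_model: int, n_heads: int, max_len: int = 256):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # single projection producing queries, keys and values for all heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        # lower-triangular mask: position i may only attend to positions <= i
        self.register_buffer("mask", torch.tril(torch.ones(max_len, max_len)).bool())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape                       # batch, sequence length, d_model
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (B, n_heads, T, d_head) so each head attends independently
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # scaled dot-product attention with the causal mask applied
        att = (q @ k.transpose(-2, -1)) / (self.d_head ** 0.5)
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))
        att = F.softmax(att, dim=-1)
        out = att @ v                           # (B, n_heads, T, d_head)
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

# toy usage: a batch of 2 sequences of length 16 with model width 64
x = torch.randn(2, 16, 64)
print(CausalSelfAttention(d_model=64, n_heads=4)(x).shape)  # torch.Size([2, 16, 64])
```

Stacking this layer with feed-forward blocks, residual connections and layer normalization, then training it on next-token prediction, is what the notebook derives step by step.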
This project references the following resources:
- Improving Language Understanding by Generative Pre-Training (Radford et al. 2018)
- Attention is All You Need (Vaswani et al. 2017)
- GPT Guide by Andrej Karpathy
This project is licensed under the MIT License. Please see the LICENSE file for more details.