
class: middle, center, title-slide

# Deep Learning

Lecture 8: GPT and Large Language Models



Prof. Gilles Louppe
g.louppe@uliege.be

???

R: refresh with Foundation Models


# Today

- BabyGPT
- Large language models

---

class: middle

.center[ See code/gpt/. ]
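For reference, here is a minimal sketch of the core ingredient of such a model, a multi-head self-attention layer with a causal mask, written as an illustrative toy in PyTorch (this is not the code from `code/gpt/`, and all sizes are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask (toy illustration)."""
    def __init__(self, d_model, n_heads, max_len):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        # Lower-triangular mask: position t may only attend to positions <= t.
        mask = torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len)
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q = q.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        k = k.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        v = v.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```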


---

class: middle

# Large language models


---

class: middle

.center.width-100[]

.center[(March 2023)]

.footnote[Credits: lifearchitect.ai/models, 2023.]


---

class: middle

# Decoder-only transformers

The decoder-only transformer has become the de facto architecture for large language models.

These models are trained with self-supervised learning, where the target sequence is the same as the input sequence, but shifted by one token to the right.

.center.width-80[]

.footnote[Credits: Dive Into Deep Learning, 2023.]
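As a minimal sketch of this shifted-target setup (illustrative only; the stand-in `model` below can be replaced by any module producing next-token logits):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 50257, 64          # arbitrary toy sizes
# Stand-in for a decoder-only LM: any module mapping (B, T) token ids to (B, T, vocab) logits.
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))

tokens = torch.randint(0, vocab_size, (8, 129))   # (batch, T+1) token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # targets = inputs shifted by one token
logits = model(inputs)                            # (batch, T, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```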


---

class: middle

Historically, GPT-1 was first pre-trained and then fine-tuned on downstream tasks.

.width-100[]

.footnote[Credits: Radford et al., Improving Language Understanding by Generative Pre-Training, 2018.]


---

class: middle

# Scaling laws

Transformer language model performance improves smoothly as we increase the model size, the dataset size, and the amount of compute used for training.

For optimal performance, all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.
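Concretely, Kaplan et al. report power-law fits of the form (a sketch; the constants are fitted empirically)

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},$$

where $L$ is the test loss and $N$ is the model size, with analogous laws for the dataset size $D$ and the training compute $C$.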

.center.width-100[]

.footnote[Credits: Kaplan et al, 2020.]


---

class: middle

Large models also enjoy better sample efficiency than small models.

- Larger models require less data to achieve the same performance.
- The optimal model size grows smoothly with the amount of compute available for training.

.center.width-100[![](./figures/lec8/scaling-sample-conv.png)]

.footnote[Credits: Kaplan et al, 2020.]


---

class: middle

# In-context learning

GPT-2 and its successors demonstrated the potential of using the same language model for many tasks, .bold[without updating the model weights].

Zero-shot, one-shot, and few-shot learning consist in prompting the model with a task description and zero, one, or a few examples of the target task, and letting it infer what to do from the prompt alone. This paradigm is called in-context learning.
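For instance, a few-shot prompt for translation might look as follows (a hypothetical prompt string, in the style of the GPT-3 examples):

```python
# Few-shot, in-context prompt: the task is specified entirely in the input text.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)
# The model is expected to continue with "fromage", without any weight update.
```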


---

class: middle

.center.width-100[]

.footnote[Credits: Dive Into Deep Learning, 2023.]


---

class: middle, center

(demo)


---

class: middle

# Emergent abilities

As language models grow in size, they start to exhibit emergent abilities that are not present in smaller models.

A (few-shot) prompted task is .bold[emergent] if it achieves random performance for small models and then (suddenly) improves as the model size increases.


---

class: middle

.center.width-100[]

.footnote[Credits: Wei et al, 2022.]


---

class: middle

Notably, chain-of-thought reasoning is an emergent ability of large language models. It improves performance on a wide range of arithmetic, commonsense, and symbolic reasoning tasks.

.center.width-100[]

.footnote[Credits: Wei et al, 2022b.]
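As an illustration, a chain-of-thought prompt includes worked reasoning steps in the exemplar, which the model then imitates (a hypothetical prompt, adapted from the style of Wei et al., 2022b):

```python
# Chain-of-thought prompting: the exemplar answer spells out intermediate steps.
prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 more balls. 5 + 6 = 11. "
    "The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)
# A sufficiently large model tends to produce the intermediate steps
# (23 - 20 = 3, then 3 + 6 = 9) before giving the final answer.
```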


---

class: middle

.center.width-50[]

.footnote[Credits: Wei et al, 2022b.]


---

class: middle

.pull-right[]

# Alignment

Increasing the model size does not inherently make models follow a user's intent better, despite their emergent abilities.

Worse, scaling up the model may increase the likelihood of undesirable behaviors, including those that are harmful, unethical, or biased.


---

class: middle

Human feedback can be used to better align language models with human intent, as shown by InstructGPT.

.center.width-80[]

.footnote[Credits: Ouyang et al, 2022.]
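At the heart of this pipeline is a reward model trained on human preference comparisons. A minimal sketch of its pairwise ranking loss follows (the `reward_model` callable and its signature are assumptions for illustration):

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    """Pairwise ranking loss on human preference data (after Ouyang et al., 2022)."""
    r_chosen = reward_model(prompt, chosen)      # scalar score of the preferred completion
    r_rejected = reward_model(prompt, rejected)  # scalar score of the rejected completion
    # Maximize the log-probability that the preferred completion gets the higher reward.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```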


---

class: middle

.center.width-100[]


---

class: end-slide, center
count: false

The end.