viking-sudo-rm/norm-growth

Code for the transformer norm growth paper:

https://arxiv.org/abs/2010.09697

Setup

We recommend creating a virtual environment, and then running:

pip3 install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt

The code also uses several environment variables to specify paths for training data and locations for saving models and cached data. You need to set these before anything will run. A basic way to do this would be:

export DATA=data  # Path for finding datasets.
export MODELS=/tmp/models  # Path for saving trained models.
export CACHED=/tmp/cached  # Path for saving cached experimental data.
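
For reference, here is a minimal sketch of how a script might resolve these variables (illustrative only; the defaults mirror the exports above, and the repository's actual lookup code may differ):

import os
from pathlib import Path

# Illustrative only: resolve the three path variables the scripts rely on.
DATA = Path(os.environ.get("DATA", "data"))             # Datasets.
MODELS = Path(os.environ.get("MODELS", "/tmp/models"))  # Trained models.
CACHED = Path(os.environ.get("CACHED", "/tmp/cached"))  # Cached experimental data.

# Make sure the output directories exist before anything tries to write to them.
for path in (MODELS, CACHED):
    path.mkdir(parents=True, exist_ok=True)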

A script for downloading the appropriate datasets into $DATA will be provided upon deanonymization.

Replication

All compute-intensive scripts support CUDA. Refer to each script's arguments to control GPU usage (if a script has no GPU flags, it will generally use an available GPU by default).

T5 norm analysis

To replicate this analysis, you will need access to the T5 training data; contact the authors if you would like access.

python t5_main.py  # Generate the cached data.
python t5_norm_regression.py  # Plot the data.
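
The regression itself is simple to illustrate: the paper studies how the overall parameter norm grows with training step t, roughly like sqrt(t). The following sketch fits that trend on synthetic data (the actual script works from the cached T5 checkpoint norms, and its exact regression setup may differ):

import numpy as np

# Synthetic stand-in for cached checkpoint norms: norm(t) ≈ a·sqrt(t) + noise.
rng = np.random.default_rng(0)
steps = np.arange(1, 1001, dtype=float)
norms = 3.0 * np.sqrt(steps) + rng.normal(0.0, 1.0, steps.shape)

# Least-squares fit of norm against sqrt(t).
x = np.sqrt(steps)
slope, intercept = np.polyfit(x, norms, deg=1)
pred = slope * x + intercept
r2 = 1.0 - ((norms - pred) ** 2).sum() / ((norms - norms.mean()) ** 2).sum()
print(f"norm ≈ {slope:.2f}·sqrt(t) + {intercept:.2f}  (R² = {r2:.3f})")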

Train language models

To train transformers on Wikitext-2:

python finetune_trans.py --trans=control  # (Post-norm)
python finetune_trans.py --trans=pre_norm  # (Pre-norm)

For PTB, we specify a different dataset and a shorter max sequence length:

python finetune_trans.py --trans=control --data=penn --seq_len=83
python finetune_trans.py --trans=pre_norm --data=penn --seq_len=83
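
After training, one quick sanity check on norm growth is to compute a checkpoint's overall L2 parameter norm. A minimal sketch (the checkpoint path and state-dict layout are hypothetical; adapt them to whatever finetune_trans.py writes under $MODELS):

import torch

# Hypothetical checkpoint path; replace with a model saved by finetune_trans.py.
state_dict = torch.load("/tmp/models/control.pt", map_location="cpu")

# Overall L2 norm across all tensors in the state dict.
total_sq = sum(p.float().pow(2).sum() for p in state_dict.values() if torch.is_tensor(p))
print("parameter norm:", total_sq.sqrt().item())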

The perplexity of a pretrained model can be evaluated with:

python test_ppl.py --trans=control

By default, this command will look for a pretrained model that has been saved by finetune_trans.py with the same command line arguments.
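
Perplexity here is just the exponentiated mean token-level cross-entropy. A self-contained sketch of that computation (with random stand-in logits, not the repository's evaluation loop):

import torch
import torch.nn.functional as F

# Dummy logits and targets: (batch, sequence, vocab) and (batch, sequence).
logits = torch.randn(4, 10, 100)
targets = torch.randint(0, 100, (4, 10))

# Perplexity = exp(mean cross-entropy over tokens).
loss = F.cross_entropy(logits.reshape(-1, 100), targets.reshape(-1))
print("perplexity:", loss.exp().item())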

LR vs. WD grid experiments

To generate the grid visualizing norm growth as a function of LR and WD, run:

python grid.py --data=penn --seq_len=83
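
The idea behind the grid is straightforward: sweep learning rate and weight decay, train, and record how much the parameter norm grows in each cell. A toy sketch of those mechanics (a small linear model with an arbitrary regression objective, not the repository's grid.py):

import itertools
import torch

def final_norm(lr, wd, steps=200):
    # Train a toy model for a few SGD steps and return its final parameter norm.
    torch.manual_seed(0)
    model = torch.nn.Linear(10, 10, bias=False)
    opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=wd)
    x, y = torch.randn(64, 10), torch.randn(64, 10)
    for _ in range(steps):
        loss = (model(x) - y).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sum(p.pow(2).sum() for p in model.parameters()).sqrt().item()

for lr, wd in itertools.product([1e-3, 1e-2, 1e-1], [0.0, 1e-2, 1e-1]):
    print(f"lr={lr:g}  wd={wd:g}  ->  final norm = {final_norm(lr, wd):.3f}")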

Saturation in pretrained transformers

First, you need to get the Brown corpus data:

wget http://www.sls.hawaii.edu/bley-vroman/brown.txt -O $DATA/brown.txt

Then, you can run the script as follows:

python eval_pretrain_sat_brown.py
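
Saturation refers to a network's behavior when its parameters are scaled by a large constant c (as c grows, attention approaches hard argmax attention). The script evaluates this for pretrained transformers on Brown sentences; as a rough illustration of the underlying comparison, here is a toy version on a randomly initialized encoder layer (not the script's actual procedure):

import copy
import torch
import torch.nn.functional as F

torch.manual_seed(0)
layer = torch.nn.TransformerEncoderLayer(d_model=16, nhead=4)
layer.eval()  # Disable dropout so the comparison is deterministic.

# "Saturate" a copy of the layer by scaling every parameter by a large constant.
saturated = copy.deepcopy(layer)
c = 1000.0
with torch.no_grad():
    for p in saturated.parameters():
        p.mul_(c)

# Compare representations of the original and saturated layers on random input.
x = torch.randn(5, 2, 16)  # (sequence, batch, d_model)
with torch.no_grad():
    sim = F.cosine_similarity(layer(x), saturated(x), dim=-1).mean()
print("mean cosine similarity to saturated layer:", sim.item())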

Saturated attention heads

You can re-plot the histograms for attention heads using:

python saturate.py
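
For intuition, a saturated attention head concentrates nearly all of its probability mass on its argmax positions. The sketch below computes a simple proxy for this, the maximum attention weight per query, and plots its histogram for random (untrained) heads; the repository's saturate.py works from actual model weights and may use a different statistic:

import torch
import matplotlib.pyplot as plt

torch.manual_seed(0)
n_heads, seq_len, d_head = 8, 32, 16
q = torch.randn(n_heads, seq_len, d_head)
k = torch.randn(n_heads, seq_len, d_head)

# Scaled dot-product attention weights: softmax(QKᵀ / sqrt(d)).
attn = torch.softmax(q @ k.transpose(-1, -2) / d_head ** 0.5, dim=-1)

# Peak weight of each attention row: values near 1.0 indicate nearly hard (saturated) attention.
max_weights = attn.max(dim=-1).values.flatten()
plt.hist(max_weights.numpy(), bins=20)
plt.xlabel("max attention weight")
plt.ylabel("count")
plt.savefig("attention_head_histogram.png")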

Transformer homogeneity curves

The homogeneity-curve figure in the appendix can be generated by running:

python plot_trans_scale.py
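
Homogeneity is the property that scaling all parameters by c scales the output by c^k. As a reference point, the sketch below verifies exact 2-homogeneity for a bias-free two-layer ReLU network; the plotting script instead measures this relationship for transformers, which are only approximately homogeneous:

import copy
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(8, 8, bias=False), torch.nn.ReLU(),
    torch.nn.Linear(8, 8, bias=False),
)
x = torch.randn(4, 8)

with torch.no_grad():
    base_norm = net(x).norm()
    for c in [1.0, 2.0, 4.0]:
        scaled = copy.deepcopy(net)
        for p in scaled.parameters():
            p.mul_(c)  # Scale every parameter by c.
        ratio = (scaled(x).norm() / base_norm).item()
        print(f"c={c:g}: ||f(x; c·θ)|| / ||f(x; θ)|| = {ratio:.2f}  (expected c² = {c**2:g})")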
