viking-sudo-rm/norm-growth

Code for the transformer norm growth paper:

https://arxiv.org/abs/2010.09697

Setup

We recommend creating a virtual environment, and then running:

pip3 install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt

The code also uses several environment variables to specify paths for training data and locations for saving models and cached data. You need to set these before anything will run. A basic way to do this would be:

export DATA=data  # Path for finding datasets.
export MODELS=/tmp/models  # Path for saving trained models.
export CACHED=/tmp/cached  # Path for saving cached experimental data.
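
For reference, here is a minimal sketch of how a script might resolve these variables (illustrative only; the defaults mirror the exports above, and the repository's actual lookup code may differ):

import os
from pathlib import Path

# Illustrative only: resolve the three path variables the scripts rely on.
DATA = Path(os.environ.get("DATA", "data"))             # Datasets.
MODELS = Path(os.environ.get("MODELS", "/tmp/models"))  # Trained models.
CACHED = Path(os.environ.get("CACHED", "/tmp/cached"))  # Cached experimental data.

# Make sure the output directories exist before anything tries to write to them.
for path in (MODELS, CACHED):
    path.mkdir(parents=True, exist_ok=True)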

A script for downloading the appropriate datasets into $DATA will be provided upon deanonymization.

Replication

All compute-intensive scripts support CUDA. Refer to each script's arguments to control GPU usage (if a script has no GPU flags, it will generally use an available GPU by default).

T5 norm analysis

To replicate this analysis, you will need access to the T5 training data; contact the authors if you would like access.

python t5_main.py  # Generate the cached data.
python t5_norm_regression.py  # Plot the data.
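
The regression itself is simple to illustrate: the paper studies how the overall parameter norm grows with training step t, roughly like sqrt(t). The following sketch fits that trend on synthetic data (the actual script works from the cached T5 checkpoint norms, and its exact regression setup may differ):

import numpy as np

# Synthetic stand-in for cached checkpoint norms: norm(t) ≈ a·sqrt(t) + noise.
rng = np.random.default_rng(0)
steps = np.arange(1, 1001, dtype=float)
norms = 3.0 * np.sqrt(steps) + rng.normal(0.0, 1.0, steps.shape)

# Least-squares fit of norm against sqrt(t).
x = np.sqrt(steps)
slope, intercept = np.polyfit(x, norms, deg=1)
pred = slope * x + intercept
r2 = 1.0 - ((norms - pred) ** 2).sum() / ((norms - norms.mean()) ** 2).sum()
print(f"norm ≈ {slope:.2f}·sqrt(t) + {intercept:.2f}  (R² = {r2:.3f})")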

Train language models

To train transformers on Wikitext-2:

python finetune_trans.py --trans=control  # (Post-norm)
python finetune_trans.py --trans=pre_norm  # (Pre-norm)

For PTB, we specify a different dataset and a shorter max sequence length:

python finetune_trans.py --trans=control --data=penn --seq_len=83
python finetune_trans.py --trans=pre_norm --data=penn --seq_len=83
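
After training, one quick sanity check on norm growth is to compute a checkpoint's overall L2 parameter norm. A minimal sketch (the checkpoint path and state-dict layout are hypothetical; adapt them to whatever finetune_trans.py writes under $MODELS):

import torch

# Hypothetical checkpoint path; replace with a model saved by finetune_trans.py.
state_dict = torch.load("/tmp/models/control.pt", map_location="cpu")

# Overall L2 norm across all tensors in the state dict.
total_sq = sum(p.float().pow(2).sum() for p in state_dict.values() if torch.is_tensor(p))
print("parameter norm:", total_sq.sqrt().item())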

The perplexity of a pretrained model can be evaluated with:

python test_ppl.py --trans=control

By default, this command will look for a pretrained model that has been saved by finetune_trans.py with the same command line arguments.
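
Perplexity here is just the exponentiated mean token-level cross-entropy. A self-contained sketch of that computation (with random stand-in logits, not the repository's evaluation loop):

import torch
import torch.nn.functional as F

# Dummy logits and targets: (batch, sequence, vocab) and (batch, sequence).
logits = torch.randn(4, 10, 100)
targets = torch.randint(0, 100, (4, 10))

# Perplexity = exp(mean cross-entropy over tokens).
loss = F.cross_entropy(logits.reshape(-1, 100), targets.reshape(-1))
print("perplexity:", loss.exp().item())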

LR vs. WD grid experiments

To generate the grid visualizing norm growth as a function of LR and WD, run:

python grid.py --data=penn --seq_len=83
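
The idea behind the grid is straightforward: sweep learning rate and weight decay, train, and record how much the parameter norm grows in each cell. A toy sketch of those mechanics (a small linear model with an arbitrary regression objective, not the repository's grid.py):

import itertools
import torch

def final_norm(lr, wd, steps=200):
    # Train a toy model for a few SGD steps and return its final parameter norm.
    torch.manual_seed(0)
    model = torch.nn.Linear(10, 10, bias=False)
    opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=wd)
    x, y = torch.randn(64, 10), torch.randn(64, 10)
    for _ in range(steps):
        loss = (model(x) - y).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sum(p.pow(2).sum() for p in model.parameters()).sqrt().item()

for lr, wd in itertools.product([1e-3, 1e-2, 1e-1], [0.0, 1e-2, 1e-1]):
    print(f"lr={lr:g}  wd={wd:g}  ->  final norm = {final_norm(lr, wd):.3f}")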

Saturation in pretrained transformers

First, you need to get the Brown corpus data:

wget http://www.sls.hawaii.edu/bley-vroman/brown.txt -O $DATA/brown.txt

Then, you can run the script as follows:

python eval_pretrain_sat_brown.py
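
Saturation refers to a network's behavior when its parameters are scaled by a large constant c (as c grows, attention approaches hard argmax attention). The script evaluates this for pretrained transformers on Brown sentences; as a rough illustration of the underlying comparison, here is a toy version on a randomly initialized encoder layer (not the script's actual procedure):

import copy
import torch
import torch.nn.functional as F

torch.manual_seed(0)
layer = torch.nn.TransformerEncoderLayer(d_model=16, nhead=4)
layer.eval()  # Disable dropout so the comparison is deterministic.

# "Saturate" a copy of the layer by scaling every parameter by a large constant.
saturated = copy.deepcopy(layer)
c = 1000.0
with torch.no_grad():
    for p in saturated.parameters():
        p.mul_(c)

# Compare representations of the original and saturated layers on random input.
x = torch.randn(5, 2, 16)  # (sequence, batch, d_model)
with torch.no_grad():
    sim = F.cosine_similarity(layer(x), saturated(x), dim=-1).mean()
print("mean cosine similarity to saturated layer:", sim.item())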

Saturated attention heads

You can re-plot the histograms for attention heads using:

python saturate.py
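
For intuition, a saturated attention head concentrates nearly all of its probability mass on its argmax positions. The sketch below computes a simple proxy for this, the maximum attention weight per query, and plots its histogram for random (untrained) heads; the repository's saturate.py works from actual model weights and may use a different statistic:

import torch
import matplotlib.pyplot as plt

torch.manual_seed(0)
n_heads, seq_len, d_head = 8, 32, 16
q = torch.randn(n_heads, seq_len, d_head)
k = torch.randn(n_heads, seq_len, d_head)

# Scaled dot-product attention weights: softmax(QKᵀ / sqrt(d)).
attn = torch.softmax(q @ k.transpose(-1, -2) / d_head ** 0.5, dim=-1)

# Peak weight of each attention row: values near 1.0 indicate nearly hard (saturated) attention.
max_weights = attn.max(dim=-1).values.flatten()
plt.hist(max_weights.numpy(), bins=20)
plt.xlabel("max attention weight")
plt.ylabel("count")
plt.savefig("attention_head_histogram.png")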

Transformer homogeneity curves

The homogeneity-curve figure in the appendix can be generated by running:

python plot_trans_scale.py
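
Homogeneity is the property that scaling all parameters by c scales the output by c^k. As a reference point, the sketch below verifies exact 2-homogeneity for a bias-free two-layer ReLU network; the plotting script instead measures this relationship for transformers, which are only approximately homogeneous:

import copy
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(8, 8, bias=False), torch.nn.ReLU(),
    torch.nn.Linear(8, 8, bias=False),
)
x = torch.randn(4, 8)

with torch.no_grad():
    base_norm = net(x).norm()
    for c in [1.0, 2.0, 4.0]:
        scaled = copy.deepcopy(net)
        for p in scaled.parameters():
            p.mul_(c)  # Scale every parameter by c.
        ratio = (scaled(x).norm() / base_norm).item()
        print(f"c={c:g}: ||f(x; c·θ)|| / ||f(x; θ)|| = {ratio:.2f}  (expected c² = {c**2:g})")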
