
Activation Compression with Guarantees

We explore how to fine-tune language models over slow networks using activation compression with guarantees. This is a research project developed by DS3Lab@ETH Zurich and HazyResearch@Stanford.

Cite Our Paper

@article{jue2022fine,
  title={Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees}, 
  author={Jue Wang and Binhang Yuan and Luka Rimanic and Yongjun He and Tri Dao and Beidi Chen and Christopher Re and Ce Zhang},
  year={2022},
  eprint={2206.01299},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}

Setup:

  • Create environment:

    conda create -n acsgd python=3.8
    conda activate acsgd
  • Install PyTorch and CuPy:

    pip3 install torch==1.9.0+cu111 torchtext -f https://download.pytorch.org/whl/torch_stable.html
    
    # Note: cupy-cuda111 does not seem to work here; it appears to use a different PTX version than this torch build, so install cupy-cuda110 instead.
    pip3 install cupy-cuda110==8.6.0

    Other dependencies:

    pip3 install datasets==2.2.2
    pip3 install transformers==4.19.2
    pip3 install sentencepiece==0.1.96 # required by deberta
  • Set up the network configuration (point these at the network interface used for inter-node communication):

    export GLOO_SOCKET_IFNAME=ens3
    export NCCL_SOCKET_IFNAME=ens3
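
The interface name ("ens3" above) is machine-specific; each node should use the interface that carries inter-node traffic. A minimal standard-library sketch (not part of this repo) for listing a node's interfaces:

    import socket

    # Prints (index, name) pairs for this node's network interfaces,
    # e.g. [(1, 'lo'), (2, 'ens3')]; pick the one used for inter-node traffic.
    print(socket.if_nameindex())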

Run Distributed GPipe:

  • Partition the pre-trained model:

    # gpt2
    python convert_gpt2_checkpoint.py --model-name gpt2-xl --save-dir checkpoints/
        
    # or deberta 
    python convert_deberta_checkpoint.py --model-name deberta-v2-xxl --save-dir checkpoints/
  • On each node, run:

    # gpt2
    python dist_lm_runner.py $(echo ${ARGS}) --cuda-id 0 --rank i # (i=0,...,N-1)
        
    # or deberta
    python dist_deberta_runner.py $(echo ${ARGS}) --cuda-id 0 --rank i # (i=0,...,N-1)

    where "ARGS" contains training-related configurations, which should remain the same across nodes. An example could be:

    ARGS="--model-name checkpoints/gpt2-xl \
      --tokenizer-name gpt2-xl \
      --load-pretrained-model true \
      --task-name wikitext --n-epochs 10 --warmup-epochs 1 \
      --num-layers 6 --num-heads 25 --embedding-dim 1600 \
      --num-iters 10000000 --lr 5e-5 --seq-length 1024 --batch-size 32 --micro-batch-size 1 \
      --forward-compress-method delta \
      --forward-bits 4 \
      --backward-compress-method fixpoint \
      --backward-bits 8 \
      --dist-url tcp://XXX.XXX.XXX.XXX:9000 \
      --world-size N --pipeline-group-size N \
      --pp-mode gpipe --profiling no-profiling --do-evaluation true"

    Modify "--dist-url", "--world-size" and "--pipeline-group-size" before running.

    Complete examples can be found in "./run_lm.sh" and "./run_deberta.sh".
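
The only per-node differences are "--rank" (and, if needed, "--cuda-id"); ARGS itself stays identical on every node. A small illustrative sketch of how the command for node i is assembled (a hypothetical helper, not part of this repo; ARGS is abbreviated here):

    import shlex

    N = 8  # assumption: number of participating nodes (world size)
    ARGS = (
        "--model-name checkpoints/gpt2-xl --tokenizer-name gpt2-xl "
        "--load-pretrained-model true --task-name wikitext "
        "--dist-url tcp://XXX.XXX.XXX.XXX:9000 "
        f"--world-size {N} --pipeline-group-size {N} "
        "--pp-mode gpipe --profiling no-profiling --do-evaluation true"
    )  # abbreviated; use the full ARGS from the example above in practice

    def build_command(rank):
        """Command line to execute on the node that runs pipeline stage `rank`."""
        return ["python", "dist_lm_runner.py", *shlex.split(ARGS),
                "--cuda-id", "0", "--rank", str(rank)]

    # On node i (i = 0, ..., N-1) you would run build_command(i), e.g.:
    print(" ".join(build_command(0)))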

Arguments

Distributed Related

  • "--dist-url": tcp://XXX.XXX.XXX.XXX:9000
  • "--world-size": number of nodes that participate in the training.
  • "--pipeline-group-size": number of nodes that perform pipeline parallelism.
  • "--data-group-size": number of nodes that perform data parallelism.
  • "--rank": the rank of the current node. (0, ..., world_size-1)
  • "--profiling": "no-profiling" or "tidy_profiling". If "tidy_profiling", a trace file will be generated in "./trace_json/", which can be visualized with "chrome://tracing/".

Compression Related

  • "--forward-compress-method": "none", "fixpoint", "delta", or "delta-lowbits".
    • "none": do not compress.
    • "fixpoint": direct compress the activations. need to specify `"--forward-bits".
    • "delta": compress and communicate the delta of activations. need to specify "--forward-bits" and "--max-activation-cache-size".
    • "delta-lowbits": in addition to "delta", it also compresses the local cache (previous activations). need to specify "--forward-bits", "--forward-bits-act", and "--max-activation-cache-size".
  • "--backward-compress-method": "none" or "fixpoint".
    • "none": do not compress.
    • "fixpoint": direct compress the gradients. need to specify "--backward-bits".

Training Related

  • "--batch-size": macro-batch size.
  • "--micro-batch-size ": micro-batch-size. The macro-batch size should be divisible by micro-batch-size.
  • "--lr": the peak learning rate.
  • "--n-epochs": number of training epochs.
  • "--warmup-epochs": number of epochs for uncompressed training (transfer full-precision activations and gradients).
  • "--warmup-steps": number of training steps where the learning rate grows from 0 to "--lr". Default to be one training epoch.
  • "--do-evaluation": whether do evaluation during training.

Model Related

  • "--model-name": Name or path of the pretrained checkpoint. Usually should be a path to the checkpoint generated by "convert_xxx_checkpoint.py".
  • "--tokenizer-name": Name or path of the tokenizer.
  • "--load-pretrained-model": whether to load the pretrained checkpoint. The checkpoint should be generated by "convert_xxx_checkpoint.py".
  • "--num-layers", "--num-heads", "--embedding-dim" should be inline with the configuration of "--model-name".
