Selecting Training Hyper-Parameters And Model Initializations

Glossary

Training jargon uses a multitude of abbreviations and terms, so here are the ones important for this chapter.

  • BS: Batch Size - here we mean the batch size per GPU; it is often also referred to as MBS (micro-batch size)
  • GBS: Global Batch Size - the total batch size per iteration across all GPUs; may include gradient accumulation (see the sketch after this list)
  • GAS: Gradient Accumulation Steps - how many forward/backward passes are performed before one full iteration (optimizer step) is complete
  • TFLOPS: Trillion Floating-point Operations per second - a measure of compute throughput
  • PP: Pipeline Parallelism
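
To make the relationship between these terms concrete, here is a minimal sketch of how the global batch size is typically derived from the per-GPU micro-batch size; the variable names are illustrative and not tied to any particular framework.

```python
# Illustrative example of how GBS relates to MBS, GAS and the data-parallel degree.
# The variable names are made up for this sketch.

mbs = 2          # micro-batch size per GPU (BS/MBS)
gas = 16         # gradient accumulation steps
dp_degree = 64   # number of data-parallel replicas

# One full iteration processes GAS micro-batches on each of the DP replicas,
# so the global batch size is:
gbs = mbs * gas * dp_degree
print(gbs)  # 2048 samples per optimizer step
```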

Global Batch Size Ramp Up

If you intend to train with a very large GBS, say 1024 or 2048 samples or even higher, it's very wasteful to feed such large batches to the model when training has just started. At that point the model is essentially random and can't benefit from seeing that much data per update. Therefore, to save data and resources, one often ramps up the global batch size over some period of time.

It's also important not to start with a GBS that is too small, since otherwise progress won't be efficient. With too little data per iteration the compute throughput (TFLOPS) drops and everything slows down. This is especially true when Pipeline Parallelism (PP) is used, since the key to a good PP tune-up is keeping the GPU-idleness bubble small, and the smaller the GBS the larger the bubble becomes.
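
As a rough illustration of why this matters, a commonly cited estimate for GPipe-style schedules puts the idle fraction of an iteration at (p - 1) / (m + p - 1), where p is the number of pipeline stages and m is the number of micro-batches per iteration (which grows with GBS). The sketch below just evaluates that textbook estimate for hypothetical numbers; it is not a measurement from any specific setup.

```python
# Rough textbook estimate of the pipeline "bubble" for a GPipe-style schedule:
# the GPUs sit idle for roughly (p - 1) / (m + p - 1) of each iteration, where
# p = number of pipeline stages and m = micro-batches per iteration.
# More micro-batches (i.e. a larger GBS) => a smaller bubble.

def bubble_fraction(pp_stages: int, micro_batches: int) -> float:
    return (pp_stages - 1) / (micro_batches + pp_stages - 1)

print(bubble_fraction(pp_stages=12, micro_batches=8))    # ~0.58 -> over half the time is idle
print(bubble_fraction(pp_stages=12, micro_batches=256))  # ~0.04 -> bubble is negligible
```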

For example, for BLOOM-176B, where we did use PP, throughput benchmarking showed that starting with GBS=16 was incredibly slow (8 TFLOPS), so we started with GBS=192 (73 TFLOPS) instead and then ramped up to GBS=2048 (150 TFLOPS), increasing GBS by 16 every 9_765_625 samples.
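
Below is a minimal sketch of what such a linear ramp-up schedule might look like, using the numbers from the paragraph above (start at 192, add 16 every 9_765_625 consumed samples, cap at 2048); the function itself is illustrative and not taken from any particular training framework.

```python
# Illustrative GBS ramp-up schedule, loosely modeled on the BLOOM-176B numbers above:
# start at 192 samples, add 16 every 9_765_625 consumed samples, cap at 2048.
# The function name and signature are made up for this sketch.

def global_batch_size(consumed_samples: int,
                      start_gbs: int = 192,
                      max_gbs: int = 2048,
                      increment: int = 16,
                      samples_per_increment: int = 9_765_625) -> int:
    ramp_steps = consumed_samples // samples_per_increment
    return min(start_gbs + ramp_steps * increment, max_gbs)

print(global_batch_size(0))              # 192
print(global_batch_size(100_000_000))    # 352  (10 increments so far)
print(global_batch_size(2_000_000_000))  # 2048 (fully ramped up)
```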

STD Init

This hyperparameter is extremely important and requires math to get right. For details see STD Init.
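
As a quick illustration of the kind of math involved, the snippet below evaluates one initialization formula that has appeared in large-transformer training recipes, std = sqrt(2 / (5 * hidden_size)); treat both the formula and the hidden_size value as assumptions for this sketch, and consult the STD Init chapter for the actual recommendation.

```python
import math

# Illustrative only: one init std formula used in some large-transformer recipes is
# sqrt(2 / (5 * hidden_size)). The hidden_size below is a hypothetical example;
# see the STD Init chapter referenced above for the actual recommendation.

hidden_size = 14336
init_std = math.sqrt(2 / (5 * hidden_size))
print(f"{init_std:.5f}")  # ~0.00528
```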