
LAMBrrr

This package provides an efficient implementation of the LAMB optimizer for PyTorch 2. The optimizer is horizontally fused using torch._foreach ops and vertically fused using torch.compile and Triton. This improves performance over existing implementations, especially for models with a large number of parameter tensors: it is roughly 10% faster than existing PyTorch implementations, but still slower than Nvidia apex's FusedLAMB [4].
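To illustrate the idea, here is a minimal sketch under simplified assumptions (not this package's actual kernels): a torch._foreach op applies one update across all parameter tensors at once (horizontal fusion), and wrapping the step in torch.compile lets the remaining elementwise work be fused into Triton kernels on GPU (vertical fusion).

import torch

# Minimal sketch of the fusion idea (not this package's actual code):
# torch._foreach_* ops fuse the update horizontally across all parameter
# tensors, and torch.compile fuses the remaining elementwise ops vertically.
@torch.compile
def fused_sgd_like_step(params, grads, lr):
    # a single call updates every parameter tensor in the list
    torch._foreach_add_(params, grads, alpha=-lr)

device = "cuda" if torch.cuda.is_available() else "cpu"
params = [torch.randn(1024, device=device) for _ in range(8)]
grads = [torch.randn_like(p) for p in params]
fused_sgd_like_step(params, grads, lr=1e-3)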

The optimizer was tested on an Nvidia Turing GPU, but it should run on all devices supported by Triton.

Note: the formulation of the LAMB optimizer changed across versions of the original paper. This package implements the latest formulation at the time of writing (v5) [1].
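For orientation, the core LAMB update is sketched below; details such as bias correction are among the parts that varied between paper versions, so see [1] for the exact v5 formulation. Per layer (parameter tensor) i:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
r_t = \frac{m_t}{\sqrt{v_t} + \epsilon}
x_{t+1}^{(i)} = x_t^{(i)} - \eta_t \, \frac{\phi(\|x_t^{(i)}\|)}{\|r_t^{(i)} + \lambda x_t^{(i)}\|} \left( r_t^{(i)} + \lambda x_t^{(i)} \right)

where \phi is a scaling function (commonly the identity) and \lambda is the weight decay coefficient. The per-layer trust ratio \phi(\|x\|) / \|r + \lambda x\| is what distinguishes LAMB from Adam.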

Why LAMB?

LAMB enables models to be trained with larger batch sizes than Adam. This is useful, among other purposes, for large models to better exploit 3D/pipeline parallelism, or for contrastive models.

Install

Install using pip:

pip install git+https://github.com/nopperl/pytorch-fused-lamb

The package requires torch>=2.2.0 and triton>=2.21, which need to be installed separately, e.g. using pip:

pip install torch triton

Usage

Import the optimizer class:

from pytorch_fused_lamb import Lamb

The Lamb optimizer can be used like any other torch.optim.Optimizer. For a PyTorch model:

-from torch.optim import Adam
+from pytorch_fused_lamb import Lamb
from torch import nn

dataset = ...
model = ...
-optimizer = Adam(model.parameters(), lr=1e-3)
+optimizer = Lamb(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
for (x, y) in dataset:
    optimizer.zero_grad()
    outputs = model(x)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()

HuggingFace

-from torch.optim import Adam
+from pytorch_fused_lamb import Lamb
from torch import nn
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

train_dataset = ...
eval_dataset = ...

-optimizer = Adam(model.parameters(), lr=1e-3)
+optimizer = Lamb(model.parameters(), lr=1e-3)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    optimizers=(optimizer, None)
)

# Start training
trainer.train()

Benchmark

Tested with torch==2.4.0a0+git1346ebf, triton==3.0.0 and CUDA 11.8.

A comparison with a reference implementation [3] and the nvidia/apex FusedLAMB optimizer [4] (CUDA only) can be found in benchmark.ipynb.

Results:

  optimizer  |  runtime
  -----------|---------
  reference  |   26.3
  apex       |   10.4
  compiled   |   23.6

Times are in milliseconds (ms), measured with 1 thread.
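A step-time comparison like the one above can be set up with torch.utils.benchmark. The sketch below is illustrative only; the model size and optimizer settings are assumptions, and the actual setup lives in benchmark.ipynb.

import torch
from torch import nn
from torch.utils import benchmark

from pytorch_fused_lamb import Lamb

# Illustrative step-time measurement; the real benchmark is in benchmark.ipynb.
device = "cuda" if torch.cuda.is_available() else "cpu"
# Many small parameter tensors: the case where horizontal fusion helps most.
model = nn.Sequential(*[nn.Linear(512, 512) for _ in range(32)]).to(device)
model(torch.randn(8, 512, device=device)).sum().backward()  # populate .grad

optimizer = Lamb(model.parameters(), lr=1e-3)
timer = benchmark.Timer(
    stmt="optimizer.step()",
    globals={"optimizer": optimizer},
    label="LAMB optimizer step",
)
print(timer.timeit(100))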

TODO

  • selectively exclude layers from lr adaptation (exclude_from_layer_adaptation)
  • weight norm clipping
  • grad norm clipping
  • trust ratio clipping (LAMBC)
  • compare convergence with Adam and LAMB with 4x batch size (requires more GPU resources)
  • AdamW update
  • Adagrad update

References

  1. Original paper (version 5): https://arxiv.org/abs/1904.00962v5
  2. Official TensorFlow implementation: https://github.com/tensorflow/addons/blob/3380b3ccf906f20dbef49c05906e6b6dabf479cf/tensorflow_addons/optimizers/lamb.py
  3. PyTorch implementation used as reference: https://github.com/huggingface/pytorch-image-models/blob/59b3d86c1d69fe85ccb5bbdb2f1711eadae6e4a7/timm/optim/lamb.py
  4. Fused CUDA implementation of LAMB in nvidia/apex: https://github.com/NVIDIA/apex/blob/810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c/apex/optimizers/fused_lamb.py
  5. Official PyTorch implementation of Adam: https://github.com/pytorch/pytorch/blob/1ea2f1eaa19176b8b5b778bb317203dc7f4fd3dc/torch/optim/adam.py
  6. Benchmarking vertical fusion of torch optimizers: https://dev-discuss.pytorch.org/t/compiling-the-optimizer-with-pt2/1669

Other implementations