Releases: Lightning-AI/pytorch-lightning

Minor patch release v2.1.1

06 Nov 18:19

App

Fixed

  • Fixed the failing lightning CLI entry point (#18821)

Fabric

Changed

  • Calling a method other than forward that invokes submodules is now an error when the model is wrapped (e.g., with DDP) (#18819)

Fixed

  • Fixed false-positive warnings about method calls on the Fabric-wrapped module (#18819)
  • Refined the FSDP saving logic and error messaging when the path exists (#18884)
  • Fixed layer conversion under Fabric.init_module() context manager when using the BitsandbytesPrecision plugin (#18914)

PyTorch

Fixed

  • Fixed an issue when replacing an existing last.ckpt file with a symlink (#18793)
  • Fixed an issue where the BatchSizeFinder steps_per_trial parameter would end up defining how many validation batches to run during the entire training (#18394)
  • Fixed an issue saving the last.ckpt file when using ModelCheckpoint on a remote filesystem, and no logger is used (#18867)
  • Refined the FSDP saving logic and error messaging when the path exists (#18884)
  • Fixed an issue parsing the version from folders that don't include a version number in TensorBoardLogger and CSVLogger (#18897)

Contributors

@awaelchli, @Borda, @BoringDonut, @carmocca, @hiaoxui, @ioangatop, @nohalon, @rasbt, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Full Changelog: 2.1.0...2.1.1

Lightning 2.1: Train Bigger, Better, Faster

12 Oct 13:10
6f6c07d

Lightning AI is excited to announce the release of Lightning 2.1 ⚡ It is the culmination of work from 79 contributors who have contributed features, bug fixes, and documentation, totaling more than 750 commits since v2.0.

The theme of 2.1 is "bigger, better, faster": bigger, because training large multi-billion-parameter models has become even more efficient thanks to FSDP, efficient initialization, and sharded-checkpointing improvements; better, because it is easier than ever to scale models without substantial code changes or third-party packages; and faster, because it leverages the latest hardware features to speed up low-bit-precision training through new precision plugins such as bitsandbytes and Transformer Engine.
And of course, as the name implies, this release fully leverages the latest features in PyTorch 2.1 🎉

Highlights

Improvements To Large-Scale Training With FSDP

The FSDP strategy for training large billion-parameter models gets substantial improvements and new features in Lightning 2.1, both in Trainer and Fabric (in case you didn't know, Fabric is the latest addition to the Lightning family of tools to scale models without the boilerplate code).
FSDP is now more user-friendly to configure, has memory management and speed improvements, and we have a brand new end-to-end user guide with best practices (Trainer, Fabric).

Efficient Saving and Loading of Large Checkpoints

When training large billion-parameter models with FSDP, saving and resuming training, or even just loading model parameters for finetuning, can be challenging, as users are often plagued by out-of-memory errors and speed bottlenecks.

In 2.1, we made several improvements. Starting with saving checkpoints, we added support for distributed/sharded checkpoints, enabled through the state_dict_type setting in the strategy (#18364, #18358):

Trainer:

import lightning as L
from lightning.pytorch.strategies import FSDPStrategy

# Default used by the strategy
strategy = FSDPStrategy(state_dict_type="full")

# Enable saving distributed checkpoints
strategy = FSDPStrategy(state_dict_type="sharded")

trainer = L.Trainer(strategy=strategy, ...)

Fabric:

import lightning as L
from lightning.fabric.strategies import FSDPStrategy

# Saving distributed checkpoints is the default
strategy = FSDPStrategy(state_dict_type="sharded")

# Save consolidated (single file) checkpoints
strategy = FSDPStrategy(state_dict_type="full")

fabric = L.Fabric(strategy=strategy, ...)

Distributed checkpoints are the fastest and most memory-efficient way to save the state of very large models.
The distributed checkpoint format also makes it efficient to load these checkpoints back for resuming training in parallel, and it significantly reduces the impact on CPU memory usage. We've also introduced lazy loading for non-distributed checkpoints (#18150, #18379), which greatly reduces CPU memory usage when loading a consolidated (single-file) checkpoint (e.g., for finetuning). Learn more about these features in our FSDP guides (Trainer, Fabric).
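
For illustration, here is a minimal Fabric sketch of saving and resuming with sharded checkpoints; MyModel and the checkpoint path are placeholders, and with state_dict_type="sharded" the path passed to fabric.save() becomes a directory of per-rank shard files:

import torch
import lightning as L
from lightning.fabric.strategies import FSDPStrategy

fabric = L.Fabric(strategy=FSDPStrategy(state_dict_type="sharded"), devices=2)
fabric.launch()

model = fabric.setup(MyModel())  # MyModel stands in for your own nn.Module
optimizer = fabric.setup_optimizers(torch.optim.Adam(model.parameters()))

# Gather everything that should be checkpointed into a single dictionary
state = {"model": model, "optimizer": optimizer, "step": 0}

# Each rank writes its own shard into the "last.ckpt" directory
fabric.save("last.ckpt", state)

# The same state dictionary is restored in place when resuming
fabric.load("last.ckpt", state)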

Fast and Memory-Optimized Initialization

A major challenge that users face when working with large models such as LLMs is dealing with the extreme memory requirements. Even something as simple as instantiating a model becomes non-trivial if the model is so large that it won't fit on a single GPU or even a single machine. In Lightning 2.1, we are introducing empty-weights initialization through the Fabric.init_module() (#17462, #17627) and Trainer.init_module()/LightningModule.configure_model() (#18004, #18385) methods:

Trainer:

import lightning as L

class MyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        # Delay initialization of model to `configure_model()`

    def configure_model(self):
        # Model initialized in correct precision and weights on meta-device
        self.model = ...

    ...

model = MyModel()
trainer = L.Trainer(strategy="fsdp", ...)
trainer.fit(model)

Fabric:

import torch
import lightning as L

fabric = L.Fabric(strategy="fsdp", ...)

# Model initialized in correct precision and weights on meta-device
with fabric.init_module(empty_init=True):
    model = ...
    

# You can also initialize buffers and tensors directly on device and dtype
with fabric.init_tensor():
    model.mask.create()
    model.kv_cache.create()
    x = torch.randn(4, 128)

# Materialization and sharding of model happens inside here
model = fabric.setup(model)

Read more about this new feature and its other benefits in our docs (Trainer, Fabric).

User-Friendly Configuration

We made it super easy to configure the sharding- and activation-checkpointing policy when you want to auto-wrap particular layers of your model for advanced control (#18045, #18084).

  import lightning as L
  from lightning.pytorch.strategies import FSDPStrategy
- from torch.distributed.fsdp.wrap import ModuleWrapPolicy

- strategy = FSDPStrategy(auto_wrap_policy=ModuleWrapPolicy({MyTransformerBlock}))
+ strategy = FSDPStrategy(auto_wrap_policy={MyTransformerBlock})
  trainer = L.Trainer(strategy=strategy, ...)
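
The same set-style shorthand also covers activation checkpointing. A minimal sketch, assuming MyTransformerBlock is one of the layer classes in your model:

import lightning as L
from lightning.pytorch.strategies import FSDPStrategy

# Shard and activation-checkpoint every MyTransformerBlock submodule
strategy = FSDPStrategy(
    auto_wrap_policy={MyTransformerBlock},
    activation_checkpointing_policy={MyTransformerBlock},
)
trainer = L.Trainer(strategy=strategy, ...)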

Furthermore, the sharding strategy can now be conveniently set with a string value (#18087):

  import lightning as L
  from lightning.pytorch.strategies import FSDPStrategy
- from torch.distributed.fsdp.fully_sharded_data_parallel import ShardingStrategy

- strategy = FSDPStrategy(sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)
+ strategy = FSDPStrategy(sharding_strategy="SHARD_GRAD_OP")
  trainer = L.Trainer(strategy=strategy, ...)

You no longer need to remember the long PyTorch imports! Fabric supports all of the improvements shown above as well.
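
In Fabric, the equivalent configuration is a sketch like this (again treating MyTransformerBlock as a placeholder for one of your layer classes):

import lightning as L
from lightning.fabric.strategies import FSDPStrategy

strategy = FSDPStrategy(
    auto_wrap_policy={MyTransformerBlock},
    sharding_strategy="SHARD_GRAD_OP",
)
fabric = L.Fabric(strategy=strategy, ...)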

True Half-Precision

Lightning now supports true half-precision for training and inference with all built-in strategies (#18193, #18217, #18213, #18219). With this setting, the memory required to store the model weights is only half of what is normally needed when running with float32. In addition, you get the same speed benefits as mixed-precision training (precision="16-mixed"):

import lightning as L

# default
trainer = L.Trainer(precision="32-true")

# train with model weights in `torch.float16`
trainer = L.Trainer(precision="16-true")

# train with model weights in `torch.bfloat16`
# (if hardware supports it)
trainer = L.Trainer(precision="bf16-true")

The same settings are also available in Fabric! We recommend trying bfloat16 training (precision="bf16-true"), as it is often more numerically stable than regular 16-bit precision (`precisi...
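
As a minimal sketch, the same precision arguments apply when constructing Fabric:

import lightning as L

# default
fabric = L.Fabric(precision="32-true")

# train with model weights in `torch.float16`
fabric = L.Fabric(precision="16-true")

# train with model weights in `torch.bfloat16` (if hardware supports it)
fabric = L.Fabric(precision="bf16-true")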


Feature teaser

10 Oct 08:15
4ad8bdb
Pre-release

🐰

Hotfix for Conda package

28 Sep 18:47
2.0.9.post0

releasing 2.0.9.post0

Weekly patch release

14 Sep 19:22

App

Fixed

  • Replace LightningClient with import from lightning_cloud (#18544)

Fabric

Fixed

  • Fixed an issue causing the _FabricOptimizer.state to remain outdated after loading with load_state_dict (#18488)

PyTorch

Fixed

  • Fixed an issue where the LightningCLI would not prevent the user from setting the log_model parameter in WandbLogger (#18458)
  • Fixed the display of v_num in the progress bar when running with Trainer(fast_dev_run=True) (#18491)
  • Fixed UnboundLocalError when running with python -O (#18496)
  • Fixed visual glitch with the TQDM progress bar leaving the validation bar incomplete before switching back to the training display (#18503)
  • Fixed false positive warning about logging interval when running with Trainer(fast_dev_run=True) (#18550)

Contributors

@awaelchli, @Borda, @justusschock, @SebastianGer

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Weekly patch release

30 Aug 12:29

App

Changed

  • Change top folder (#18212)
  • Remove _handle_is_headless calls in app run loop (#18362)

Fixed

  • Refactored the path to root to prevent a circular import (#18357)

Fabric

Changed

  • On XLA, avoid setting the global rank before processes have been launched as this will initialize the PJRT computation client in the main process (#16966)

Fixed

  • Fixed model parameters getting shared between processes when running with strategy="ddp_spawn" and accelerator="cpu"; this has a necessary memory impact, as parameters are replicated for each process now (#18238)
  • Removed false positive warning when using fabric.no_backward_sync with XLA strategies (#17761)
  • Fixed issue where Fabric would not initialize the global rank, world size, and rank-zero-only rank after initialization and before launch (#16966)
  • Fixed FSDP full-precision param_dtype training (16-mixed, bf16-mixed and 32-true configurations) to avoid FSDP assertion errors with PyTorch < 2.0 (#18278)

PyTorch

Changed

  • On XLA, avoid setting the global rank before processes have been launched as this will initialize the PJRT computation client in the main process (#16966)
  • Fixed an inefficiency in the rich progress bar (#18369)

Fixed

  • Fixed FSDP full-precision param_dtype training (16-mixed and bf16-mixed configurations) to avoid FSDP assertion errors with PyTorch < 2.0 (#18278)
  • Fixed an issue that prevented the use of custom logger classes without an experiment property defined (#18093)
  • Fixed setting the tracking uri in MLFlowLogger for logging artifacts to the MLFlow server (#18395)
  • Fixed redundant iter() call to dataloader when checking dataloading configuration (#18415)
  • Fixed model parameters getting shared between processes when running with strategy="ddp_spawn" and accelerator="cpu"; this has a necessary memory impact, as parameters are replicated for each process now (#18238)
  • Properly manage fetcher.done with dataloader_iter (#18376)

Contributors

@awaelchli, @Borda, @carmocca, @quintenroets, @rlizzo, @speediedan, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Weekly patch release

16 Aug 07:30

App

Changed

  • Removed the top-level import lightning.pdb; import lightning.app.pdb instead (#18177)
  • Client retries forever (#18065)

Fixed

  • Fixed an issue that would prevent the user from setting the multiprocessing start method after importing lightning (#18177)

Fabric

Changed

  • Disabled the auto-detection of the Kubeflow environment (#18137)

Fixed

  • Fixed an issue where DDP subprocesses that used Hydra would set Hydra's working directory to the current directory (#18145)
  • Fixed an issue that would prevent the user from setting the multiprocessing start method after importing lightning (#18177)
  • Fixed an issue with Fabric.all_reduce() not performing an inplace operation for all backends consistently (#18235)

PyTorch

Added

  • Added LightningOptimizer.refresh() to update the __dict__ in case the optimizer it wraps has changed its internal state (#18280)

Changed

  • Disabled the auto-detection of the Kubeflow environment (#18137)

Fixed

  • Fixed a "Missing folder" exception when using a Google Storage URL as the default_root_dir (#18088)
  • Fixed an issue that would prevent the user from setting the multiprocessing start method after importing lightning (#18177)
  • Fixed the gradient unscaling logic if the training step skipped backward (by returning None) (#18267)
  • Ensured that the closure running inside the optimizer step has gradients enabled, even if the optimizer step has them disabled (#18268)
  • Fixed an issue that could cause the LightningOptimizer wrapper returned by LightningModule.optimizers() to have a different internal state than the optimizer it wraps (#18280)

Contributors

@0x404, @awaelchli, @bilelomrani1, @Borda, @ethanwharris, @nisheethlahoti

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Minor patch release

24 Jul 21:36

2.0.6

App

  • Fixed handling a None request in the file orchestration queue (#18111)

Fabric

  • Fixed TensorBoardLogger.log_graph not unwrapping the _FabricModule (#17844)

PyTorch

  • Fixed LightningCLI not correctly saving seed_everything when run=True and seed_everything=True (#18056)
  • Fixed validation of non-PyTorch LR schedulers in manual optimization mode (#18092)
  • Fixed an attribute error for _FaultTolerantMode when loading an old checkpoint that pickled the enum (#18094)

Contributors

@awaelchli, @lantiga, @mauvilsa, @shihaoyin

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Minor patch release

10 Jul 16:09

App

Added

  • plugin: store source app (#17892)
  • Added colocation identifier (#16796)
  • Added exponential backoff to HTTPQueue put (#18013)
  • Content for plugins (#17243)

Changed

  • Save a reference to created tasks, to avoid tasks disappearing (#17946)

Fabric

Added

  • Added validation against misconfigured device selection when using the DeepSpeed strategy (#17952)

Changed

  • Avoid info message when loading 0 entry point callbacks (#17990)

Fixed

  • Fixed the emission of a false-positive warning when calling a method on the Fabric-wrapped module that accepts no arguments (#17875)
  • Fixed check for FSDP's flat parameters in all parameter groups (#17914)
  • Fixed automatic step tracking in Fabric's CSVLogger (#17942)
  • Fixed an issue causing the torch.set_float32_matmul_precision info message to show multiple times (#17960)
  • Fixed loading model state when Fabric.load() is called after Fabric.setup() (#17997)

PyTorch

Fixed

  • Fixed delayed creation of experiment metadata and checkpoint/log dir name when using WandbLogger (#17818)
  • Fixed incorrect parsing of arguments when augmenting exception messages in DDP (#17948)
  • Fixed an issue causing the torch.set_float32_matmul_precision info message to show multiple times (#17960)
  • Added missing map_location argument for the LightningDataModule.load_from_checkpoint function (#17950)
  • Fixed support for neptune-client (#17939)

Contributors

@anio, @awaelchli, @Borda, @ethanwharris, @lantiga, @nicolai86, @rjarun8, @schmidt-ai, @schuhschuh, @wouterzwerink, @yurijmikhalevich

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Minor patch release

22 Jun 18:23

App

Fixed

  • Bumped several dependencies to address security vulnerabilities.

Fabric

Fixed

  • Fixed validation of parameters of plugins.precision.MixedPrecision (#17687)
  • Fixed an issue with HPU imports leading to performance degradation (#17788)

PyTorch

Changed

  • Changes to the NeptuneLogger (#16761):
    • It now supports neptune-client 0.16.16 and neptune >=1.0, and we have replaced the log() method with append() and extend().
    • It now accepts a namespace Handler as an alternative to Run for the run argument. This means that you can call NeptuneLogger(run=run["some/namespace"]) to log everything to the some/namespace/ location of the run, as sketched after this list.
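
A minimal sketch of the namespace-handler usage described above, assuming neptune >= 1.0; the project name and namespace are placeholders:

import neptune
from lightning.pytorch.loggers import NeptuneLogger

run = neptune.init_run(project="my-workspace/my-project")

# Everything the logger writes goes under the run's "finetuning/" namespace
neptune_logger = NeptuneLogger(run=run["finetuning"])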

Fixed

  • Fixed validation of parameters of plugins.precision.MixedPrecisionPlugin (#17687)
  • Fixed deriving default map location in LightningModule.load_from_checkpoint when there is an extra state (#17812)

Contributors

@akreuzer, @awaelchli, @Borda, @jerome-habana, @kshitij12345

If we forgot someone due to not matching commit email with GitHub account, let us know :]