PyTorch Distributed Load Updates or Returns state_dict
#125096
Labels
module: distributed_checkpoint
oncall: distributed
triaged
🚀 The feature, motivation and pitch
Torch distributed checkpoint `load_state_dict` (https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/state_dict_loader.py#L20) updates the passed-in `state_dict` (and returns it). This function is deprecated in torch 2.3 in favor of `load` (https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/state_dict_loader.py#L48), which neither returns nor updates the passed-in `state_dict`. Instead, it only calls `load_state_dict` on Stateful elements in the specified `state_dict`.
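
For illustration, a minimal sketch of the behavioral difference as described above (the checkpoint path and the `"rng"` entry are hypothetical, not taken from the issue):

```python
import torch
import torch.distributed.checkpoint as dcp

reader = dcp.FileSystemReader("checkpoint/")  # hypothetical checkpoint location
state_dict = {"rng": {"cpu_rng_state": torch.get_rng_state()}}  # plain, non-Stateful entry

# Deprecated API: updates the passed-in dict (and returns it), so the caller
# gets the restored values back in state_dict["rng"].
dcp.load_state_dict(state_dict, storage_reader=reader)

# New API: returns None and only invokes load_state_dict() on Stateful
# elements, so a plain sub-dict like "rng" is not handed back to the caller.
dcp.load(state_dict, storage_reader=reader)
```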
Unfortunately, this new API is greatly limiting. For example, in Composer's state_dict passed for checkpointing, we also store various RNG tensors in a dict for determinism. In order to use the new API, we have to rewrap everything in a Stateful class, which is a somewhat pointless abstraction. Instead, we prefer to receive a loaded `state_dict` and then manually call `load_state_dict` on appropriate sub-items.
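
A rough sketch of the kind of wrapper this forces (class and key names are illustrative, not Composer's actual code):

```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.stateful import Stateful


class RNGWrapper(Stateful):
    """Exists only to satisfy the Stateful protocol for a plain dict of RNG tensors."""

    def __init__(self):
        self.rng = {"cpu_rng_state": torch.get_rng_state()}

    def state_dict(self):
        return self.rng

    def load_state_dict(self, state_dict):
        self.rng = state_dict


state_dict = {"rng": RNGWrapper()}  # instead of just {"rng": {...}}
dcp.load(state_dict, storage_reader=dcp.FileSystemReader("checkpoint/"))
restored_rng = state_dict["rng"].rng  # loaded values are only reachable through the wrapper
```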
Can we modify `load` to update the passed-in `state_dict`? This would entail adding an update step after https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/state_dict_loader.py#L172-L177.
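
A minimal sketch of what such a write-back could look like, assuming `load` keeps building an internal flat dict of loaded values as the linked region does (the helper and variable names below are hypothetical, not the actual PyTorch internals):

```python
from torch.distributed.checkpoint.stateful import Stateful


def _write_back(state_dict, loaded_sd):
    """Hypothetical post-load step: propagate loaded values into the caller's dict.

    `loaded_sd` stands in for the flat dict that load() internally populates.
    """
    for key, loaded in loaded_sd.items():
        elem = state_dict[key]
        if isinstance(elem, Stateful):
            elem.load_state_dict(loaded)   # existing behavior for Stateful entries
        else:
            state_dict[key] = loaded       # proposed: also update plain entries
    return state_dict                      # optionally return it, matching the old API
```

With something along these lines, callers like Composer could keep plain dicts (e.g. RNG tensors) in the top-level state_dict and still receive the loaded values, without wrapping them in Stateful classes.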
Alternatives
No response
Additional context
No response
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC