Unable to continue training from checkpoint. #138

Open · brunosan opened this issue Jan 26, 2024 · 7 comments · May be fixed by #191

@brunosan (Member)

I am trying to run some more training loops for a specific region, using this notebook.

I was not happy with the clustering:
[Screenshot: clustering results]

So I wanted to run a few epochs only on my target area.

When I do so, with

!python trainer.py fit --trainer.max_epochs=100 \
                       --data.data_dir=data/chips \
                       --ckpt_path=data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt 

I get this error:

Seed set to 42
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[rank: 0] Seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Total number of chips: 1102
/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:639: Checkpoint directory /home/brunosan/code/Clay/model/checkpoints exists and is not empty.
Restoring states from the checkpoint path at data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type | Params
-------------------------------
0 | model | CLAY | 127 M 
-------------------------------
127 M     Trainable params
0         Non-trainable params
127 M     Total params
510.809   Total estimated model params size (MB)
Traceback (most recent call last):
  File "/home/brunosan/code/Clay/model/trainer.py", line 77, in <module>
    cli_main()
  File "/home/brunosan/code/Clay/model/trainer.py", line 64, in cli_main
    cli = LightningCLI(
          ^^^^^^^^^^^^^
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/cli.py", line 386, in __init__
    self._run_subcommand(self.subcommand)
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/cli.py", line 677, in _run_subcommand
    fn(**fn_kwargs)
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 980, in _run
    self._checkpoint_connector.restore_training_state()
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 296, in restore_training_state
    self.restore_optimizers_and_schedulers()
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 362, in restore_optimizers_and_schedulers
    raise KeyError(
KeyError: 'Trying to restore optimizer state but checkpoint contains only the model. This is probably due to `ModelCheckpoint.save_weights_only` being set to `True`.'
@yellowcap (Member)

Looks like we are indeed only saving the weights. Not sure if that means we cannot continue training or if there is a workaround. @weiji14 and @srmsoumya?

save_weights_only=True,

@weiji14 (Contributor)

weiji14 commented Feb 11, 2024

Yeah, we did not save the AdamW optimizer state, so it won't be possible to resume training from that checkpoint with AdamW or any other adaptive optimization algorithm. It might be possible to resume with a non-adaptive optimizer such as Stochastic Gradient Descent, but that would require a lot of manual handling of the checkpoint loading, so it is not a straightforward workaround.

That said, the original objective seems to be finetuning the checkpoint on a specific region rather than resuming the self-supervised training. The entry point for that shouldn't be trainer.py, but a separate finetuning script (which could technically still reuse elements of the MAE training loop).
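
As a rough illustration of that workaround, here is a minimal sketch of loading only the model weights and starting a fresh fit, so Lightning initializes a new optimizer instead of trying to restore the missing state. The class names, import paths, and hyperparameters below are assumptions for illustration, not the repo's confirmed API.

import torch
import lightning as L

from src.model_clay import CLAYModule        # assumed module path / class name
from src.datamodule import ClayDataModule    # assumed module path / class name

# A checkpoint saved with save_weights_only=True still contains "state_dict",
# just not the optimizer/scheduler states.
ckpt = torch.load(
    "data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt", map_location="cpu"
)
model = CLAYModule()  # assumed default hyperparameters
model.load_state_dict(ckpt["state_dict"])

datamodule = ClayDataModule(data_dir="data/chips")

# No ckpt_path is passed to fit(), so AdamW starts from a freshly
# initialized state rather than attempting to restore it.
trainer = L.Trainer(max_epochs=100, precision="bf16-mixed")
trainer.fit(model=model, datamodule=datamodule)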

@brunosan (Member, Author)

brunosan commented Mar 1, 2024

The main use case is to resume training if it halts (e.g. we were using Spot instances), but I can also see cases where a regional user might want to continue training with regional data.

If we choose not to save the optimizer state, we should document how to resume training with newly initialized optimizers.

brunosan added a commit that referenced this issue Mar 2, 2024
@yellowcap (Member)

I agree we should have a way to resume training for the checkpoints we save (or at least the last one), if that is technically possible and won't slow down training too much.

@yellowcap (Member)

We have addressed this for v0.2, and will do the same for v1, by storing the optimizer state during training. So I am closing this, but feel free to reopen if the issue persists in future versions of the model.

@brunosan (Member, Author)

Not saving the optimizer state still remains the default:

save_weights_only=True,
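
For reference, a hedged sketch of what a resumable checkpoint configuration could look like; the argument values are illustrative, not the repo's exact settings:

from lightning.pytorch.callbacks import ModelCheckpoint

# Illustrative configuration: save the full training state so that
# `trainer.py fit --ckpt_path=...` can also restore the optimizer.
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",
    monitor="val/loss",       # assumed metric name logged by the module
    save_top_k=1,
    save_last=True,           # always keep a resumable last.ckpt
    save_weights_only=False,  # include optimizer and LR scheduler state
)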

@srmsoumya (Collaborator)

Addressed in PR #193
