Unable to continue training from checkpoint. #138

Open · brunosan opened this issue Jan 26, 2024 · 7 comments · May be fixed by #191

@brunosan (Member)

I am trying to run some more training loops for a specific region, using this notebook.

I was not happy with the clustering:
[Screenshot: clustering results]

So I wanted to run a few epochs only on my target area.

When I do so, with

!python trainer.py fit --trainer.max_epochs=100 \
                       --data.data_dir=data/chips \
                       --ckpt_path=data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt 

I get this error:

Seed set to 42
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[rank: 0] Seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Total number of chips: 1102
/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:639: Checkpoint directory /home/brunosan/code/Clay/model/checkpoints exists and is not empty.
Restoring states from the checkpoint path at data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type | Params
-------------------------------
0 | model | CLAY | 127 M 
-------------------------------
127 M     Trainable params
0         Non-trainable params
127 M     Total params
510.809   Total estimated model params size (MB)
Traceback (most recent call last):
  File "/home/brunosan/code/Clay/model/trainer.py", line 77, in <module>
    cli_main()
  File "/home/brunosan/code/Clay/model/trainer.py", line 64, in cli_main
    cli = LightningCLI(
          ^^^^^^^^^^^^^
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/cli.py", line 386, in __init__
    self._run_subcommand(self.subcommand)
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/cli.py", line 677, in _run_subcommand
    fn(**fn_kwargs)
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 980, in _run
    self._checkpoint_connector.restore_training_state()
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 296, in restore_training_state
    self.restore_optimizers_and_schedulers()
  File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 362, in restore_optimizers_and_schedulers
    raise KeyError(
KeyError: 'Trying to restore optimizer state but checkpoint contains only the model. This is probably due to `ModelCheckpoint.save_weights_only` being set to `True`.'
@yellowcap (Member)

Looks like we are indeed only saving the weights. Not sure if that means we cannot continue training or if there is a workaround. @weiji14 and @srmsoumya?

save_weights_only=True,

@weiji14 (Contributor)

weiji14 commented Feb 11, 2024

Yeah, we did not save the AdamW optimizer state, so it won't be possible to resume training from that checkpoint with AdamW or any other adaptive optimization algorithm. It might be possible to resume with a non-adaptive optimizer such as Stochastic Gradient Descent, but that would require a lot of manual handling of the checkpoint loading, so it is not a straightforward workaround.

That said, the original objective seems to be finetuning the checkpoint on a specific region rather than resuming the self-supervised training. The entry point for that shouldn't be trainer.py, but a separate finetuning script (which could technically still reuse elements of the MAE training loop).
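
As a rough illustration of that workaround, here is a minimal sketch of loading only the model weights and starting a fresh fit, so Lightning initializes a new optimizer instead of trying to restore the missing state. The class names, import paths, and hyperparameters below are assumptions for illustration, not the repo's confirmed API.

import torch
import lightning as L

from src.model_clay import CLAYModule        # assumed module path / class name
from src.datamodule import ClayDataModule    # assumed module path / class name

# A checkpoint saved with save_weights_only=True still contains "state_dict",
# just not the optimizer/scheduler states.
ckpt = torch.load(
    "data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt", map_location="cpu"
)
model = CLAYModule()  # assumed default hyperparameters
model.load_state_dict(ckpt["state_dict"])

datamodule = ClayDataModule(data_dir="data/chips")

# No ckpt_path is passed to fit(), so AdamW starts from a freshly
# initialized state rather than attempting to restore it.
trainer = L.Trainer(max_epochs=100, precision="bf16-mixed")
trainer.fit(model=model, datamodule=datamodule)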

@brunosan (Member, Author)

brunosan commented Mar 1, 2024

The main use case is to resume training if it halts (e.g. we were using Spot instances), but I can also see cases where a regional user might want to continue training with regional data.

If we choose not to save the optimizer state, we should document how to resume training with newly initialized optimizers.

brunosan added a commit that referenced this issue Mar 2, 2024
@yellowcap (Member)

I agree we should have a way to resume training for the checkpoints we save (or at least the last one), if that is technically possible and won't slow down training too much.

@yellowcap (Member)

We have addressed this for v0.2, and will do the same for v1, by storing the optimizer state during training. So I am closing this, but feel free to reopen if the issue persists in future versions of the model.

@brunosan (Member, Author)

Not saving the optimizer state still remains the default:

save_weights_only=True,
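
For reference, a hedged sketch of what a resumable checkpoint configuration could look like; the argument values are illustrative, not the repo's exact settings:

from lightning.pytorch.callbacks import ModelCheckpoint

# Illustrative configuration: save the full training state so that
# `trainer.py fit --ckpt_path=...` can also restore the optimizer.
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",
    monitor="val/loss",       # assumed metric name logged by the module
    save_top_k=1,
    save_last=True,           # always keep a resumable last.ckpt
    save_weights_only=False,  # include optimizer and LR scheduler state
)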

@srmsoumya (Collaborator)

Addressed in PR #193
