Fix for Resuming Training gives CheckpointMismatchError #1310

Open
wants to merge 7 commits into
base: master

@pablo-sanchez-sony pablo-sanchez-sony commented Aug 28, 2023

Link to the relevant Bug(s)

This pull request addresses #1300. The fix modifies the file src/pykeen/training/training_loop.py.

I've also updated minor things in the following two tracker files:

Description of the Change

The problem arises when a PyTorch learning rate scheduler is used. When the scheduler's constructor is called with last_epoch=-1, it sets an initial_lr entry on each of the optimizer's parameter groups:

https://github.com/pytorch/pytorch/blob/a5d841ef01e615e2a654fb12cf0cd08697d12ccf/torch/optim/lr_scheduler.py#L38

As a result, str(self.optimizer).encode("utf-8") differs from the value behind the checkpoint's checksum, because at the point of comparison neither the optimizer nor the scheduler has been reloaded yet.
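To illustrate the effect without depending on PyTorch, here is a minimal sketch. `ToyOptimizer`, `attach_toy_scheduler`, and `checksum` are hypothetical stand-ins written for this example, not PyTorch or PyKEEN APIs, and the use of `hashlib.md5` is an assumption. The sketch only mimics the side effect described above: a scheduler created with last_epoch=-1 adds an initial_lr key to each parameter group, which changes the optimizer's string form and therefore any checksum derived from it.

```python
import hashlib


class ToyOptimizer:
    """Hypothetical stand-in for an optimizer whose repr lists all param-group keys."""

    def __init__(self, lr):
        self.param_groups = [{"lr": lr}]

    def __repr__(self):
        parts = [
            f"{key}: {value}"
            for group in self.param_groups
            for key, value in sorted(group.items())
        ]
        return "ToyOptimizer(" + ", ".join(parts) + ")"


def attach_toy_scheduler(optimizer, last_epoch=-1):
    """Mimic the constructor side effect: add initial_lr on a fresh start."""
    if last_epoch == -1:
        for group in optimizer.param_groups:
            group.setdefault("initial_lr", group["lr"])


def checksum(optimizer):
    """Hash the optimizer's string form (md5 is an assumption for this sketch)."""
    return hashlib.md5(str(optimizer).encode("utf-8")).hexdigest()


optimizer = ToyOptimizer(lr=0.01)
before = checksum(optimizer)
attach_toy_scheduler(optimizer)  # adds "initial_lr" to the param group
after = checksum(optimizer)
print(before == after)
```

The two digests differ only because the string form gained the extra key; no hyperparameter was actually changed.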

I believe the issue can be solved by moving the checksum comparison to the end of the _load_state method:

 if checkpoint["checksum"] != self.checksum: 
     raise CheckpointMismatchError( 
         f"The checkpoint file '{path}' that was provided already exists, but seems to be " 
         f"from a different training loop setup.", 
     ) 
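The proposed reordering can be sketched as follows. This is a toy model under stated assumptions, not the actual PyKEEN training loop: `ToyLoop`, its `save` and `load_state` methods, and the locally defined `CheckpointMismatchError` are all hypothetical names for illustration. The checksum here covers a model name (never restored from the checkpoint) and the optimizer state (restored), so comparing after the restore still rejects a checkpoint from a different setup.

```python
import hashlib


class CheckpointMismatchError(RuntimeError):
    """Local stand-in for the error raised on a checksum mismatch."""


class ToyLoop:
    """Hypothetical training loop holding a model name and optimizer state."""

    def __init__(self, model_name, optimizer_state):
        self.model_name = model_name
        self.optimizer_state = dict(optimizer_state)

    @property
    def checksum(self):
        payload = f"{self.model_name}|{sorted(self.optimizer_state.items())}"
        return hashlib.md5(payload.encode("utf-8")).hexdigest()

    def save(self):
        return {
            "optimizer_state": dict(self.optimizer_state),
            "checksum": self.checksum,
        }

    def load_state(self, checkpoint, path):
        # Restore the optimizer state first ...
        self.optimizer_state = dict(checkpoint["optimizer_state"])
        # ... and only then compare checksums.
        if checkpoint["checksum"] != self.checksum:
            raise CheckpointMismatchError(
                f"The checkpoint file '{path}' that was provided already exists, "
                f"but seems to be from a different training loop setup.",
            )


# A loop that trained with a scheduler has the extra initial_lr key.
trained = ToyLoop("TransE", {"lr": 0.01, "initial_lr": 0.01})
checkpoint = trained.save()

# A freshly built loop differs before loading, but matches afterwards.
fresh = ToyLoop("TransE", {"lr": 0.01})
mismatch_before_load = checkpoint["checksum"] != fresh.checksum
fresh.load_state(checkpoint, "checkpoint.pt")
```

In this sketch a loop built with a different model name still raises `CheckpointMismatchError` after the restore, because the model part of the checksum is not overwritten by loading.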

Other changes

Possible Drawbacks

Verification Process

I verified that the introduced changes work in my particular case.

Release Notes

  • Fixed an issue in which loading a checkpoint with an LR scheduler raised an error because the learning rate was not properly loaded.
  • Fixed an issue in which the JSONResultTracker was not able to dump Python objects.
  • Fixed an issue in which allow_val_change of the wandb tracker was a dummy variable.

Member

cthoyt commented Aug 28, 2023

Please add tests that illustrate where this might be applicable

@cthoyt cthoyt added the 🛑 Checkpoints issues related to checkpoints and resuming training label Sep 24, 2023
3 participants