
Checkpoints missing optimizer_states #45

Open
rahm-hopkins opened this issue Mar 18, 2022 · 1 comment

@rahm-hopkins
Thank you for your work on this very useful library!

I have had success training Albert Unbiased from scratch. I'm curious how model performance would compare if training continued from one of your checkpoints (unbiased-albert-c8519128.ckpt in this case). However, if I attempt to launch train.py with this file, I get the following error:

KeyError: 'Trying to restore training state but checkpoint contains only the model. This is probably due to ModelCheckpoint.save_weights_only being set to True.'

FYI I am using the following command:

python train.py --config configs/Unintended_bias_toxic_comment_classification_Albert_revised_training.json -d 1 --num_workers 0 -e 101 -r model_ckpts/unbiased-albert-c8519128_modified_state_dict.ckpt

Inspecting the checkpoint file, I can indeed see that it is missing several components, the most critical of which (I think) is optimizer_states. Compared to one of my own checkpoints, the absent keys are: ['pytorch-lightning_version', 'callbacks', 'optimizer_states', 'lr_schedulers', 'hparams_name', 'hyper_parameters'].
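For reference, here is a minimal sketch of the comparison I did. The model and file names below are placeholders standing in for the real Albert checkpoint; the point is just that a weights-only checkpoint carries state_dict but not the training-state keys:

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the Albert classifier.
model = nn.Linear(4, 2)
opt = torch.optim.Adam(model.parameters())

# A "weights-only" checkpoint, as produced when
# ModelCheckpoint.save_weights_only is True:
torch.save({"state_dict": model.state_dict()}, "weights_only.ckpt")

# A full training checkpoint also carries optimizer and scheduler state:
torch.save(
    {
        "state_dict": model.state_dict(),
        "optimizer_states": [opt.state_dict()],
        "lr_schedulers": [],
    },
    "full.ckpt",
)

weights_only = torch.load("weights_only.ckpt")
full = torch.load("full.ckpt")
missing = set(full) - set(weights_only)
print(sorted(missing))  # -> ['lr_schedulers', 'optimizer_states']
```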

Am I doing something wrong? If not, would it be possible for you to share new versions of your checkpoints that include these missing components?

@laurahanu (Collaborator)

Hello!

Yes, we only saved the weights to keep the file sizes small, since the optimizer state is not needed for prediction. If you use the same data and training instructions, the resulting full checkpoint should match what you already have; you can check this by running the model on the test set.
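One workaround for continuing training from a weights-only checkpoint: load the state_dict into the model manually and start with a fresh optimizer, rather than asking the trainer to resume full training state. This is a sketch under assumptions (the model here is a placeholder, not the repo's actual Lightning module); the cost is that the optimizer's accumulated state, e.g. Adam's moment estimates, restarts from zero:

```python
import torch
import torch.nn as nn

# Placeholder model; in the real project this would be the Lightning
# module constructed by train.py.
model = nn.Linear(4, 2)
torch.save({"state_dict": model.state_dict()}, "weights_only.ckpt")

# Load only the weights...
ckpt = torch.load("weights_only.ckpt")
model.load_state_dict(ckpt["state_dict"])

# ...then create a fresh optimizer instead of restoring optimizer_states.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
# Continue the training loop from here; the optimizer state starts empty.
```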

Hope that helps!
