
[FEATURE] When saving multiple epochs, add an epoch number suffix when save best=False #597

Closed
Quetzalcohuatl opened this issue Jan 31, 2024 · 3 comments · Fixed by #721
Labels
type/feature Feature request

Comments

@Quetzalcohuatl
Contributor

🚀 Feature

Save a separate .pth file at each checkpoint instead of overwriting checkpoint.pth every time.

Motivation

It is often useful to see how the model performs at each epoch or save point. For example, when training an LLM, you may want to measure its generative capabilities after each epoch and see whether they are improving.

@Quetzalcohuatl Quetzalcohuatl added the type/feature Feature request label Jan 31, 2024
@Quetzalcohuatl
Contributor Author

Example: after epoch 1 it saves checkpoint_ep01.pth

After epoch 2 it saves checkpoint_ep02.pth

When loading the model back in according to the config, it will by default load sorted(glob("checkpoint_ep*"))[-1], i.e. the last epoch, so the behavior stays the same as it is now.

Alternatively, if save_best_only=true, keep the current behavior of saving as checkpoint.pth?
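
The naming and loading scheme proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: `checkpoint_name` and `latest_checkpoint` are hypothetical helpers. One caveat the sketch addresses: a plain lexicographic `sorted(...)[-1]` would rank `checkpoint_ep99.pth` after `checkpoint_ep100.pth`, so the epoch number is parsed and compared as an integer instead.

```python
import re
from pathlib import Path
from typing import Optional

# Matches the per-epoch naming proposed in this issue, e.g. checkpoint_ep01.pth
_EPOCH_RE = re.compile(r"^checkpoint_ep(\d+)\.pth$")


def checkpoint_name(epoch: int, save_best: bool) -> str:
    """Hypothetical helper: pick the checkpoint file name for an epoch."""
    if save_best:
        # Keep the existing single-file behavior when only the best model is saved.
        return "checkpoint.pth"
    # Zero-pad to two digits, as in the example (ep01, ep02, ...).
    return f"checkpoint_ep{epoch:02d}.pth"


def latest_checkpoint(directory: Path) -> Optional[Path]:
    """Hypothetical helper: return the checkpoint with the highest epoch number."""
    candidates = []
    for path in directory.glob("checkpoint_ep*.pth"):
        match = _EPOCH_RE.match(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if candidates:
        # Compare epochs numerically so ep100 ranks above ep99.
        return max(candidates, key=lambda item: item[0])[1]
    # Fall back to the single-file name used today, if it exists.
    fallback = directory / "checkpoint.pth"
    return fallback if fallback.exists() else None
```

Falling back to `checkpoint.pth` keeps existing experiment directories loadable when no per-epoch files are present.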

@psinger
Collaborator

psinger commented Jan 31, 2024

We didn't do that by default, as model weights take a ton of disk space.

We could theoretically make it a separate setting to additionally save all checkpoints, wdyt?

@Quetzalcohuatl
Contributor Author

> We didn't do that by default, as model weights take a ton of disk space.
>
> We could theoretically make it a separate setting to additionally save all checkpoints, wdyt?

Most research papers train for only 1 epoch, sometimes 2. If the user knows what they're doing and wants to enable it, I think it's a nice option, especially since it's a simple implementation.
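
The opt-in setting discussed in this thread might look something like the sketch below. Everything here is an assumption made for illustration: the `CheckpointSettings` class, the `save_all_checkpoints` flag, and the `save_checkpoint` function are hypothetical names, and raw bytes stand in for real model weights so the example runs without any ML library. The default stays off, so disk usage is unchanged unless the user enables it.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class CheckpointSettings:
    # Hypothetical flag: off by default because model weights are large.
    save_all_checkpoints: bool = False


def save_checkpoint(
    payload: bytes, output_dir: Path, epoch: int, settings: CheckpointSettings
) -> List[Path]:
    """Write checkpoint.pth and, if enabled, an extra per-epoch copy."""
    output_dir.mkdir(parents=True, exist_ok=True)

    # Current behavior: always overwrite the single checkpoint file.
    main_path = output_dir / "checkpoint.pth"
    main_path.write_bytes(payload)  # stand-in for serializing real weights
    written = [main_path]

    # Opt-in behavior: additionally keep a copy tagged with the epoch number.
    if settings.save_all_checkpoints:
        epoch_path = output_dir / f"checkpoint_ep{epoch:02d}.pth"
        shutil.copyfile(main_path, epoch_path)
        written.append(epoch_path)
    return written
```

Writing the per-epoch file in addition to `checkpoint.pth`, rather than instead of it, means existing loading code keeps working; the cost is one extra copy of the weights per epoch.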

3 participants