Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite of the load_checkpoint function #650

Open
wants to merge 28 commits into
base: main
Choose a base branch
from

Conversation

Xirider
Copy link
Contributor

@Xirider Xirider commented Feb 14, 2023

This is a rewrite of how we determine which checkpoint to load when starting/restarting a training run. (Originally there was also a refactor of how our different checkpoint paths are processed, but I seperated this out for now)

Previously we had a quite brittle logic for this, with edge cases where metaseq would not load the correct checkpoint. For example here: #544

The new logic checks first all possible sources for checkpoints (restore-file, finetune-from, local checkpoints, nfs / azure checkpoints), and assigns them priority based on their progress in training and prefers local caches.

It then takes the most recent checkpoint and copies it to local disk.

To test this you need both metaseq / metaseq-internal PR's. Here: https://github.com/fairinternal/metaseq-internal/pull/842

I tested:

  • multi-node and single node local start
  • with and without nfs cloud upload path
  • crashing a run and correct continuing
  • with and without finetune-from
  • with and without resume-file
  • evals

What I didn't test yet, is if starting with an azure blob path is working.

@Xirider
Copy link
Contributor Author

Xirider commented Feb 14, 2023

Will test this on azure soon.

@suchenzang
Copy link
Contributor

Heads up: #646 will likely go in first since tests are passing there (after loss parity check is added). There will probably be merge conflicts after, but hopefully not too bad.

Copy link
Contributor

@suchenzang suchenzang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit on getting the checkpoint saving/uploading tests to pass (may need to modify to adjust to new logic, just need to be clear why the modification to the test is needed)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants