Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data state restoration #64

Merged
merged 9 commits into from May 22, 2024
Merged

data state restoration #64

merged 9 commits into from May 22, 2024

Conversation

TimotheeMickus
Copy link
Collaborator

@TimotheeMickus TimotheeMickus commented Mar 26, 2024

closes #63 .

same idea as v2, didn't bother porting from it.

does not embark bucket states, although this could maybe be done by picking the line indices from all examples in the reservoir. Would come at a (potentially high) communication overhead cost.

@TimotheeMickus TimotheeMickus marked this pull request as ready for review May 21, 2024 07:46
@TimotheeMickus
Copy link
Collaborator Author

also embarks a minor debug on update_vocab being only partially removed (following #61)

@@ -162,13 +184,13 @@ def _cast(example_dict):
if self.transforms is not None else lambda x: x
),
stride=self.stride,
offset=self.offset,
offset=offset,
)
examples = map(_cast, examples)
yield from examples

# FIXME: some RNN archs require sorting src's by length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of this PR concenrs... do we still support RNNs? if not, we can just delete this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, we no longer support RNNs. the move is to externalize transformer variantsvia lucidrains (see #56). Will remove the comment

@@ -95,12 +96,14 @@ def save(self, step, moving_average=None):
self._rm_checkpoint(todel)
self.checkpoint_queue.append(chkpt_names)

def _save(self, step):
def _save(self, step, save_model, data_state, device_context):
"""Save a resumable checkpoint.

Args:
step (int): step number
model (nn.Module): torch model to save
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing: save_model (type): description

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i think it was renamed from model to save_model? not sure why. will fix that.

Copy link
Member

@jrvc jrvc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm :)

@TimotheeMickus TimotheeMickus merged commit dc01039 into main May 22, 2024
2 checks passed
@TimotheeMickus TimotheeMickus deleted the feats/data-restoration branch May 22, 2024 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data state restoration
2 participants