Data state restoration #63

TimotheeMickus · 2024-03-16T07:40:13Z

Currently, the --train_from option provides no means of restoring corpus states, so training resumes from the beginning of the bitexts. As a result, resumed models only train on a subset of the available data, unless some manual shuffling is done between each resumption.

Fixes (partially implemented on V2, but never ported here):

Add a line index as files are read, pass it along when collating batches, and skip up to this line index upon training resumption.
A more complex refactoring would be needed to save the full dataloader state, as it would involve communicating all examples in the reservoir, which would be much more costly.
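
The first fix could look roughly like the sketch below. This is only an illustration of the idea, not the code that exists on V2 or in this repository: `read_indexed`, `collate` and `skip_until` are hypothetical names, and the sketch assumes a single corpus read sequentially, with no reservoir or shuffling in between.

```python
import itertools
from typing import Iterable, Iterator, List, Tuple


def read_indexed(lines: Iterable[str], skip_until: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (line_index, line) pairs, skipping the first `skip_until` lines.

    `skip_until` is a hypothetical parameter: on resumption it would be set
    to the line index stored alongside the checkpoint.
    """
    return itertools.islice(enumerate(lines), skip_until, None)


def collate(
    indexed_examples: Iterable[Tuple[int, str]], batch_size: int
) -> Iterator[Tuple[List[str], int]]:
    """Group examples into batches, each carrying the index of the next unread line."""
    batch: List[str] = []
    last_idx = -1
    for idx, example in indexed_examples:
        batch.append(example)
        last_idx = idx
        if len(batch) == batch_size:
            yield batch, last_idx + 1
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch, last_idx + 1


corpus = ["a", "b", "c", "d", "e"]

# First run: consume one batch, then "checkpoint" the line index it carries.
first_batch, resume_at = next(collate(read_indexed(corpus), batch_size=2))

# Resumed run: skip everything that was already consumed.
resumed = list(collate(read_indexed(corpus, skip_until=resume_at), batch_size=2))
```

Storing the index of the next unread line, rather than the last line read, means a fresh run and a resumed run both go through the same `skip_until` path. With a reservoir in place, lines that were read but not yet emitted in a batch would need separate handling, which is the costlier case described above.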

CC @jrvc


TimotheeMickus added the enhancement New feature or request label Mar 16, 2024

TimotheeMickus mentioned this issue Mar 26, 2024

data state restoration #64

Merged

TimotheeMickus closed this as completed in #64 May 22, 2024
