
it takes too long for DynamicBucketingSampler to load state dict #1327

Open
Mahaotian1 opened this issue Apr 23, 2024 · 5 comments

@Mahaotian1

When I resumed training on 30,000 hours of data from a checkpoint, it took a long time (more than 2 hours) to load the state dict for DynamicBucketingSampler. Is this normal?

Here is my code:

train_sampler = DynamicBucketingSampler(
    cuts_train,
    max_duration=self.args.max_duration,
    shuffle=self.args.shuffle,
    buffer_size=self.args.buffer_size,                  # 40000
    shuffle_buffer_size=self.args.shuffle_buffer_size,  # 100000
    quadratic_duration=10,
    num_cuts_for_bins_estimate=10000,
    drop_last=True,
)
logging.info("Loading sampler state dict")
train_sampler.load_state_dict(sampler_state_dict)
@pzelasko
Collaborator

Unfortunately, yes. Restoring the sampler's state quickly is quite tricky, and I don't recommend using this technique with large data. Instead, it's easier to discard the sampler state and change the random seed to re-randomize the training data.
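
A minimal sketch of that alternative when resuming from a checkpoint might look like the following; the `resume_epoch` offset is only an illustrative way to derive a new seed and is not part of Lhotse:

# Sketch: skip restoring the sampler state when resuming; instead derive a
# fresh seed so the shuffling order differs from the previous run.
# `resume_epoch` is a hypothetical variable used for illustration.
train_sampler = DynamicBucketingSampler(
    cuts_train,
    max_duration=self.args.max_duration,
    shuffle=self.args.shuffle,
    buffer_size=self.args.buffer_size,
    drop_last=True,
    seed=self.args.seed + resume_epoch,  # changed seed instead of restoring state
)
# train_sampler.load_state_dict(sampler_state_dict)  # intentionally skipped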

@Mahaotian1
Author

Thank you for your reply. I have another question: during training on large-scale data, I use load_manifest_lazy to read the data and draw every batch from it. Will this cause CPU memory to fill up?

@pzelasko
Collaborator

No, CPU RAM usage should be bounded by the buffer_size setting in the sampler.
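
A rough sketch of that pattern (the manifest path and values below are examples only):

# Sketch: lazily iterate a large manifest; the sampler keeps at most
# buffer_size cuts in CPU RAM while forming buckets.
from lhotse import load_manifest_lazy
from lhotse.dataset import DynamicBucketingSampler

cuts_train = load_manifest_lazy("data/fbank/cuts_train.jsonl.gz")  # example path
train_sampler = DynamicBucketingSampler(
    cuts_train,
    max_duration=200.0,  # example value
    shuffle=True,
    buffer_size=40000,   # upper bound on cuts buffered in memory
)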

@Mahaotian1
Author

Why does CPU memory keep increasing during training until it is full? Is it a problem with the HDF5 files? How can I free up the memory?

@pzelasko
Collaborator

Are you using HDF5 files? We have a workaround fix in the ASR dataset class, but IIRC it only slows down the memory leak. You can try using the Lhotse Shar format instead, or LilcomChunkyWriter; both are free from these issues. For large data, Lhotse Shar is recommended as it is much more I/O efficient.
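
A rough sketch of the two suggested alternatives (paths are examples only, and exact arguments may differ between Lhotse versions):

# Sketch: store features with LilcomChunkyWriter, or export cuts to Lhotse Shar.
from lhotse import CutSet, Fbank, LilcomChunkyWriter

cuts = CutSet.from_file("data/cuts_train.jsonl.gz")  # example path

# Option 1: compute features into lilcom chunky storage instead of HDF5.
cuts = cuts.compute_and_store_features(
    extractor=Fbank(),
    storage_path="data/fbank/feats_train",
    storage_type=LilcomChunkyWriter,
    num_jobs=4,
)

# Option 2: export to Lhotse Shar for sequential, I/O-efficient reading.
cuts.to_shar(
    output_dir="data/shar/train",
    fields={"features": "lilcom"},
    shard_size=10000,
)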
