
it takes too long for DynamicBucketingSampler to load state dict #1327

Open
Mahaotian1 opened this issue Apr 23, 2024 · 5 comments

@Mahaotian1

When I resumed training on 30,000 hours of data from a checkpoint, it took a long time (more than 2 hours) to load the state dict for DynamicBucketingSampler. Is this normal?

Here is my code:

train_sampler = DynamicBucketingSampler(
    cuts_train,
    max_duration=self.args.max_duration,
    shuffle=self.args.shuffle,
    buffer_size=self.args.buffer_size,                  # 40000
    shuffle_buffer_size=self.args.shuffle_buffer_size,  # 100000
    quadratic_duration=10,
    num_cuts_for_bins_estimate=10000,
    drop_last=True,
)
logging.info("Loading sampler state dict")
train_sampler.load_state_dict(sampler_state_dict)
@pzelasko
Collaborator

Unfortunately, yes. Restoring the sampler's state quickly is quite tricky, and I don't recommend using this technique with large data. Instead, it's easier to discard the sampler state and change the random seed to re-randomize the training data.
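
A minimal sketch of that alternative when resuming from a checkpoint might look like the following; the `resume_epoch` offset is only an illustrative way to derive a new seed and is not part of Lhotse:

# Sketch: skip restoring the sampler state when resuming; instead derive a
# fresh seed so the shuffling order differs from the previous run.
# `resume_epoch` is a hypothetical variable used for illustration.
train_sampler = DynamicBucketingSampler(
    cuts_train,
    max_duration=self.args.max_duration,
    shuffle=self.args.shuffle,
    buffer_size=self.args.buffer_size,
    drop_last=True,
    seed=self.args.seed + resume_epoch,  # changed seed instead of restoring state
)
# train_sampler.load_state_dict(sampler_state_dict)  # intentionally skipped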

@Mahaotian1
Author

Thank you for your reply. I have another question: during training on large-scale data, I use load_manifest_lazy to read the data and draw every batch from it. Will this cause CPU memory to fill up?

@pzelasko
Collaborator

No, CPU RAM usage should be bounded by the buffer_size setting in the sampler.
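
A rough sketch of that pattern (the manifest path and values below are examples only):

# Sketch: lazily iterate a large manifest; the sampler keeps at most
# buffer_size cuts in CPU RAM while forming buckets.
from lhotse import load_manifest_lazy
from lhotse.dataset import DynamicBucketingSampler

cuts_train = load_manifest_lazy("data/fbank/cuts_train.jsonl.gz")  # example path
train_sampler = DynamicBucketingSampler(
    cuts_train,
    max_duration=200.0,  # example value
    shuffle=True,
    buffer_size=40000,   # upper bound on cuts buffered in memory
)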

@Mahaotian1
Author

Why does CPU memory keep increasing during training until it is full? Is it a problem with the HDF5 files? How can I free up the memory?

@pzelasko
Collaborator

Are you using HDF5 files? We have a workaround fix in the ASR dataset class, but IIRC it only slows down the memory leak. You can try using the Lhotse Shar format instead, or LilcomChunkyWriter; both are free from these issues. For large data, Lhotse Shar is recommended as it is much more I/O efficient.
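
A rough sketch of the two suggested alternatives (paths are examples only, and exact arguments may differ between Lhotse versions):

# Sketch: store features with LilcomChunkyWriter, or export cuts to Lhotse Shar.
from lhotse import CutSet, Fbank, LilcomChunkyWriter

cuts = CutSet.from_file("data/cuts_train.jsonl.gz")  # example path

# Option 1: compute features into lilcom chunky storage instead of HDF5.
cuts = cuts.compute_and_store_features(
    extractor=Fbank(),
    storage_path="data/fbank/feats_train",
    storage_type=LilcomChunkyWriter,
    num_jobs=4,
)

# Option 2: export to Lhotse Shar for sequential, I/O-efficient reading.
cuts.to_shar(
    output_dir="data/shar/train",
    fields={"features": "lilcom"},
    shard_size=10000,
)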
