Webdataset reader behavior with many sources #5429

Open
evgeniishch opened this issue Apr 16, 2024 · 1 comment
Labels: question (Further information is requested)

@evgeniishch

Describe the question.

nvidia.dali.fn.readers.webdataset supports reading from multiple tar files, specified as a list of paths

How is reading from multiple sources performed? Are all sources read sequentially one after another?
What happens when the random_shuffle parameter is set to True? Are samples drawn into the buffer from one source, or from all sources with some distribution?
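
For context, this is roughly how I set the reader up (the shard paths, extensions, and pipeline parameters below are just placeholders for my actual setup):

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

# Placeholder shard list -- the real dataset consists of many more tar files.
tar_paths = ["data/shard-000.tar", "data/shard-001.tar", "data/shard-002.tar"]

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def wds_pipeline():
    # A single reader fed with a list of tar archives.
    jpegs, labels = fn.readers.webdataset(
        paths=tar_paths,
        ext=["jpg", "cls"],   # which components to return for each sample
        random_shuffle=True,
    )
    return jpegs, labels

pipe = wds_pipeline()
pipe.build()
```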

Thank you

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report
@JanuszL (Contributor) commented Apr 16, 2024

Hi @evgeniishch,

Thank you for reaching out.
Answering your questions:

> How is reading from multiple sources performed? Are all sources read sequentially one after another?

Setting sharding aside (each pipeline is assigned a separate, non-overlapping shard of the data), reading within a pipeline is done sequentially, so the listed sources are processed one after another.
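
For illustration, a rough sketch of how sharding splits the work between pipelines (the paths and pipeline parameters below are arbitrary, not taken from your setup):

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

tar_paths = ["data/shard-000.tar", "data/shard-001.tar"]  # illustrative paths

@pipeline_def(batch_size=64, num_threads=4)
def sharded_pipeline(shard_id, num_shards):
    # Each pipeline instance sees only its own, non-overlapping part of the
    # dataset and walks through it sequentially.
    jpegs, labels = fn.readers.webdataset(
        paths=tar_paths,
        ext=["jpg", "cls"],
        shard_id=shard_id,
        num_shards=num_shards,
    )
    return jpegs, labels

# For example, one pipeline per GPU:
pipes = [sharded_pipeline(shard_id=i, num_shards=2, device_id=i) for i in range(2)]
```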

> What happens when the random_shuffle parameter is set to True? Are samples drawn into the buffer from one source, or from all sources with some distribution?

DALI uses an internal buffer of a fixed size (the initial_fill parameter) into which data is read sequentially; when a batch is created, samples are drawn from this buffer at random. Datasets stored in containers (RecordIO, TFRecord, or webdataset) are expected to be pre-shuffled to avoid grouping samples that belong to one class; otherwise, the first batches may cover only a very small fraction of the classes present in the whole dataset.
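
As an illustration, the buffer size is controlled by the initial_fill argument of the reader (the value below is arbitrary):

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def shuffled_pipeline():
    jpegs, labels = fn.readers.webdataset(
        paths=["data/shard-000.tar", "data/shard-001.tar"],  # placeholder paths
        ext=["jpg", "cls"],
        random_shuffle=True,  # draw each batch randomly from the buffer
        initial_fill=4096,    # number of samples kept in the shuffle buffer
    )
    return jpegs, labels
```

Generally, a larger initial_fill mixes samples from a wider range of the sequentially read data, at the cost of additional host memory.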
