Webdataset reader behavior with many sources #5429

Open
evgeniishch opened this issue Apr 16, 2024 · 1 comment
Labels: question (Further information is requested)

@evgeniishch

Describe the question.

nvidia.dali.fn.readers.webdataset supports reading from multiple tar files, specified as a list of paths

How is reading from multiple sources performed? Are all sources read sequentially one after another?
What happens when the random_shuffle parameter is set to True? Are samples drawn into the buffer from one source, or from all sources with some distribution?
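
For context, this is roughly how I set the reader up (the shard paths, extensions, and pipeline parameters below are just placeholders for my actual setup):

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

# Placeholder shard list -- the real dataset consists of many more tar files.
tar_paths = ["data/shard-000.tar", "data/shard-001.tar", "data/shard-002.tar"]

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def wds_pipeline():
    # A single reader fed with a list of tar archives.
    jpegs, labels = fn.readers.webdataset(
        paths=tar_paths,
        ext=["jpg", "cls"],   # which components to return for each sample
        random_shuffle=True,
    )
    return jpegs, labels

pipe = wds_pipeline()
pipe.build()
```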

Thank you

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report
@JanuszL (Contributor) commented Apr 16, 2024

Hi @evgeniishch,

Thank you for reaching out.
Answering your questions:

> How is reading from multiple sources performed? Are all sources read sequentially one after another?

Setting sharding aside (each pipeline is assigned a separate, non-overlapping shard of the data), reading within a pipeline is done sequentially, so the listed sources are processed one after another.
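
For illustration, a rough sketch of how sharding splits the work between pipelines (the paths and pipeline parameters below are arbitrary, not taken from your setup):

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

tar_paths = ["data/shard-000.tar", "data/shard-001.tar"]  # illustrative paths

@pipeline_def(batch_size=64, num_threads=4)
def sharded_pipeline(shard_id, num_shards):
    # Each pipeline instance sees only its own, non-overlapping part of the
    # dataset and walks through it sequentially.
    jpegs, labels = fn.readers.webdataset(
        paths=tar_paths,
        ext=["jpg", "cls"],
        shard_id=shard_id,
        num_shards=num_shards,
    )
    return jpegs, labels

# For example, one pipeline per GPU:
pipes = [sharded_pipeline(shard_id=i, num_shards=2, device_id=i) for i in range(2)]
```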

> What happens when the random_shuffle parameter is set to True? Are samples drawn into the buffer from one source, or from all sources with some distribution?

DALI uses an internal buffer of a fixed size (the initial_fill parameter) into which data is read sequentially; when a batch is created, samples are drawn from this buffer at random. Datasets stored in containers (RecordIO, TFRecord, or webdataset) are expected to be pre-shuffled to avoid grouping samples that belong to one class; otherwise, the first batches may cover only a very small fraction of the classes present in the whole dataset.
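
As an illustration, the buffer size is controlled by the initial_fill argument of the reader (the value below is arbitrary):

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def shuffled_pipeline():
    jpegs, labels = fn.readers.webdataset(
        paths=["data/shard-000.tar", "data/shard-001.tar"],  # placeholder paths
        ext=["jpg", "cls"],
        random_shuffle=True,  # draw each batch randomly from the buffer
        initial_fill=4096,    # number of samples kept in the shuffle buffer
    )
    return jpegs, labels
```

Generally, a larger initial_fill mixes samples from a wider range of the sequentially read data, at the cost of additional host memory.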
