
Please support prefetch with python datasets #5323

Open
bionicles opened this issue Mar 14, 2024 · 2 comments
Labels: enhancement (New feature or request)


bionicles commented Mar 14, 2024

Is your feature request related to a problem? Please describe.
There's a tremendous performance difference between datasets which are fully tensor end-to-end and datasets where some data wrangling happens in Python.

I was hoping to use "prefetch" to prepare data with the CPU while the GPU does work, but unfortunately this only seems to help when the data preparation is expressed entirely in TensorFlow ops (not sure of the right term here).

Python I/O operations are often orders of magnitude slower, act as a bottleneck, and prevent computers from keeping accelerators working at capacity.
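
For concreteness, here's a minimal sketch of the pattern I mean (hypothetical toy code, not my actual pipeline):

```python
import time

import tensorflow as tf


def slow_python_generator():
    # Stand-in for Python-side data wrangling (parsing, GET requests, gym envs, ...).
    for i in range(1000):
        time.sleep(0.01)  # pretend this is real Python CPU / IO work
        yield [float(i)]


ds = tf.data.Dataset.from_generator(
    slow_python_generator,
    output_signature=tf.TensorSpec(shape=(1,), dtype=tf.float32),
)

# In my experience this prefetch doesn't hide the Python work well,
# presumably because the generator still runs under the main driver process's GIL.
ds = ds.prefetch(tf.data.AUTOTUNE)
```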

Describe the solution you'd like
I wish tf.data.Dataset prefetch were more broadly compatible with plain-Python (non-TensorFlow) data preparation.

Would it be possible for prefetch to use performant C++ to sidestep Python GIL issues and schedule Python data-wrangling CPU processes alongside GPU training / inference, without requiring that Python CPU work to happen in the main Python driver process? I just want to be able to prefetch custom Python datasets. Often there's some prep involved; not every dataset is TensorFlow end-to-end.

i.e. instead of (python)->(gpu) what if it were

(python)->(cpp)
(cpp)->(python_prefetch)
(cpp)->(gpu/tpu accelerator)

Since C++ has no GIL, it could run the Python generator in a process isolated from the main Python driver process's GIL. You'd still have a GIL per generator, but that's easy to work around: just run more Python processes with different random seeds, etc. A rough sketch of the idea follows.
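
To illustrate the shape of what I'm asking for, here's a sketch using plain multiprocessing (make_examples is a hypothetical stand-in for the user's Python data wrangling); ideally tf.data would drive something like this from C++ and without the serialization overhead:

```python
import multiprocessing as mp


def run_generator(worker_seed, queue):
    # Each worker process has its own GIL and its own seed.
    import random
    random.seed(worker_seed)
    for example in make_examples():  # hypothetical user-supplied generator
        queue.put(example)  # still pays pickling cost, which is what I'd love C++ to avoid
    queue.put(None)  # sentinel: this worker is done


def prefetched_examples(num_workers=4, buffer_size=64):
    queue = mp.Queue(maxsize=buffer_size)
    workers = [
        mp.Process(target=run_generator, args=(seed, queue), daemon=True)
        for seed in range(num_workers)
    ]
    for w in workers:
        w.start()
    finished = 0
    while finished < num_workers:
        example = queue.get()
        if example is None:
            finished += 1
        else:
            yield example
```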

Describe alternatives you've considered
Torch DataLoader could be an option, but it also seems to be a Python-driven solution and therefore not super performant. I tried threading, but pickling and unpickling overhead in Python can be pretty bad. I think maybe C++ could run a background Python process to prefetch Python data more performantly than Python itself can. A rough sketch of the DataLoader route is below.
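
For reference, the DataLoader route would look roughly like this (untested sketch; GeneratorDataset and my_python_generator are hypothetical names):

```python
from torch.utils.data import DataLoader, IterableDataset


class GeneratorDataset(IterableDataset):
    """Hypothetical wrapper that turns a plain-Python generator function into a Dataset."""

    def __init__(self, make_generator):
        self.make_generator = make_generator

    def __iter__(self):
        # Note: with num_workers > 0, every worker runs its own copy of the generator;
        # you'd need torch.utils.data.get_worker_info() sharding to avoid duplicates.
        return self.make_generator()


loader = DataLoader(
    GeneratorDataset(my_python_generator),  # my_python_generator: hypothetical user generator
    batch_size=32,
    num_workers=4,      # moves the Python work into worker processes...
    prefetch_factor=2,  # ...but every example is still pickled back to the main process
)
```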

Additional context
Broader accessibility of custom dataset prefetching could enable new use cases for tf.data, especially in prototyping or infinite search spaces where it might not make sense to convert entire datasets to tensors in advance.

Apologies if I misunderstand the intricacies involved. I just want to prefetch datasets built from generators. I tried doing this last week and it didn't work, so hopefully I'm not raising an issue that's already fixed, or missing an existing way to pull it off. It's hard to provide an example since the code in question is closed-source and quite extensive anyway. For a good example of when this might be handy, consider RL gym envs or datasets that involve making GET requests.

bionicles added the enhancement (New feature or request) label Mar 14, 2024
tomvdw (Collaborator) commented Mar 15, 2024

Do I assume correctly that you're using tfds.data_source to load the data? If so, one option is to use Grain to load your data. IIUC Grain does prefetch the data. If you're doing random access, prefetching is hard because you don't know which record will be loaded next.
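
Something along these lines should work (rough sketch from memory, please check the Grain docs for the exact API):

```python
import grain.python as grain
import tensorflow_datasets as tfds

# Random-access data source; no TensorFlow graph involved in the pipeline itself.
data_source = tfds.data_source("mnist", split="train")

sampler = grain.IndexSampler(
    num_records=len(data_source),
    shuffle=True,
    seed=0,
    shard_options=grain.NoSharding(),
    num_epochs=1,
)

loader = grain.DataLoader(
    data_source=data_source,
    sampler=sampler,
    operations=[grain.Batch(batch_size=32)],
    worker_count=4,  # worker processes do the Python work, so the driver's GIL isn't the bottleneck
)

for batch in loader:
    ...  # feed the accelerator
```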

bionicles (Author) commented

Thank you @tomvdw, I will check that out! Just eyeballing it, one way to make Grain more accessible would be to add usage examples to the README so folks who look at the repo can get the gestalt. I tried clicking the docs link on the GitHub iOS app and it took me to a folder of code, so I'll take a look in there.

I meekly suggest that code examples above the fold are great advertising for any repo. Thank you for sharing!
