[FEA] Improved handling of a large Parquet rowgroup #10761

Open
jlowe opened this issue May 2, 2024 · 0 comments
Labels
feature request (New feature or request) · performance (A performance related task/issue)

Comments

jlowe (Member) commented May 2, 2024

Is your feature request related to a problem? Please describe.
Currently, if a task tries to load a very large rowgroup, whether large in the number of rows, the number of columns, or both, we leverage the sub-rowgroup reader in libcudf to read the rowgroup in batches. However, because the on-GPU state of the sub-rowgroup reader is opaque and not spillable, we must iterate until the entire rowgroup is loaded, making each resulting sub-rowgroup batch spillable as we go, before we can free the reader's on-GPU state and finally proceed with processing the first batch returned from the read.

This works fine in practice when the GPU has enough memory to hold the entire rowgroup without spilling. If it does not, this can perform poorly due to excessive spilling. This case should be handled better.
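
For context, a minimal sketch of the current flow follows. The type and method names (`ChunkedRowGroupReader`, `GpuBatch`, `makeSpillable`) are hypothetical stand-ins for the plugin's and libcudf's actual APIs; the point is only that the reader must be fully drained before its GPU state can be released.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-ins for the real reader and batch types.
trait GpuBatch { def numRows(): Long }
trait ChunkedRowGroupReader extends AutoCloseable {
  def hasNext: Boolean
  def readChunk(): GpuBatch   // returns the next sub-rowgroup batch on the GPU
}

// Current behavior: drain the entire rowgroup so the reader's opaque,
// non-spillable GPU state can be freed, making each batch spillable as we go.
def readRowGroupFully(
    reader: ChunkedRowGroupReader,
    makeSpillable: GpuBatch => GpuBatch): Seq[GpuBatch] = {
  val batches = ArrayBuffer[GpuBatch]()
  try {
    while (reader.hasNext) {
      batches += makeSpillable(reader.readChunk())
    }
  } finally {
    reader.close()   // only now is the reader's GPU state released
  }
  batches.toSeq      // processing of the first batch can finally begin
}
```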

Describe the solution you'd like
When there isn't enough GPU memory, we could load a single batch via the sub-rowgroup reader and then close the reader to free its GPU state. We then send that batch down the stage iterators for processing. When it's time to produce the next input batch, we create a new sub-rowgroup reader instance, this time passing a starting row offset that picks up where the previous batch left off. This allows us to process a subset of the rowgroup at a time without needing to manifest the entire rowgroup's data at once and potentially spilling heavily in the process. The downside, of course, is that we will redundantly transfer and decode some column pages to the GPU, but this may be much faster overall than spilling since it can avoid hitting disks.
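
A rough sketch of that flow, reusing the hypothetical types from the sketch above and assuming some way to open a sub-rowgroup reader at a given row offset (`openReaderAt` below is not a real API):

```scala
// Proposed flow: one batch at a time, re-opening the reader at a row offset
// so the whole rowgroup never has to be resident (or spilled) at once.
// `openReaderAt` is a hypothetical factory taking a starting row offset.
def readRowGroupIncrementally(
    openReaderAt: Long => ChunkedRowGroupReader,
    totalRows: Long): Iterator[GpuBatch] = new Iterator[GpuBatch] {
  private var rowsRead = 0L

  override def hasNext: Boolean = rowsRead < totalRows

  override def next(): GpuBatch = {
    // Re-opening here redundantly transfers and decodes some column pages,
    // but the reader's GPU state is freed before downstream processing runs.
    val reader = openReaderAt(rowsRead)
    try {
      val batch = reader.readChunk()
      rowsRead += batch.numRows()
      batch
    } finally {
      reader.close()
    }
  }
}
```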

Describe alternatives you've considered
Another approach would be to make the sub-rowgroup reader's GPU state spillable, but this would be more involved, requiring changes to the libcudf sub-rowgroup reader.

@jlowe jlowe added feature request New feature or request ? - Needs Triage Need team to review and classify performance A performance related task/issue labels May 2, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label May 2, 2024