Is your feature request related to a problem? Please describe.
Currently, if a task tries to load a very large rowgroup, whether large in number of rows, number of columns, or both, we leverage the sub-rowgroup reader in libcudf to read the rowgroup in batches. However, because the on-GPU state of the sub-rowgroup reader is opaque and not spillable, we must iterate until the rowgroup is fully loaded, making each resulting sub-rowgroup batch spillable as we go, so that we can free the reader's on-GPU state and only then begin processing the first batch returned from the read.
This works fine in practice when the GPU has enough memory to hold the entire rowgroup without spilling. When it does not, performance can suffer due to excessive spilling. This case should be handled better.
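To make the current behavior concrete, here is a minimal sketch of the eager drain-then-process loop described above. The `SubRowgroupReader` class and its method names are illustrative stand-ins, not the real libcudf or spark-rapids API:

```python
class SubRowgroupReader:
    """Stand-in for the libcudf sub-rowgroup reader: yields fixed-size
    batches of row indices from a single rowgroup."""
    def __init__(self, total_rows, batch_rows, skip_rows=0):
        self.pos = skip_rows
        self.total_rows = total_rows
        self.batch_rows = batch_rows

    def has_next(self):
        return self.pos < self.total_rows

    def read_next(self):
        end = min(self.pos + self.batch_rows, self.total_rows)
        batch = list(range(self.pos, end))
        self.pos = end
        return batch


def load_rowgroup_eagerly(total_rows, batch_rows):
    """Drain every batch up front, marking each spillable as it arrives,
    so the reader's opaque GPU state can be freed before any processing
    of the first batch begins."""
    reader = SubRowgroupReader(total_rows, batch_rows)
    spillable_batches = []
    while reader.has_next():
        spillable_batches.append(reader.read_next())  # made spillable here
    # reader dropped here: its on-GPU state is freed, and only now can
    # the first batch be processed
    return spillable_batches
```

When the rowgroup exceeds GPU memory, every one of those spillable batches may actually spill (potentially to disk) before processing starts, which is the poor case described above.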
Describe the solution you'd like
When there isn't enough GPU memory, we could load a single batch via the sub-rowgroup reader and then close the reader to free its GPU state. We then send the batch down the stage iterators for processing. When it's time to produce the next input batch, we create a new sub-rowgroup reader instance, but this time pass a starting row offset matching the row where the previous batch left off. This lets us process a subset of the rowgroup at a time without having to materialize the entire rowgroup at once and potentially spill heavily in the process. The downside, of course, is that we will redundantly transfer and decode some column pages on the GPU, but this may still be much faster overall than spilling, since it can avoid hitting disk.
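The proposed one-batch-per-reader loop can be sketched as follows. Again, `SubRowgroupReader` and its `skip_rows` starting-row-offset parameter are hypothetical stand-ins for whatever the real libcudf interface would expose:

```python
class SubRowgroupReader:
    """Stand-in reader: yields fixed-size batches of row indices from a
    rowgroup, starting at skip_rows."""
    def __init__(self, total_rows, batch_rows, skip_rows=0):
        self.pos = skip_rows
        self.total_rows = total_rows
        self.batch_rows = batch_rows

    def read_next(self):
        end = min(self.pos + self.batch_rows, self.total_rows)
        batch = list(range(self.pos, end))
        self.pos = end
        return batch


def process_rowgroup_incrementally(total_rows, batch_rows, process):
    """One batch per reader instance: read a batch, close the reader to
    free its opaque GPU state, process the batch, then reopen a new
    reader at the next row offset."""
    skip_rows = 0
    while skip_rows < total_rows:
        reader = SubRowgroupReader(total_rows, batch_rows,
                                   skip_rows=skip_rows)
        batch = reader.read_next()
        del reader       # reader closed: GPU state freed before processing
        process(batch)   # batch flows down the stage iterators
        skip_rows += len(batch)
    return skip_rows
```

At no point is more than one batch (plus the transient reader state) resident, which is what avoids the heavy spilling of the eager approach, at the cost of redundantly decoding some column pages on each reopen.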
Describe alternatives you've considered
Another approach would be to make the GPU sub-rowgroup state spillable, but this is more involved, as it requires changes to the libcudf sub-rowgroup reader itself.