Is your feature request related to a problem? Please describe.
Currently, if a task tries to load a very large rowgroup, whether large in number of rows, number of columns, or both, we leverage the sub-rowgroup reader in libcudf to read the rowgroup in batches. However, because the on-GPU state of the sub-rowgroup reader is opaque and not spillable, we must iterate until the rowgroup is fully loaded, making each resulting sub-rowgroup batch spillable as we go, so that we can free the reader's on-GPU state and only then begin processing the first batch returned from the read.
This works fine in practice when the GPU has enough memory to hold the entire rowgroup without spilling. When it does not, performance can suffer due to excessive spilling. This case should be handled better.
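To make the current behavior concrete, here is a minimal sketch of the eager drain-then-process loop described above. The `SubRowgroupReader` class and its method names are illustrative stand-ins, not the real libcudf or spark-rapids API:

```python
class SubRowgroupReader:
    """Stand-in for the libcudf sub-rowgroup reader: yields fixed-size
    batches of row indices from a single rowgroup."""
    def __init__(self, total_rows, batch_rows, skip_rows=0):
        self.pos = skip_rows
        self.total_rows = total_rows
        self.batch_rows = batch_rows

    def has_next(self):
        return self.pos < self.total_rows

    def read_next(self):
        end = min(self.pos + self.batch_rows, self.total_rows)
        batch = list(range(self.pos, end))
        self.pos = end
        return batch


def load_rowgroup_eagerly(total_rows, batch_rows):
    """Drain every batch up front, marking each spillable as it arrives,
    so the reader's opaque GPU state can be freed before any processing
    of the first batch begins."""
    reader = SubRowgroupReader(total_rows, batch_rows)
    spillable_batches = []
    while reader.has_next():
        spillable_batches.append(reader.read_next())  # made spillable here
    # reader dropped here: its on-GPU state is freed, and only now can
    # the first batch be processed
    return spillable_batches
```

When the rowgroup exceeds GPU memory, every one of those spillable batches may actually spill (potentially to disk) before processing starts, which is the poor case described above.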
Describe the solution you'd like
When there isn't enough GPU memory, we could load a single batch via the sub-rowgroup reader and then close the reader to free its GPU state. We then send the batch down the stage iterators for processing. When it's time to produce the next input batch, we create a new sub-rowgroup reader instance, but this time pass a starting row offset matching the row where the previous batch left off. This lets us process a subset of the rowgroup at a time without having to materialize the entire rowgroup at once and potentially spill heavily in the process. The downside, of course, is that we will redundantly transfer and decode some column pages on the GPU, but this may still be much faster overall than spilling, since it can avoid hitting disk.
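The proposed one-batch-per-reader loop can be sketched as follows. Again, `SubRowgroupReader` and its `skip_rows` starting-row-offset parameter are hypothetical stand-ins for whatever the real libcudf interface would expose:

```python
class SubRowgroupReader:
    """Stand-in reader: yields fixed-size batches of row indices from a
    rowgroup, starting at skip_rows."""
    def __init__(self, total_rows, batch_rows, skip_rows=0):
        self.pos = skip_rows
        self.total_rows = total_rows
        self.batch_rows = batch_rows

    def read_next(self):
        end = min(self.pos + self.batch_rows, self.total_rows)
        batch = list(range(self.pos, end))
        self.pos = end
        return batch


def process_rowgroup_incrementally(total_rows, batch_rows, process):
    """One batch per reader instance: read a batch, close the reader to
    free its opaque GPU state, process the batch, then reopen a new
    reader at the next row offset."""
    skip_rows = 0
    while skip_rows < total_rows:
        reader = SubRowgroupReader(total_rows, batch_rows,
                                   skip_rows=skip_rows)
        batch = reader.read_next()
        del reader       # reader closed: GPU state freed before processing
        process(batch)   # batch flows down the stage iterators
        skip_rows += len(batch)
    return skip_rows
```

At no point is more than one batch (plus the transient reader state) resident, which is what avoids the heavy spilling of the eager approach, at the cost of redundantly decoding some column pages on each reopen.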
Describe alternatives you've considered
Another approach would be to make the GPU sub-rowgroup state spillable, but this is more involved, as it requires changes to the libcudf sub-rowgroup reader itself.