[FEA] Improve Performance of GPU shuffle on Celeborn #10790
Comments
Great!
Celeborn works as a normal Spark shuffle manager, so the Plugin always works well with it.
Hi, as a committer from the Celeborn community, I'd like to help if any features are required from Celeborn, and you're always welcome to contribute to Celeborn :)
Really appreciate that @waitinfuture. Will let you know if any action is required from the Celeborn side.
I updated the name of this issue to make it clearer. Our shuffle works with Celeborn, but the goal here is to improve the performance of that shuffle. @firestarman and @winningsix, if we have patches that improve the performance, could you please explain how GPU compression and slicing improve it? In the past we tried to do compression on the GPU as part of shuffle, and the performance was generally worse because of the opportunity cost: the CPU was mostly idle waiting for the GPU to finish, and offloading the shuffle data to the CPU for compression improved performance, especially when we could compress using multiple CPU threads. I really would like to understand how this improves performance and what tests have been run, so we know in which situations this should be enabled and in which it should not.
Thanks for the title update; it is more suitable. The benefit comes from the notable compression ratios (say 3 to 10) seen in customer queries: the higher the compression ratio, the less time is spent on device-to-host transfer. We should introduce a heuristic to choose GPU shuffle based on the data pattern. For the current case, I would suggest starting with the per-batch compression ratio as the deciding factor. For example, if the compression ratio of the first batch handled by the current executor is above 3, the executor stays on GPU shuffle; otherwise it falls back to CPU shuffle compression.
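The threshold rule sketched in that comment could look like the following. This is a hypothetical illustration, not spark-rapids code: the class, method names, and the default threshold of 3 are all assumptions drawn from the example above.

```java
// Hypothetical sketch of the first-batch compression-ratio heuristic.
// Names and the threshold value are illustrative, not from spark-rapids.
public class ShuffleModeHeuristic {
    public enum Mode { GPU_SHUFFLE, CPU_SHUFFLE }

    // Decide the shuffle mode from the first batch's raw and compressed sizes.
    public static Mode chooseMode(long rawBytes, long compressedBytes, double threshold) {
        double ratio = (double) rawBytes / (double) compressedBytes;
        return ratio > threshold ? Mode.GPU_SHUFFLE : Mode.CPU_SHUFFLE;
    }
}
```

For instance, a 1000-byte batch that compresses to 100 bytes (ratio 10) would select GPU shuffle under a threshold of 3, while one that compresses to 900 bytes (ratio ~1.1) would select CPU shuffle.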
Is your feature request related to a problem? Please describe.
To achieve better stability, remote shuffle has become a new technology trend, and Uniffle and Celeborn are the most widely used options on the PRC side. Beginning with Celeborn, we should provide good GPU-accelerated support for the normal shuffle path.
Describe the solution you'd like
On the client side, this shuffle bypasses the sort like the normal shuffle path does, so we can perform partitioning, serialization, and compression per batch directly on the GPU rather than on the host side.
Performance-wise, we want a 20X performance gain (op time) over a single CPU core from the last few generations.
Feature scope wise, we want to:
(1) Move shuffle partitioning, serialization, and compression onto the GPU. The targeted compression codec is ZSTD. (#10841)
(2) Based on (1), it can work seamlessly with the vanilla Celeborn shuffle manager, but this involves one memory copy from native to Java. One alternative is to invoke pushData or mergeData natively to avoid the extra memory copy.
(3) Introduce a heuristic based on compression ratio. The initial state can be either CPU or GPU shuffle. By analyzing the first incoming batches, it calculates the compression ratio; if the compression ratio of the first few batches is above the threshold, it uses the GPU-based approach, otherwise the CPU-based one.
A non-goal is encryption-at-rest support.
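The compression ratio that feeds the heuristic in (3) has to be measured somewhere. A minimal CPU-side sketch is below; it uses `java.util.zip.Deflater` purely as a stand-in for the ZSTD codec targeted in (1), and the class and method names are illustrative assumptions, not part of any proposed API.

```java
import java.util.zip.Deflater;

// Hedged sketch: compute a batch's compression ratio on the CPU, using
// java.util.zip.Deflater as a stand-in for the targeted ZSTD codec.
// The real implementation would compress on the GPU; this only shows
// how the ratio that drives the heuristic could be obtained.
public class BatchCompressionRatio {
    // Compress the batch and return the total compressed byte count.
    public static int compressedSize(byte[] batch) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(batch);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf); // buf is reused; only the count matters
        }
        deflater.end();
        return total;
    }

    // Ratio of raw size to compressed size; higher favors GPU shuffle.
    public static double ratio(byte[] batch) {
        return (double) batch.length / compressedSize(batch);
    }
}
```

A highly repetitive batch (e.g. all zeros) yields a ratio far above the threshold of 3 mentioned earlier, so such data would stay on the GPU shuffle path under the proposed rule.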
Describe alternatives you've considered
Celeborn works seamlessly without moving anything onto the GPU, so a CPU-based implementation is the alternative.