[FEA] Improve Performance of GPU shuffle on Celeborn #10790
Comments
Great!
Celeborn works as a normal Spark shuffle manager, so the Plugin always works well with it.
Hi, as a committer from the Celeborn community, I'd like to help if any features are required from Celeborn, and you're always welcome to contribute to Celeborn :)
Really appreciate that @waitinfuture. Will let you know if any action is required from the Celeborn side.
I updated the name of this issue to make it clearer. Our shuffle works with Celeborn, but the goal here is to improve the performance of that shuffle. @firestarman and @winningsix, if we have patches that improve the performance, could you please explain how GPU compression and slicing improve it? In the past we tried to do compression on the GPU as part of shuffle, and the performance was generally worse because of the opportunity cost: the CPU was mostly idle waiting for the GPU to finish, and offloading the shuffle data to the CPU for compression improved performance, especially when we could compress using multiple CPU threads. I really would like to understand how this improves performance and what tests have been run, so we know in which situations this should be enabled and in which it should not.
Thanks for the title update; it is more suitable. The benefit comes from the notable compression ratios (say 3 to 10) seen in customer queries: the higher the compression ratio, the less time is spent on device-to-host transfer. We should introduce a heuristic to choose GPU shuffle based on the data pattern. For the current case, I would suggest starting with the per-batch compression ratio as the deciding factor. For example, if the compression ratio of the first batch handled by the current executor is above 3, the executor stays on GPU shuffle; otherwise it falls back to CPU shuffle compression.
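The threshold rule sketched in that comment could look like the following. This is a hypothetical illustration, not spark-rapids code: the class, method names, and the default threshold of 3 are all assumptions drawn from the example above.

```java
// Hypothetical sketch of the first-batch compression-ratio heuristic.
// Names and the threshold value are illustrative, not from spark-rapids.
public class ShuffleModeHeuristic {
    public enum Mode { GPU_SHUFFLE, CPU_SHUFFLE }

    // Decide the shuffle mode from the first batch's raw and compressed sizes.
    public static Mode chooseMode(long rawBytes, long compressedBytes, double threshold) {
        double ratio = (double) rawBytes / (double) compressedBytes;
        return ratio > threshold ? Mode.GPU_SHUFFLE : Mode.CPU_SHUFFLE;
    }
}
```

For instance, a 1000-byte batch that compresses to 100 bytes (ratio 10) would select GPU shuffle under a threshold of 3, while one that compresses to 900 bytes (ratio ~1.1) would select CPU shuffle.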
Is your feature request related to a problem? Please describe.
To achieve better stability, remote shuffle has become a new technology trend, and Uniffle and Celeborn are the most widely used options on the PRC side. Beginning with Celeborn, we should provide good GPU-accelerated support for the normal shuffle path.
Describe the solution you'd like
On the client side, this shuffle bypasses the sort like the normal shuffle path does, so we can perform partitioning, serialization, and compression per batch directly on the GPU rather than on the host side.
Performance-wise, we want a 20X performance gain (op time) over a single CPU core from the last few generations.
Feature scope wise, we want to:
(1) Move shuffle partitioning, serialization, and compression onto the GPU. The targeted compression codec is ZSTD. (#10841)
(2) Based on (1), it can work seamlessly with the vanilla Celeborn shuffle manager, but this involves one memory copy from native to Java. One alternative is to invoke pushData or mergeData natively to avoid the extra memory copy.
(3) Introduce a heuristic based on compression ratio. The initial state can be either CPU or GPU shuffle. By analyzing the first incoming batches, it calculates the compression ratio; if the compression ratio of the first few batches is above the threshold, it uses the GPU-based approach, otherwise the CPU-based one.
A non-goal is encryption-at-rest support.
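The compression ratio that feeds the heuristic in (3) has to be measured somewhere. A minimal CPU-side sketch is below; it uses `java.util.zip.Deflater` purely as a stand-in for the ZSTD codec targeted in (1), and the class and method names are illustrative assumptions, not part of any proposed API.

```java
import java.util.zip.Deflater;

// Hedged sketch: compute a batch's compression ratio on the CPU, using
// java.util.zip.Deflater as a stand-in for the targeted ZSTD codec.
// The real implementation would compress on the GPU; this only shows
// how the ratio that drives the heuristic could be obtained.
public class BatchCompressionRatio {
    // Compress the batch and return the total compressed byte count.
    public static int compressedSize(byte[] batch) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(batch);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf); // buf is reused; only the count matters
        }
        deflater.end();
        return total;
    }

    // Ratio of raw size to compressed size; higher favors GPU shuffle.
    public static double ratio(byte[] batch) {
        return (double) batch.length / compressedSize(batch);
    }
}
```

A highly repetitive batch (e.g. all zeros) yields a ratio far above the threshold of 3 mentioned earlier, so such data would stay on the GPU shuffle path under the proposed rule.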
Describe alternatives you've considered
Celeborn works seamlessly without moving anything onto the GPU, so a CPU-based implementation is the alternative.