[FEA] Improve Performance of GPU shuffle on Celeborn #10790

Open
winningsix opened this issue May 10, 2024 · 6 comments
Labels
feature request New feature or request

Comments

@winningsix
Collaborator

winningsix commented May 10, 2024

Is your feature request related to a problem? Please describe.
Remote shuffle is becoming a popular technology for achieving better stability, and Uniffle and Celeborn are the most widely used options on the PRC side. Beginning with Celeborn, we should have good GPU-acceleration support for the normal shuffle path.

Describe the solution you'd like
On the client side, this shuffle path bypasses the sort like the normal shuffle, so we can perform partitioning, serialization, and compression per batch directly on the GPU instead of on the host.
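
For illustration, here is a minimal sketch of that per-batch client-side flow. All of the type and method names below are placeholders for this discussion, not existing spark-rapids or Celeborn APIs:

```scala
// Hypothetical per-batch GPU shuffle write path: partition, serialize and
// compress each batch on the device, then push only the compressed bytes.
// GpuColumnarBatch and its methods are placeholders for illustration.
trait GpuColumnarBatch {
  // Split the batch by output partition entirely on the GPU.
  def partitionOnGpu(numPartitions: Int): Seq[(Int, GpuColumnarBatch)]
  // Serialize and compress on the GPU (e.g. ZSTD), returning host bytes.
  def serializeAndCompressOnGpu(codec: String): Array[Byte]
}

// `push` stands in for handing the compressed bytes to the Celeborn shuffle client.
class GpuCelebornWriter(numPartitions: Int, push: (Int, Array[Byte]) => Unit) {
  def write(batch: GpuColumnarBatch): Unit = {
    batch.partitionOnGpu(numPartitions).foreach { case (partId, slice) =>
      val compressed = slice.serializeAndCompressOnGpu("zstd")
      push(partId, compressed) // only the (smaller) compressed data crosses to the host
    }
  }
}
```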

Performance-wise, we want a 20x performance gain (op time) over a single CPU core from the last few generations.

Feature scope wise, we want to:
(1) Move shuffle partitioning, serialization, and compression onto the GPU. The targeted compression codec is ZSTD. (#10841)
(2) Building on (1), this could work seamlessly with the vanilla Celeborn shuffle manager, but it involves one memory copy from native to Java. One alternative is to call pushData or mergeData natively to avoid the extra memory copy.
(3) Introduce a heuristic based on compression ratio. The initial state could be either CPU or GPU shuffle. By analyzing the first incoming batch(es), it can calculate the compression ratio; if the ratio is above a configured threshold, it uses the GPU-based approach, otherwise the CPU-based one.

A non-goal is encryption-at-rest support.

Describe alternatives you've considered
Celeborn works seamlessly without moving anything onto the GPU, so a CPU-based implementation is the alternative.

@winningsix winningsix added feature request New feature or request ? - Needs Triage Need team to review and classify labels May 10, 2024
@zhanglistar

Great!

@firestarman
Collaborator

firestarman commented May 17, 2024

Celeborn works as a normal Spark shuffle manager, so the Plugin already works well with it.
This issue just tracks whether enabling GPU slicing and compression can deliver better performance when working with Celeborn, or at least better performance for some customer queries.

@waitinfuture

waitinfuture commented May 17, 2024

Hi, as a committer from the Celeborn community, I'd like to help if any features are needed from Celeborn, and you're always welcome to contribute to Celeborn :)

@firestarman
Collaborator

Really appreciate that, @waitinfuture. We will let you know if any action is required on the Celeborn side.

@firestarman firestarman changed the title [FEA] Support GPU shuffle for Celeborn [FEA] Support GPU slicing and serialization for the normal Shuffle path May 20, 2024
@firestarman firestarman changed the title [FEA] Support GPU slicing and serialization for the normal Shuffle path [FEA] Support GPU shuffle for Celeborn May 20, 2024
@revans2 revans2 changed the title [FEA] Support GPU shuffle for Celeborn [FEA] Improve Performance of GPU shuffle on Celeborn May 21, 2024
@revans2
Collaborator

revans2 commented May 21, 2024

I updated the name of this issue to make it clear. Our shuffle works with Celeborn, but the goal here is to improve the performance of that shuffle.

@firestarman and @winningsix, if we have patches that improve the performance, could you please explain how GPU compression and slicing improve it? In the past we tried doing compression on the GPU as part of shuffle, and the performance was generally worse because of the opportunity cost: the CPU was mostly idle waiting for the GPU to finish, and offloading the shuffle data to the CPU for compression improved performance, especially when we could compress using multiple CPU threads. I would really like to understand how this improves performance and what tests have been run, so we know in which situations this should be enabled and in which it should not.

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label May 21, 2024
@winningsix
Collaborator Author

Thanks for the title update; it is more suitable. The benefit comes from the notable compression ratios (say 3 to 10) seen in customer queries: the higher the compression ratio, the less time is spent on device-to-host copies. We should introduce a heuristic that decides whether to use GPU shuffle based on the data pattern. For the current case, I would suggest starting with the per-batch compression ratio as the determining factor. For example, if the compression ratio is > 3 for the first batch handled by the current executor, it stays on GPU shuffle compression; otherwise it falls back to CPU shuffle compression.
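
A minimal sketch of that selection logic, assuming a hypothetical chooser; the mode types, threshold default, and helper names are illustrative only, not existing spark-rapids or Celeborn APIs:

```scala
// Sketch of the proposed compression-ratio heuristic: the decision is made
// once per executor from the first batch it handles. All names are placeholders.
object ShuffleModeChooser {
  sealed trait ShuffleMode
  case object GpuShuffle extends ShuffleMode
  case object CpuShuffle extends ShuffleMode

  // Compression ratio = uncompressed size / compressed size.
  def compressionRatio(uncompressedBytes: Long, compressedBytes: Long): Double =
    uncompressedBytes.toDouble / math.max(compressedBytes, 1L)

  // If the first batch compresses well (ratio above the threshold), stay on
  // GPU shuffle compression; otherwise fall back to CPU shuffle compression.
  def choose(firstBatchRatio: Double, threshold: Double = 3.0): ShuffleMode =
    if (firstBatchRatio > threshold) GpuShuffle else CpuShuffle
}
```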
