How to support rank and dense_rank functions in TopNRowNumber? #9404

JkSelf · 2024-04-07T07:56:04Z

After Gluten was upgraded to Spark 3.5, we found that Spark 3.5 introduced the RankLimit operator, which optimizes the rank, dense_rank, and row_number functions. It keeps only the top N rows within each window partition, so the downstream Window operator only has to compute the function over those rows instead of over all the data. This improves performance and also reduces the risk of out-of-memory (OOM) errors when memory is constrained. We therefore plan to add support for the RankLimit operator in Gluten.

Currently, to implement the RankLimit operator in Gluten, we need to address the following two issues:

1. Velox's TopNRowNumber already implements a similar optimization for row_number, but not yet for rank or dense_rank. What is the reason for this? After reviewing the code, we believe TopNRowNumber can support rank and dense_rank as well: in TopNRowNumber#getOutput, we could create the corresponding WindowFunction based on the function name and have each WindowPartition apply it to produce the final result. Do you think this approach is feasible?
2. As with the Window operator, Spark adds a Sort operator before RankLimit that sorts the data by the partition keys and the order-by keys, so TopNRowNumber does not need to sort the data again. We need an operator similar to StreamingWindow that removes the sorting step from TopNRowNumber.
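
To make the second point concrete, here is a minimal Python sketch of a streaming top-N pass over input that is already sorted by partition key and order key. It is only an illustration of the idea, not Velox or Gluten code: the function name `streaming_top_n`, the `(partition_key, order_key)` row shape, and the `function` parameter are all hypothetical. Because all three functions are non-decreasing within a sorted partition, the pass can stop reading a partition as soon as the value exceeds the limit, with no buffering or re-sorting.

```python
from itertools import groupby
from operator import itemgetter


def streaming_top_n(sorted_rows, limit, function="rank"):
    """Yield (partition_key, order_key, value) for rows whose window value <= limit.

    Assumes `sorted_rows` is an iterable of (partition_key, order_key) pairs
    that is ALREADY sorted by partition key, then by order key (hypothetical
    row shape chosen for this sketch).
    """
    if function not in ("row_number", "rank", "dense_rank"):
        raise ValueError(f"unsupported function: {function}")

    for part, rows in groupby(sorted_rows, key=itemgetter(0)):
        row_number = rank = dense_rank = 0
        prev = object()  # sentinel that compares unequal to every real key
        for _, key in rows:
            row_number += 1
            if key != prev:
                # A new order-key value starts a new peer group.
                rank = row_number
                dense_rank += 1
                prev = key
            value = {
                "row_number": row_number,
                "rank": rank,
                "dense_rank": dense_rank,
            }[function]
            if value > limit:
                # Values never decrease within a sorted partition, so the
                # rest of this partition cannot qualify.
                break
            yield (part, key, value)


rows = [("a", 1), ("a", 1), ("a", 2), ("a", 3),
        ("b", 5), ("b", 6), ("b", 6), ("b", 7)]
top2_rank = list(streaming_top_n(rows, 2, "rank"))
top2_dense = list(streaming_top_n(rows, 2, "dense_rank"))
top2_row_number = list(streaming_top_n(rows, 2, "row_number"))
```

The only per-partition state is three counters and the previous order key, which is what makes a StreamingWindow-style operator attractive when the input is known to be sorted.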

@mbasmanova @aditi-pandit @zhouyuan @ayushi-agarwal @PHILO-HE @rui-mo

mbasmanova (Collaborator) · 2024-04-08T13:03:04Z

@JkSelf At a high level, it makes sense to optimize rank <= N and dense_rank <= N queries. However, there are quite a few details to sort out. Would you create a Google doc to describe the proposed design and implementation in detail?

Specifically, the number of top rows that must be kept is quite different for these 3 functions. row_number <= 3 requires keeping only 3 top rows. However, it is not enough to keep 3 top rows for rank <= 3 or dense_rank <= 3.
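
A small Python sketch can illustrate why the number of rows to keep differs. It is not taken from Velox; the helper names `window_ranks` and `rows_kept` and the sample keys are made up for the example. Rows with equal order keys share the same rank and dense_rank, so a limit of 3 can admit more than 3 rows.

```python
def window_ranks(sorted_keys):
    """Return (key, row_number, rank, dense_rank) for keys sorted ascending."""
    out = []
    rank = dense_rank = 0
    prev = object()  # sentinel that compares unequal to every real key
    for row_number, key in enumerate(sorted_keys, start=1):
        if key != prev:
            rank = row_number   # rank jumps to the current position
            dense_rank += 1     # dense_rank advances by exactly one
            prev = key
        out.append((key, row_number, rank, dense_rank))
    return out


def rows_kept(sorted_keys, limit):
    """Count how many rows satisfy `function <= limit` for each function."""
    ranked = window_ranks(sorted_keys)
    return {
        "row_number": sum(1 for _, rn, _, _ in ranked if rn <= limit),
        "rank": sum(1 for _, _, r, _ in ranked if r <= limit),
        "dense_rank": sum(1 for _, _, _, d in ranked if d <= limit),
    }


keys = [10, 10, 10, 10, 20, 30, 40]
counts = rows_kept(keys, 3)
print(counts)  # {'row_number': 3, 'rank': 4, 'dense_rank': 6}
```

With four tied rows at the top, rank <= 3 keeps all four of them (the next row already has rank 5), while dense_rank <= 3 keeps every row belonging to the three smallest distinct keys.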

> Spark adds a Sort operator before RankLimit to sort the data according to the partition key and order by key.

It seems wasteful to sort all the data in this case.
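
As one possible illustration of avoiding a full sort, the sketch below keeps only the rows that can still satisfy dense_rank <= N while scanning unsorted input for a single partition. This is an assumption-laden toy, not how Velox's TopNRowNumber works: the function name `dense_rank_top_n` is hypothetical, rows are reduced to their order keys, and `limit` is assumed to be at least 1.

```python
import bisect


def dense_rank_top_n(keys, limit):
    """Return [(key, dense_rank)] for rows with dense_rank <= limit.

    `keys` may arrive in any order. Only rows belonging to the `limit`
    smallest distinct keys seen so far are retained. Assumes limit >= 1.
    """
    distinct = []  # sorted list of at most `limit` smallest distinct keys
    groups = {}    # key -> rows (here just the keys) currently retained
    for key in keys:
        if key in groups:
            groups[key].append(key)  # tie with a retained key: always kept
            continue
        if len(distinct) == limit and key > distinct[-1]:
            continue  # larger than every retained key: can never qualify
        bisect.insort(distinct, key)
        groups[key] = [key]
        if len(distinct) > limit:
            # A new smaller key pushed the largest distinct key out.
            evicted = distinct.pop()
            del groups[evicted]
    return [(row, rank)
            for rank, key in enumerate(distinct, start=1)
            for row in groups[key]]


result = dense_rank_top_n([30, 10, 20, 10, 40, 20, 50], limit=2)
print(result)  # [(10, 1), (10, 1), (20, 2), (20, 2)]
```

Memory is bounded by the rows of the N smallest distinct keys rather than by the whole partition, although heavy ties can still make that set large.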

JkSelf Apr 24, 2024
Author

@ayushi-agarwal apache/incubator-gluten#5398 has already been merged. Could you help follow up on this task? Thanks for your help.
