@JkSelf At a high level, it makes sense to optimize rank <= N and dense_rank <= N queries. However, there are quite a few details to sort out. Would you create a Google doc to describe the proposed design and implementation in detail? Specifically, the number of top rows that must be kept is quite different for the 3 functions. row_number <= 3 requires keeping only the top 3 rows. However, keeping the top 3 rows is not enough for rank <= 3 or dense_rank <= 3, because rows with tied sort keys share a rank, so more than 3 rows can qualify.
It seems wasteful to sort all the data in this case.
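To make the difference between the three functions concrete, here is a small Python sketch that counts how many rows of a single partition satisfy `func <= n`. The helper name `rows_to_keep` and the sample keys are made up for illustration; this is not code from Spark, Velox, or Gluten.

```python
def rows_to_keep(sorted_keys, n, func):
    """Count rows in one partition that satisfy `func <= n`.

    `sorted_keys` must already be ordered by the window ORDER BY key.
    `func` is one of "row_number", "rank", "dense_rank".
    """
    kept = 0
    rank = 0
    dense_rank = 0
    prev = object()  # sentinel that compares unequal to every key
    for row_number, key in enumerate(sorted_keys, start=1):
        if key != prev:
            # A new peer group starts: rank jumps to the current row
            # position, dense_rank advances by exactly one.
            rank = row_number
            dense_rank += 1
            prev = key
        value = {"row_number": row_number, "rank": rank, "dense_rank": dense_rank}[func]
        if value > n:
            break  # values never decrease, so no later row can qualify
        kept += 1
    return kept


keys = [10, 10, 10, 10, 20, 30]  # four rows tie on the smallest key
print(rows_to_keep(keys, 3, "row_number"))  # 3
print(rows_to_keep(keys, 3, "rank"))        # 4
print(rows_to_keep(keys, 3, "dense_rank"))  # 6
```

With ties, rank <= 3 has to keep every row of the first peer group, and dense_rank <= 3 has to keep every row of the first 3 distinct keys, so the number of rows to retain is not bounded by N.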
Gluten has been upgraded to Spark 3.5, which introduced the RankLimit operator to optimize the performance of the rank, dense_rank, and row_number functions. The operator extracts only the top N rows within each window partition, so the window operator only needs to compute over the top N rows of each partition instead of processing all the data. This approach not only improves performance but also reduces the risk of out-of-memory (OOM) issues when memory is constrained. Therefore, we plan to introduce support for the RankLimit operator in Gluten as well.
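As a rough illustration of the idea (not Spark's or Gluten's actual implementation), the Python sketch below covers only the row_number case: a limiting step keeps the N smallest rows per partition, and the window step then numbers only those rows. The names `rank_limit` and `row_number_window` are hypothetical, and the sketch buffers each partition in memory, which a real operator would avoid.

```python
import heapq
from collections import defaultdict


def rank_limit(rows, n):
    """Keep the n smallest order keys per partition.

    `rows` is an iterable of (partition_key, order_key) pairs.
    Returns {partition_key: [order keys, ascending]}.
    """
    by_partition = defaultdict(list)
    for part, key in rows:
        by_partition[part].append(key)
    # heapq.nsmallest returns the n smallest items in ascending order.
    return {part: heapq.nsmallest(n, keys) for part, keys in by_partition.items()}


def row_number_window(limited):
    """Assign row_number over the already-limited partitions."""
    out = []
    for part, keys in limited.items():
        for i, key in enumerate(keys, start=1):
            out.append((part, key, i))
    return out


rows = [("a", 5), ("a", 1), ("a", 3), ("a", 9), ("b", 2), ("b", 7)]
limited = rank_limit(rows, 2)
print(row_number_window(limited))
# [('a', 1, 1), ('a', 3, 2), ('b', 2, 1), ('b', 7, 2)]
```

The window step here touches at most N rows per partition, which is where the performance and memory savings described above would come from.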
Currently, to implement the RankLimit operator in Gluten, we need to address the following two issues:
@mbasmanova @aditi-pandit @zhouyuan @ayushi-agarwal @PHILO-HE @rui-mo