[FEA] Optimize count distinct performance optimization with null columns reuse and post expand coalesce #10799

winningsix · 2024-05-13T07:04:19Z

Is your feature request related to a problem? Please describe.
In Spark, a distinct aggregation introduces an Expand operator before the aggregation. Most of the work done in that operator is populating null columns. However, the current implementation shows a significant performance issue (e.g., it takes 9X longer than doing the same work on 16 CPU cores). We need to bring performance up to at least a similar level.
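
To make the cost concrete, here is a minimal pure-Python sketch of what an expand step does for a query with several count distinct columns. It is a simplified model for illustration, not the plugin's or Spark's actual code: the function names, the `gid` column, and the row layout are assumptions of the sketch. Each input row is emitted once per distinct column, and every other distinct column in that copy is filled with null.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def expand_for_count_distinct(
    rows: List[Row], group_col: str, distinct_cols: List[str]
) -> List[Row]:
    """Emit one output row per (input row, distinct column) pair.

    Each output row keeps the grouping key and exactly one distinct
    column; all other distinct columns are set to None (null). The
    hypothetical "gid" column records which projection produced the row.
    """
    out: List[Row] = []
    for row in rows:
        for gid, keep in enumerate(distinct_cols, start=1):
            projected: Row = {group_col: row[group_col], "gid": gid}
            for col in distinct_cols:
                projected[col] = row[col] if col == keep else None
            out.append(projected)
    return out


def count_distinct(
    expanded: List[Row], group_col: str, distinct_cols: List[str]
) -> Dict[Tuple[Any, str], int]:
    """Count distinct non-null values per (group key, column)."""
    seen: Dict[Tuple[Any, str], set] = {}
    for row in expanded:
        col = distinct_cols[row["gid"] - 1]
        value = row[col]
        if value is not None:  # nulls do not contribute to the count
            seen.setdefault((row[group_col], col), set()).add(value)
    return {key: len(values) for key, values in seen.items()}


rows = [
    {"k": "a", "x": 1, "y": 10},
    {"k": "a", "x": 1, "y": 20},
    {"k": "b", "x": 2, "y": 10},
]
expanded = expand_for_count_distinct(rows, "k", ["x", "y"])
null_cells = sum(1 for r in expanded for c in ("x", "y") if r[c] is None)
result = count_distinct(expanded, "k", ["x", "y"])
```

With n distinct columns, every input row becomes n output rows carrying n - 1 null cells each, so the amount of null data grows quickly as more count distinct expressions are added.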

Describe the solution you'd like
#10560 already mentions one approach. In addition, the nsys trace shows that creating many null columns has a significant negative impact on overall performance. @binmahone also suggested a further optimization: introducing a coalesce after the expand to increase the batch size for aggregation.
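
The two ideas can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `NullColumnCache` and `coalesce_batches` are hypothetical names, columns are modeled as tuples rather than device buffers, and a real GPU implementation would also need to handle buffer lifetime and reference counting.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


class NullColumnCache:
    """Hand out one shared all-null column per (dtype, row count).

    Hypothetical helper: instead of building a fresh null column for
    every projection, the same immutable column is reused.
    """

    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, int], Tuple[None, ...]] = {}
        self.allocations = 0  # how many null columns were actually built

    def get(self, dtype: str, num_rows: int) -> Tuple[None, ...]:
        key = (dtype, num_rows)
        if key not in self._cache:
            self.allocations += 1
            self._cache[key] = (None,) * num_rows  # immutable, safe to share
        return self._cache[key]


def coalesce_batches(
    batches: Iterable[List[int]], target_rows: int
) -> Iterator[List[int]]:
    """Concatenate consecutive small batches until target_rows is reached."""
    pending: List[int] = []
    for batch in batches:
        pending.extend(batch)
        if len(pending) >= target_rows:
            yield pending
            pending = []
    if pending:  # flush whatever is left at the end of the input
        yield pending


cache = NullColumnCache()
null_cols = [cache.get("int64", 4) for _ in range(3)]
coalesced = list(coalesce_batches([[1, 2], [3], [4, 5, 6], [7]], target_rows=4))
```

In this sketch three requests for a null column of the same type and length trigger a single allocation, and four small batches are merged into two larger ones before they would reach the aggregation.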

winningsix added the feature request and ? - Needs Triage labels May 13, 2024

winningsix changed the title ~~[FEA] Optimize count distinct performance optimization~~ [FEA] Optimize count distinct performance optimization with null columns reuse and post expand coalesce May 13, 2024

winningsix mentioned this issue May 13, 2024

Optimzing Expand+Aggregate in sqls with many count distinct [WIP] #10798

Open

winningsix assigned binmahone May 13, 2024

winningsix added the performance label May 13, 2024

mattahrens removed the ? - Needs Triage label May 16, 2024

sameerz removed the feature request label May 28, 2024
