[FEA] Optimize count distinct performance optimization with null columns reuse and post expand coalesce #10799

Open
winningsix opened this issue May 13, 2024 · 0 comments
Labels
performance A performance related task/issue

Comments

@winningsix
Collaborator

Is your feature request related to a problem? Please describe.
In Spark, a distinct aggregation introduces an Expand operator before the aggregation, and most of the work done there is populating null columns. In the current implementation, however, this step shows a significant performance issue (e.g., 9X longer than doing the same work on 16 CPU cores). We need to bring performance to at least a similar level.
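
For readers unfamiliar with the Expand step, here is a minimal plain-Python sketch of the idea, assuming a query with two distinct aggregates such as `COUNT(DISTINCT a), COUNT(DISTINCT b)`. The function names and the tuple layout are made up for illustration and are not the plugin's or Spark's actual code. Each input row is emitted once per distinct column, with every other distinct column set to null and a group id appended.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def expand_for_distinct(rows: List[Row], distinct_cols: List[str]) -> List[Tuple]:
    """Emit one projected copy of each row per distinct column.

    Each copy keeps only one distinct column, sets the others to None (null),
    and appends a 1-based group id (gid) identifying the kept column.
    """
    out = []
    for row in rows:
        for gid, keep in enumerate(distinct_cols, start=1):
            projected = tuple(row[c] if c == keep else None for c in distinct_cols)
            out.append(projected + (gid,))
    return out


def count_distinct(expanded: List[Tuple], distinct_cols: List[str]) -> Dict[str, int]:
    """Aggregate the expanded rows: count distinct non-null values per gid."""
    seen = {c: set() for c in distinct_cols}
    for rec in expanded:
        gid = rec[-1]
        value = rec[gid - 1]  # the only column that this copy kept
        if value is not None:
            seen[distinct_cols[gid - 1]].add(value)
    return {c: len(s) for c, s in seen.items()}


rows = [{"a": 1, "b": "x"}, {"a": 1, "b": "y"}, {"a": 2, "b": "x"}]
expanded = expand_for_distinct(rows, ["a", "b"])
result = count_distinct(expanded, ["a", "b"])
```

With N distinct columns the output has N times as many rows as the input, and N - 1 of the columns in every output row are null, which is why null column creation dominates this step.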

Describe the solution you'd like
#10560 already mentions one approach. Besides that, the nsys trace shows that creating many null columns has a significant negative impact on overall performance. @binmahone also suggested another optimization: introducing a coalesce after the expand to increase the batch size for aggregation.
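
A rough sketch of the two ideas (null column reuse and a post-expand coalesce), in plain Python with lists standing in for device columns. Everything here is hypothetical: `NullColumnCache`, `expand_batch`, `coalesce`, and the `target_rows` parameter are illustration-only names, not the plugin's API, and the group id column is left out for brevity. The sketch also assumes the shared null column is treated as read-only.

```python
from typing import Dict, Iterable, Iterator, List, Optional

Column = List[Optional[int]]
Batch = List[Column]  # a batch is a list of equally sized columns


class NullColumnCache:
    """Hand out one shared all-null column per row count instead of a new one each time."""

    def __init__(self) -> None:
        self._cache: Dict[int, Column] = {}
        self.allocations = 0  # how many null columns were actually created

    def get(self, num_rows: int) -> Column:
        col = self._cache.get(num_rows)
        if col is None:
            col = [None] * num_rows
            self._cache[num_rows] = col
            self.allocations += 1
        return col


def expand_batch(batch: Batch, cache: NullColumnCache) -> Iterator[Batch]:
    """One output batch per column: keep that column, reuse a cached null column elsewhere."""
    num_rows = len(batch[0])
    for keep in range(len(batch)):
        nulls = cache.get(num_rows)
        yield [col if i == keep else nulls for i, col in enumerate(batch)]


def concat(batches: List[Batch]) -> Batch:
    """Concatenate batches column by column (this copies the values)."""
    return [[v for b in batches for v in b[i]] for i in range(len(batches[0]))]


def coalesce(batches: Iterable[Batch], target_rows: int) -> Iterator[Batch]:
    """Buffer small batches and emit one larger batch once target_rows is reached."""
    pending: List[Batch] = []
    pending_rows = 0
    for b in batches:
        pending.append(b)
        pending_rows += len(b[0])
        if pending_rows >= target_rows:
            yield concat(pending)
            pending, pending_rows = [], 0
    if pending:
        yield concat(pending)


cache = NullColumnCache()
inputs = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
expanded = [p for b in inputs for p in expand_batch(b, cache)]
coalesced = list(coalesce(expanded, target_rows=4))
```

In this toy run, four expanded batches share a single null column allocation, and the coalesce step hands the aggregation two batches of four rows instead of four batches of two rows.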

@winningsix winningsix added feature request New feature or request ? - Needs Triage Need team to review and classify labels May 13, 2024
@winningsix winningsix changed the title [FEA] Optimize count distinct performance optimization [FEA] Optimize count distinct performance optimization with null columns reuse and post expand coalesce May 13, 2024
@winningsix winningsix added the performance A performance related task/issue label May 13, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label May 16, 2024
@sameerz sameerz removed the feature request New feature or request label May 28, 2024

4 participants