Improve performance of distinct aggregations #21907

Dith3r · 2024-05-10T07:20:23Z

Description

Improve performance of distinct aggregations by adding strategies that are selected based on source properties, for example the number of distinct values (NDV).

The strategy to use for multiple distinct aggregations:
SINGLE_STEP computes distinct aggregations in a single step, without any pre-aggregation.
This strategy performs poorly if the number of distinct grouping keys is small.
MARK_DISTINCT uses MarkDistinct for multiple distinct aggregations,
or for a mix of distinct and non-distinct aggregations.
PRE_AGGREGATE computes distinct aggregations using a combination of aggregation
and pre-aggregation steps.
AUTOMATIC chooses the strategy automatically.

With AUTOMATIC, the single-step strategy is preferred. However, when concurrency is limited by
a small number of distinct grouping keys, an alternative strategy is chosen
based on input data statistics.

Strategy	Duration	Query
MARK_DISTINCT	5353ms	select count(ss_customer_sk), count(distinct ss_ticket_number) from hive.tpcds_sf1000_orc.store_sales;
PRE_AGGREGATE	2468ms	select count(ss_customer_sk), count(distinct ss_ticket_number) from hive.tpcds_sf1000_orc.store_sales;
MARK_DISTINCT	10109ms	select count(ss_quantity), count(distinct ss_item_sk), count(distinct ss_store_sk) from hive.tpcds_sf1000_orc.store_sales;
PRE_AGGREGATE	8253ms	select count(ss_quantity), count(distinct ss_item_sk), count(distinct ss_store_sk) from hive.tpcds_sf1000_orc.store_sales;

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
(X) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

lukasz-stec

lgtm % comments

lukasz-stec · 2024-05-21T08:24:07Z

core/trino-main/src/main/java/io/trino/SystemSessionProperties.java

    {
-        return session.getSystemProperty(MARK_DISTINCT_STRATEGY, MarkDistinctStrategy.class);
+        return session.getSystemProperty(DISTINCT_AGGREGATIONS_STRATEGY, DistinctAggregationsStrategy.class);


you should merge this commit with the commit that re-adds the MARK_DISTINCT_STRATEGY property

lukasz-stec · 2024-05-21T08:27:28Z

core/trino-main/src/main/java/io/trino/sql/planner/PlanOptimizers.java

@@ -683,6 +683,9 @@ public PlanOptimizers(
                                new RemoveRedundantIdentityProjections(),
                                new PushAggregationThroughOuterJoin(),
                                new ReplaceRedundantJoinWithSource(), // Run this after PredicatePushDown optimizer as it inlines filter constants
+                                new DistinctAggregationToGroupBy(plannerContext), // Run this after aggregation pushdown so that multiple distinct aggregations can be pushed into a connector
+                                // It also is run before MultipleDistinctAggregationToMarkDistinct to take precedence f enabled


nit: typo, should be "precedence if enabled"

Extract DistinctAggregationController

41fc800

Extract the logic that determines whether direct distinct aggregation is applicable, so that it can be reused in multiple optimizer rules.

cla-bot bot added the cla-signed label May 10, 2024

Dith3r requested review from lukasz-stec and sopel39 May 10, 2024 07:20

github-actions bot added the docs, hudi, iceberg, delta-lake, and hive labels May 10, 2024

Dith3r force-pushed the ke/dist-aggr branch 4 times, most recently from 0d37869 to abce86f May 10, 2024 10:35

raunaqmorarka requested a review from martint May 10, 2024 11:47

Add DistinctAggregationToGroupBy

6639f5f

The rule replaces `OptimizeMixedDistinctAggregations`, and adds support for multiple distinct aggregations.

Dith3r force-pushed the ke/dist-aggr branch from abce86f to 64f7ae7 May 10, 2024 11:51

lukasz-stec added 2 commits May 10, 2024 14:57

Rename DistinctAggregationsStrategy

4c5111d

Also rename the corresponding config property optimizer.mark-distinct-strategy to optimizer.distinct-aggregations-strategy, and rename its values: NONE -> SINGLE_STEP and ALWAYS -> MARK_DISTINCT

Replace optimizer.optimize-mixed-distinct-aggregations

14cd99e

Replace optimizer.optimize-mixed-distinct-aggregations with a new `pre_aggregate` value for optimizer.distinct-aggregations-strategy
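
The two property changes described in the commits above could be applied to an existing configuration roughly as follows. This is a hypothetical helper, not part of Trino: the class and method names are invented for illustration. Two behaviours are assumptions rather than something stated in this PR: values other than NONE and ALWAYS are passed through unchanged, and an old `optimizer.optimize-mixed-distinct-aggregations=true` is mapped to `PRE_AGGREGATE` and takes precedence over the renamed strategy value.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: a hypothetical migration helper, not Trino code.
final class DistinctStrategyConfigMigrationSketch
{
    static final String OLD_STRATEGY = "optimizer.mark-distinct-strategy";
    static final String OLD_MIXED = "optimizer.optimize-mixed-distinct-aggregations";
    static final String NEW_STRATEGY = "optimizer.distinct-aggregations-strategy";

    private DistinctStrategyConfigMigrationSketch() {}

    // Returns a copy of the properties with the old keys replaced by the new one.
    static Map<String, String> migrate(Map<String, String> properties)
    {
        Map<String, String> result = new LinkedHashMap<>(properties);
        String oldStrategy = result.remove(OLD_STRATEGY);
        String oldMixed = result.remove(OLD_MIXED);

        if (oldStrategy != null) {
            result.put(NEW_STRATEGY, renameValue(oldStrategy));
        }
        // Assumption: an explicitly enabled old flag wins over the renamed strategy value.
        // Boolean.parseBoolean(null) is false, so a missing flag changes nothing.
        if (Boolean.parseBoolean(oldMixed)) {
            result.put(NEW_STRATEGY, "PRE_AGGREGATE");
        }
        return result;
    }

    // Value renames named in the commit message: NONE -> SINGLE_STEP, ALWAYS -> MARK_DISTINCT.
    static String renameValue(String oldValue)
    {
        String upper = oldValue.trim().toUpperCase(Locale.ROOT);
        switch (upper) {
            case "NONE":
                return "SINGLE_STEP";
            case "ALWAYS":
                return "MARK_DISTINCT";
            default:
                return upper; // assumption: any other value keeps its name
        }
    }
}
```

For example, a configuration containing only `optimizer.mark-distinct-strategy=always` would come out as `optimizer.distinct-aggregations-strategy=MARK_DISTINCT`.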

Dith3r force-pushed the ke/dist-aggr branch from 64f7ae7 to a8e0691 May 10, 2024 12:58

Enable DistinctAggregationToGroupBy automatically

1dbce97

Use the estimated NDV of the aggregation source and the number of grouping keys to decide whether the pre-aggregate strategy should be used for a given aggregation
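
A minimal sketch of what such a decision might look like, under stated assumptions. The class name, method name, and threshold below are hypothetical and are not taken from the PR. The sketch assumes the relevant NDV is the estimated number of distinct grouping-key combinations, that unknown statistics fall back to the preferred single-step strategy, and it leaves MARK_DISTINCT out entirely for brevity.

```java
// Illustrative sketch only: names and the threshold are hypothetical, not Trino's implementation.
final class DistinctStrategyChooserSketch
{
    enum Strategy { SINGLE_STEP, PRE_AGGREGATE }

    // Hypothetical cut-off: with fewer distinct groups than this,
    // single-step aggregation is assumed to have too little parallelism.
    static final double MIN_GROUPS_FOR_SINGLE_STEP = 1_000;

    private DistinctStrategyChooserSketch() {}

    static Strategy choose(int groupingKeyCount, double estimatedGroupNdv)
    {
        // A global aggregation (no grouping keys) produces a single group.
        if (groupingKeyCount == 0) {
            return Strategy.PRE_AGGREGATE;
        }
        // Assumption: unknown statistics (NaN) fall back to the preferred single-step strategy.
        if (Double.isNaN(estimatedGroupNdv)) {
            return Strategy.SINGLE_STEP;
        }
        // Few distinct groups: pre-aggregate; many distinct groups: single step.
        return estimatedGroupNdv < MIN_GROUPS_FOR_SINGLE_STEP
                ? Strategy.PRE_AGGREGATE
                : Strategy.SINGLE_STEP;
    }
}
```

Under these assumptions, the benchmark queries in the description, which have no GROUP BY clause, would fall into the first branch and use pre-aggregation.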

Dith3r force-pushed the ke/dist-aggr branch from a8e0691 to 1dbce97 May 13, 2024 10:19

lukasz-stec approved these changes May 21, 2024
