Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HBO support for CTE materialization #22606

Merged
merged 1 commit into from May 17, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
Expand Up @@ -37,6 +37,8 @@

import java.util.Map;

import static com.facebook.presto.SystemSessionProperties.CTE_MATERIALIZATION_STRATEGY;
import static com.facebook.presto.SystemSessionProperties.CTE_PARTITIONING_PROVIDER_CATALOG;
import static com.facebook.presto.SystemSessionProperties.HISTORY_BASED_OPTIMIZATION_PLAN_CANONICALIZATION_STRATEGY;
import static com.facebook.presto.SystemSessionProperties.JOIN_DISTRIBUTION_TYPE;
import static com.facebook.presto.SystemSessionProperties.PARTIAL_AGGREGATION_STRATEGY;
Expand Down Expand Up @@ -258,6 +260,23 @@ public void testPartialAggStatisticsGroupByPartKey()
}
}

@Test
public void testHistoryBasedStatsCalculatorCTE()
{
String sql = "with t1 as (select orderkey, orderstatus from orders where totalprice > 100), t2 as (select orderkey, totalprice from orders where custkey > 100) " +
"select orderstatus, sum(totalprice) from t1 join t2 on t1.orderkey=t2.orderkey group by orderstatus";
Session cteMaterialization = Session.builder(defaultSession())
.setSystemProperty(CTE_MATERIALIZATION_STRATEGY, "ALL")
.setSystemProperty(CTE_PARTITIONING_PROVIDER_CATALOG, "hive")
.build();
// CBO Statistics
assertPlan(cteMaterialization, sql, anyTree(node(ProjectNode.class, anyTree(any())).withOutputRowCount(Double.NaN)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@feilong-liu will this also work if the cte materialization was off in the first execution and then on in the second?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the opposite case, if cte materialization was on for the first run and then off again. I am a little confused and not sure it will work as expected in all these cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, both cases are not supported. History of CTE materialization runs only apply to CTE materialization runs and non cte materialization runs only apply to non cte materialization runs.

The challenge of support history share between CTE and non CTE runs is that, for non CTE materialization runs, the CTE will be inlined and expanded in query plan, which will have different variable name and plan node ids, and may also go through different optimizations which results in different plans. In order to match the plan hash of CTEs between CTE and non CTE runs, we need to carefully design and change the HBO related plan caononicalization strategy and code, which is non-trival.

So I prefer to first enable HBO for history to be used within CTE matrilizations queries. Otherwise HBO will be barely available (as the CTE related node is not canonicalized and will return empty in hash, which will lead to all other operators which has these CTE nodes as child also return empty in hash)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sg, I have fixed the planNodeIds and now they are canonicalized as expected in an earlier commit.
For the different variable names, I believe we can still use the underlying plan statistics available in cteConsumer/producer (original node HBO stats). These statistics will be accessible if the previous query was run in non-CTE materialized mode. However linking with the execution stats is unclear


// HBO Statistics
executeAndTrackHistory(sql, cteMaterialization);
assertPlan(cteMaterialization, sql, anyTree(node(ProjectNode.class, anyTree(any())).withOutputRowCount(3)));
}

@Override
protected void assertPlan(@Language("SQL") String query, PlanMatchPattern pattern)
{
Expand Down
Expand Up @@ -24,6 +24,8 @@
import com.facebook.presto.spi.plan.AggregationNode.Aggregation;
import com.facebook.presto.spi.plan.AggregationNode.GroupingSetDescriptor;
import com.facebook.presto.spi.plan.Assignments;
import com.facebook.presto.spi.plan.CteConsumerNode;
import com.facebook.presto.spi.plan.CteProducerNode;
import com.facebook.presto.spi.plan.DistinctLimitNode;
import com.facebook.presto.spi.plan.EquiJoinClause;
import com.facebook.presto.spi.plan.FilterNode;
Expand Down Expand Up @@ -53,6 +55,7 @@
import com.facebook.presto.sql.planner.plan.JoinNode;
import com.facebook.presto.sql.planner.plan.RowNumberNode;
import com.facebook.presto.sql.planner.plan.SemiJoinNode;
import com.facebook.presto.sql.planner.plan.SequenceNode;
import com.facebook.presto.sql.planner.plan.TableFinishNode;
import com.facebook.presto.sql.planner.plan.TableWriterNode;
import com.facebook.presto.sql.planner.plan.TopNRowNumberNode;
Expand Down Expand Up @@ -830,6 +833,25 @@ public Optional<PlanNode> visitAggregation(AggregationNode node, Context context
return Optional.of(canonicalPlan);
}

@Override
public Optional<PlanNode> visitSequence(SequenceNode node, Context context)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test case where the sequence node is also canonicalized?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The end to end test in testHistoryBasedStatsCalculatorCTE should suffice to make sure that it works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was initially thinking that since the matching happens at the PlanFragment level, the Sequence would have been rewritten anyway (fyi - removing this method does not cause the test testHistoryBasedStatsCalculatorCTE to fail). But leaving it in is probably safe and future-proof

{
node.getCteProducers().forEach(x -> x.accept(this, context));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to visit producers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's to trigger canonicalization of plan nodes under the cte producer, which will also need HBO to optimize the subquery.

return node.getPrimarySource().accept(this, context);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sequence node, just return the canonicalized primary source

}

@Override
public Optional<PlanNode> visitCteProducer(CteProducerNode node, Context context)
{
return node.getSource().accept(this, context);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For cte producer, canonicalize its source

}

@Override
public Optional<PlanNode> visitCteConsumer(CteConsumerNode node, Context context)
{
return node.getOriginalSource().accept(this, context);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Canonicalize original source for cte consumer

}

private Aggregation getCanonicalAggregation(Aggregation aggregation, Map<VariableReferenceExpression, VariableReferenceExpression> context)
{
return new Aggregation(
Expand Down