New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix materialized CTE plan issue #11874
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the quick fix! Would you mind adding the reproduction script as a micro regression test?
If CI passes looks good to me 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
While std::map
is ordered, it is ordered in alphabetical order, not in insertion order. This happens to be correct for the given benchmark because the CTE names are alphabetical (mat_t1, mat_t2
, etc) - but does not hold in general. If we change the names in the query so that they are sorted in descending order, the performance regression happens again:
with
Hmat_t1 as materialized (select * from t1),
Gmat_t2 as materialized (select * from Hmat_t1),
Fmat_t3 as materialized (select * from Gmat_t2 where id not in (
select id % 20 from Gmat_t2
)),
Emat_t4 as materialized (select * from Hmat_t1 where id not in (
select (id % 20) + 20 from Gmat_t2
UNION ALL
select (id % 20) + 40 from Fmat_t3
)),
Dmat_t5 as materialized (select * from Hmat_t1 where id not in (
select (id % 20) + 20 from Gmat_t2
UNION ALL
select (id % 20) + 40 from Fmat_t3
UNION ALL
select (id % 20) + 60 from Emat_t4
)),
Cmat_t6 as materialized (select * from Hmat_t1 where id not in (
select (id % 20) + 20 from Gmat_t2
UNION ALL
select (id % 20) + 40 from Fmat_t3
UNION ALL
select (id % 20) + 60 from Emat_t4
UNION ALL
select (id % 20) + 80 from Dmat_t5
)),
Bmat_t7 as materialized (select * from Hmat_t1 where id not in (
select (id % 20) + 20 from Gmat_t2
UNION ALL
select (id % 20) + 40 from Fmat_t3
UNION ALL
select (id % 20) + 60 from Emat_t4
UNION ALL
select (id % 20) + 80 from Dmat_t5
UNION ALL
select (id % 20) + 80 from Cmat_t6
)),
Amat_t8 as materialized (select * from Hmat_t1 where id not in (
select (id % 20) + 20 from Gmat_t2
UNION ALL
select (id % 20) + 40 from Fmat_t3
UNION ALL
select (id % 20) + 60 from Emat_t4
UNION ALL
select (id % 20) + 80 from Dmat_t5
UNION ALL
select (id % 20) + 80 from Cmat_t6
UNION ALL
select (id % 20) + 80 from Bmat_t7
))
Select * from Hmat_t1 UNION ALL
select * from Gmat_t2 UNION ALL
select * from Fmat_t3 UNION ALL
select * from Emat_t4 UNION ALL
select * from Dmat_t5 UNION ALL
select * from Cmat_t6 UNION ALL
select * from Bmat_t7 UNION ALL
select * from Amat_t8;
In order to keep insertion order we would need to turn the map
into a vector<unique_ptr<CommonTableExpressionInfo>>
, with perhaps a case_insensitive_map_t<string, idx_t>
to do quick name-based lookups.
Makes sense! I will change that accordingly. |
I've added a new field to the CTE map and to the serialization, too, as @Mytherin suggested. Hence I had to run the generate storage version script—which states that the storage version should be incremented. Is this necessary in this case? |
Hm, no actually we cannot do that anymore since that breaks backwards/forwards compatibility for views containing CTEs. Ideally we would keep the on-disk serialization the same. The previous serialization serializes as a list of key/value pairs. We can serialize the current representation in the same way - as long as we make sure to order the CTEs correctly (i.e. according to insertion order, not random/alphabetical order). I think the cleanest way of doing this code-wise is to create a new class that represents the "insertion-order preserving map", e.g. Let me know if you want to pick this up - otherwise we can have a go at this as well. |
This reverts commit 5dcba3d.
This commit adds an insertion order preserving map, while keeping the serialization format of a regular map.
Yeah, I figured that. I had something along the lines of "the storage format is fixed now" in the back of my mind ;-)
I agree—And did just that. Thanks for the code pointers, those were really helpful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This looks great. One more comment:
This commit simplifies usage of the insertion order preserving map massively. Instead of exposing the vector and map directly, only the necessary functions are exposed through an appropriate interface. This also hinders users to accidentally corrupt the map.
I think this PR is ready for another round of reviewing. I've changed the insertion order map as proposed and provided an stl-style interface for it. This ensures that the vector and map stay synchronized, simplifies usage, and automatically does the right thing when used in |
Thanks! LGTM - just had to resolve some merge conflicts |
Merge pull request duckdb/duckdb#11874 from kryonix/cte_regression
This PR fixes an issue introduced in #10878, mentioned in #10878 (comment). The previous PR changed the way materialized CTEs are placed in plans. However, the CTE map was represented by an
case_insensitive_map_t
, which internally usesstd::unordered_map
. This is problematic, because we have to keep the insertion order of CTEs.This PR changes
case_insensitive_map_t
tostd::map
, which preserves the insertion order of CTE map entries.An alternative design could be to create a CTE dependency graph to be used in
Transformer::TransformMaterializedCTE
. However, I don't think that this is necessary at this point.