Fix materialized CTE plan issue #11874

kryonix · 2024-04-30T10:56:09Z

This PR fixes an issue introduced in #10878, mentioned in #10878 (comment). The previous PR changed the way materialized CTEs are placed in plans. However, the CTE map was represented by a case_insensitive_map_t, which internally uses std::unordered_map. This is problematic because we have to keep the insertion order of CTEs.

This PR changes case_insensitive_map_t to std::map, which preserves the insertion order of CTE map entries.

An alternative design could be to create a CTE dependency graph to be used in Transformer::TransformMaterializedCTE. However, I don't think that this is necessary at this point.

Tmonster

Thank you for the quick fix! Would you mind adding the reproduction script as a micro regression test?
If CI passes, this looks good to me 👍

Mytherin

Thanks for the PR!

While std::map is ordered, it is ordered in alphabetical order, not in insertion order. This happens to be correct for the given benchmark because the CTE names are alphabetical (mat_t1, mat_t2, etc) - but does not hold in general. If we change the names in the query so that they are sorted in descending order, the performance regression happens again:

with
Hmat_t1 as materialized (select * from t1),
Gmat_t2 as materialized (select * from Hmat_t1),
Fmat_t3 as materialized (select * from Gmat_t2 where id not in (
						select id % 20 from Gmat_t2
						)),
Emat_t4 as materialized (select * from Hmat_t1 where id not in (
						select (id % 20) + 20 from Gmat_t2
						UNION ALL
						select (id % 20) + 40 from Fmat_t3
						)),
Dmat_t5 as materialized (select * from Hmat_t1 where id not in (
						select (id % 20) + 20 from Gmat_t2
						UNION ALL
						select (id % 20) + 40 from Fmat_t3
						UNION ALL
						select (id % 20) + 60 from Emat_t4
						)),
Cmat_t6 as materialized (select * from Hmat_t1 where id not in (
						select (id % 20) + 20 from Gmat_t2
						UNION ALL
						select (id % 20) + 40 from Fmat_t3
						UNION ALL
						select (id % 20) + 60 from Emat_t4
						UNION ALL
						select (id % 20) + 80 from Dmat_t5
						)),
Bmat_t7 as materialized (select * from Hmat_t1 where id not in (
						select (id % 20) + 20 from Gmat_t2
						UNION ALL
						select (id % 20) + 40 from Fmat_t3
						UNION ALL
						select (id % 20) + 60 from Emat_t4
						UNION ALL
						select (id % 20) + 80 from Dmat_t5
						UNION ALL
						select (id % 20) + 80 from Cmat_t6
						)),
Amat_t8 as materialized (select * from Hmat_t1 where id not in (
						select (id % 20) + 20 from Gmat_t2
						UNION ALL
						select (id % 20) + 40 from Fmat_t3
						UNION ALL
						select (id % 20) + 60 from Emat_t4
						UNION ALL
						select (id % 20) + 80 from Dmat_t5
						UNION ALL
						select (id % 20) + 80 from Cmat_t6
						UNION ALL
						select (id % 20) + 80 from Bmat_t7
						))
select * from Hmat_t1 UNION ALL
select * from Gmat_t2 UNION ALL
select * from Fmat_t3 UNION ALL
select * from Emat_t4 UNION ALL
select * from Dmat_t5 UNION ALL
select * from Cmat_t6 UNION ALL
select * from Bmat_t7 UNION ALL
select * from Amat_t8;
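
The ordering point can be checked in isolation. The following is a minimal standalone sketch, not DuckDB code: it inserts CTE names into a std::map in definition order and returns the order in which the map iterates over them. The helper name IterationOrder is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: insert the names in the given order into a std::map
// and return the order in which the map iterates over them.
std::vector<std::string> IterationOrder(const std::vector<std::string> &names) {
    std::map<std::string, int> cte_map;
    int position = 0;
    for (const auto &name : names) {
        cte_map.emplace(name, position++);
    }
    std::vector<std::string> result;
    for (const auto &entry : cte_map) {
        result.push_back(entry.first);
    }
    // std::map iterates in key order, regardless of the order of insertion.
    assert(std::is_sorted(result.begin(), result.end()));
    return result;
}
```

Calling IterationOrder with the names Hmat_t1, Gmat_t2, Fmat_t3, Emat_t4 (the definition order in the query above) returns them starting with Emat_t4 and ending with Hmat_t1, so a plan built by walking the map would see the CTEs in reverse of their definition order.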

In order to keep insertion order we would need to turn the map into a vector<unique_ptr<CommonTableExpressionInfo>>, with perhaps a case_insensitive_map_t<string, idx_t> to do quick name-based lookups.
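
A rough sketch of that layout, under stated assumptions: CteInfo is a stand-in for CommonTableExpressionInfo, CteList and LowerCase are hypothetical names, and case-insensitivity is approximated by lower-casing ASCII keys instead of using case_insensitive_map_t. It is meant to illustrate the idea, not the code that ended up in the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for CommonTableExpressionInfo; the real class has different members.
struct CteInfo {
    std::string name;
};

// Lower-case a name so that lookups ignore case (ASCII only in this sketch).
std::string LowerCase(std::string name) {
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return name;
}

struct CteList {
    // Entries in definition order; this is what a planner would iterate over.
    std::vector<std::unique_ptr<CteInfo>> entries;
    // Lower-cased name -> position in `entries`, used only for lookups.
    std::unordered_map<std::string, std::size_t> index;

    // Append a CTE; returns false if the name is already present.
    bool Add(const std::string &name) {
        auto key = LowerCase(name);
        if (index.count(key) > 0) {
            return false;
        }
        index.emplace(key, entries.size());
        entries.push_back(std::unique_ptr<CteInfo>(new CteInfo {name}));
        // Invariant: every entry has exactly one index slot.
        assert(index.size() == entries.size());
        return true;
    }

    // Case-insensitive lookup; returns nullptr if the name is unknown.
    CteInfo *Find(const std::string &name) const {
        auto it = index.find(LowerCase(name));
        return it == index.end() ? nullptr : entries[it->second].get();
    }
};
```

The vector alone determines iteration order, so the hash map's unspecified ordering no longer matters. The cost is that both containers must be updated together on every insert.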

kryonix · 2024-04-30T21:18:00Z

Makes sense! I will change that accordingly.

kryonix · 2024-05-01T20:34:28Z

I've added a new field to the CTE map and to the serialization, too, as @Mytherin suggested. Hence I had to run the generate storage version script, which states that the storage version should be incremented. Is this necessary in this case?

Mytherin · 2024-05-02T07:34:08Z

> I've added a new field to the CTE map and to the serialization, too, as @Mytherin suggested. Hence I had to run the generate storage version script, which states that the storage version should be incremented. Is this necessary in this case?

Hm, no actually we cannot do that anymore since that breaks backwards/forwards compatibility for views containing CTEs. Ideally we would keep the on-disk serialization the same. The previous serialization serializes as a list of key/value pairs. We can serialize the current representation in the same way - as long as we make sure to order the CTEs correctly (i.e. according to insertion order, not random/alphabetical order).

I think the cleanest way of doing this code-wise is to create a new class that represents the "insertion-order preserving map", e.g. insertion_order_preserving_map<string, unique_ptr<CommonTableExpressionInfo>>. Internally this can be represented using the vector + unordered_map combo. We can then special case the serialization/deserialization for this so that it is identical to the unordered_map/map.

Let me know if you want to pick this up - otherwise we can have a go at this as well.

kryonix · 2024-05-02T12:15:25Z

> Hm, no actually we cannot do that anymore since that breaks backwards/forwards compatibility for views containing CTEs.

Yeah, I figured that. I had something along the lines of "the storage format is fixed now" in the back of my mind ;-)

> I think the cleanest way of doing this code-wise is to create a new class that represents the "insertion-order preserving map"

I agree, and did just that. Thanks for the code pointers, those were really helpful.

Mytherin

Thanks! This looks great. One more comment:

src/include/duckdb/common/insertion_order_preserving_map.hpp

This commit greatly simplifies usage of the insertion order preserving map. Instead of exposing the vector and map directly, only the necessary functions are exposed through an appropriate interface. This also prevents users from accidentally corrupting the map.

kryonix · 2024-05-08T11:33:49Z

I think this PR is ready for another round of review. I've changed the insertion order map as proposed and provided an STL-style interface for it. This ensures that the vector and map stay synchronized, simplifies usage, and automatically does the right thing when used in for-each loops.
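
To illustrate what such an interface could look like, here is a small self-contained sketch. It is not the class from this PR: OrderedMapSketch and its member functions are hypothetical, keys are plain case-sensitive strings, and only insertion, lookup, and iteration are covered.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch of an insertion-order preserving map with string keys.
// Not DuckDB's actual class: the names and members here are illustrative.
template <class V>
class OrderedMapSketch {
public:
    using entry_t = std::pair<std::string, V>;
    using iterator = typename std::vector<entry_t>::iterator;
    using const_iterator = typename std::vector<entry_t>::const_iterator;

    // Iteration walks the vector, so range-based for loops see insertion order.
    iterator begin() { return entries.begin(); }
    iterator end() { return entries.end(); }
    const_iterator begin() const { return entries.begin(); }
    const_iterator end() const { return entries.end(); }

    std::size_t size() const { return entries.size(); }

    // Name lookup goes through the hash index and returns end() on a miss.
    iterator find(const std::string &key) {
        auto it = index.find(key);
        if (it == index.end()) {
            return entries.end();
        }
        return entries.begin() + static_cast<std::ptrdiff_t>(it->second);
    }

    // Insert a new entry at the back; an existing key is left untouched.
    // Returns true if the entry was inserted.
    bool insert(std::string key, V value) {
        if (index.count(key) > 0) {
            return false;
        }
        index.emplace(key, entries.size());
        entries.emplace_back(std::move(key), std::move(value));
        assert(index.size() == entries.size());
        return true;
    }

private:
    // Both containers are private, so callers cannot update one without the other.
    std::vector<entry_t> entries;
    std::unordered_map<std::string, std::size_t> index;
};
```

With this shape, a loop such as `for (auto &entry : map)` visits entries in the order they were inserted, and the only way to add an entry is through insert, which updates the vector and the index together.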

Mytherin · 2024-05-14T07:32:38Z

Thanks! LGTM - just had to resolve some merge conflicts

Merge pull request duckdb/duckdb#11874 from kryonix/cte_regression

Fix materialized CTE plan issue

b20894d

Tmonster approved these changes Apr 30, 2024

Add stacked materialized CTEs micro benchmark

3fcb18d

duckdb-draftbot marked this pull request as draft April 30, 2024 12:38

kryonix marked this pull request as ready for review April 30, 2024 12:38

Fix generated files

6970976

duckdb-draftbot marked this pull request as draft April 30, 2024 12:43

kryonix marked this pull request as ready for review April 30, 2024 12:43

Mytherin reviewed Apr 30, 2024

Mytherin added the Changes Requested label Apr 30, 2024

Replace CTE map with vector for storage and map for lookups

7d228ee

duckdb-draftbot marked this pull request as draft May 1, 2024 12:41

kryonix added 2 commits May 1, 2024 15:07

Adapt sqlsmith to cte map changes

bfc812f

Run generate storage version

5dcba3d

kryonix marked this pull request as ready for review May 1, 2024 20:31

kryonix added 2 commits May 2, 2024 14:04

Revert "Run generate storage version"

370dfb2

This reverts commit 5dcba3d.

Add InsertionOrderPreservingMap for CTEs

55f3a7e

This commit adds an insertion order preserving map, while keeping the serialization format of a regular map.

duckdb-draftbot marked this pull request as draft May 2, 2024 12:13

Remove unused KEY_TYPE definition

330ab3c

kryonix marked this pull request as ready for review May 2, 2024 12:31

Mytherin reviewed May 2, 2024

src/include/duckdb/common/insertion_order_preserving_map.hpp (outdated, resolved)

duckdb-draftbot marked this pull request as draft May 3, 2024 18:18

Fix implicit conversion signedness change issue

7ef2a1f

kryonix marked this pull request as ready for review May 8, 2024 06:38

Fix tidy-check

ef8b7b6

duckdb-draftbot marked this pull request as draft May 8, 2024 08:02

kryonix marked this pull request as ready for review May 8, 2024 08:02

Fix format check

8e6349b

duckdb-draftbot marked this pull request as draft May 8, 2024 08:09

kryonix marked this pull request as ready for review May 8, 2024 08:09

Merge branch 'main' into cte_regression

fa9f5ca

duckdb-draftbot marked this pull request as draft May 14, 2024 07:32

Mytherin marked this pull request as ready for review May 14, 2024 07:32

Mytherin added Ready To Merge and removed Changes Requested labels May 14, 2024

Fix merge conflicts

3b892a3

duckdb-draftbot marked this pull request as draft May 14, 2024 07:48

Mytherin marked this pull request as ready for review May 14, 2024 07:48

Fix merge conflicts

c62035f

duckdb-draftbot marked this pull request as draft May 14, 2024 08:51

Mytherin marked this pull request as ready for review May 14, 2024 08:52

Mytherin merged commit b1b3c8e into duckdb:main May 14, 2024
41 checks passed

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request May 15, 2024

chore: Update vendored sources to duckdb/duckdb@b1b3c8e

c8c0cb7

Merge pull request duckdb/duckdb#11874 from kryonix/cte_regression
