Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix materialized CTE plan issue #11874

Merged
merged 16 commits into from May 14, 2024
Merged

Fix materialized CTE plan issue #11874

merged 16 commits into from May 14, 2024

Conversation

kryonix
Copy link
Contributor

@kryonix kryonix commented Apr 30, 2024

This PR fixes an issue introduced in #10878, mentioned in #10878 (comment). The previous PR changed the way materialized CTEs are placed in plans. However, the CTE map was represented by an case_insensitive_map_t, which internally uses std::unordered_map. This is problematic, because we have to keep the insertion order of CTEs.

This PR changes case_insensitive_map_t to std::map, which preserves the insertion order of CTE map entries.


An alternative design could be to create a CTE dependency graph to be used in Transformer::TransformMaterializedCTE. However, I don't think that this is necessary at this point.

Copy link
Contributor

@Tmonster Tmonster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the quick fix! Would you mind adding the reproduction script as a micro regression test?
If CI passes looks good to me 👍

@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 30, 2024 12:38
@kryonix kryonix marked this pull request as ready for review April 30, 2024 12:38
@duckdb-draftbot duckdb-draftbot marked this pull request as draft April 30, 2024 12:43
@kryonix kryonix marked this pull request as ready for review April 30, 2024 12:43
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

While std::map is ordered, it is ordered in alphabetical order, not in insertion order. This happens to be correct for the given benchmark because the CTE names are alphabetical (mat_t1, mat_t2, etc) - but does not hold in general. If we change the names in the query so that they are sorted in descending order, the performance regression happens again:

with
Hmat_t1 as materialized (select * from t1),
Gmat_t2 as materialized (select * from Hmat_t1),
Fmat_t3 as materialized (select * from Gmat_t2 where id not in (
                        select id % 20 from Gmat_t2
                        )),
Emat_t4 as materialized (select * from Hmat_t1 where id not in (
						select (id % 20) + 20 from Gmat_t2
						UNION ALL
						select (id % 20) + 40 from Fmat_t3
						)),
Dmat_t5 as materialized (select * from Hmat_t1 where id not in (
						select (id % 20) + 20 from Gmat_t2
						UNION ALL
						select (id % 20) + 40 from Fmat_t3
						UNION ALL
						select (id % 20) + 60 from Emat_t4
						)),
Cmat_t6 as materialized (select * from Hmat_t1 where id not in (
						select (id % 20) + 20 from Gmat_t2
						UNION ALL
						select (id % 20) + 40 from Fmat_t3
						UNION ALL
						select (id % 20) + 60 from Emat_t4
						UNION ALL
						select (id % 20) + 80 from Dmat_t5
						)),
Bmat_t7 as materialized (select * from Hmat_t1 where id not in (
						select (id % 20) + 20 from Gmat_t2
						UNION ALL
						select (id % 20) + 40 from Fmat_t3
						UNION ALL
						select (id % 20) + 60 from Emat_t4
						UNION ALL
						select (id % 20) + 80 from Dmat_t5
						UNION ALL
						select (id % 20) + 80 from Cmat_t6
						)),
Amat_t8 as materialized (select * from Hmat_t1 where id not in (
						select (id % 20) + 20 from Gmat_t2
						UNION ALL
						select (id % 20) + 40 from Fmat_t3
						UNION ALL
						select (id % 20) + 60 from Emat_t4
						UNION ALL
						select (id % 20) + 80 from Dmat_t5
						UNION ALL
						select (id % 20) + 80 from Cmat_t6
						UNION ALL
						select (id % 20) + 80 from Bmat_t7
						))
Select * from Hmat_t1 UNION ALL
select * from Gmat_t2 UNION ALL
select * from Fmat_t3 UNION ALL
select * from Emat_t4 UNION ALL
select * from Dmat_t5 UNION ALL
select * from Cmat_t6 UNION ALL
select * from Bmat_t7 UNION ALL
select * from Amat_t8;

In order to keep insertion order we would need to turn the map into a vector<unique_ptr<CommonTableExpressionInfo>>, with perhaps a case_insensitive_map_t<string, idx_t> to do quick name-based lookups.

@kryonix
Copy link
Contributor Author

kryonix commented Apr 30, 2024

Makes sense! I will change that accordingly.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 1, 2024 12:41
@kryonix kryonix marked this pull request as ready for review May 1, 2024 20:31
@kryonix
Copy link
Contributor Author

kryonix commented May 1, 2024

I've added a new field to the CTE map and to the serialization, too, as @Mytherin suggested. Hence I had to run the generate storage version script—which states that the storage version should be incremented. Is this necessary in this case?

@Mytherin
Copy link
Collaborator

Mytherin commented May 2, 2024

I've added a new field to the CTE map and to the serialization, too, as @Mytherin suggested. Hence I had to run the generate storage version script—which states that the storage version should be incremented. Is this necessary in this case?

Hm, no actually we cannot do that anymore since that breaks backwards/forwards compatibility for views containing CTEs. Ideally we would keep the on-disk serialization the same. The previous serialization serializes as a list of key/value pairs. We can serialize the current representation in the same way - as long as we make sure to order the CTEs correctly (i.e. according to insertion order, not random/alphabetical order).

I think the cleanest way of doing this code-wise is to create a new class that represents the "insertion-order preserving map", e.g. insertion_order_preserving_map<string, unique_ptr<CommonTableExpressionInfo>>. Internally this can be represented using the vector + unordered_map combo. We can then special case the serialization/deserialization for this so that it is identical to the unordered_map/map.

Let me know if you want to pick this up - otherwise we can have a go at this as well.

This commit adds an insertion order preserving map,
while keeping the serialization format of a regular map.
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 2, 2024 12:13
@kryonix
Copy link
Contributor Author

kryonix commented May 2, 2024

Hm, no actually we cannot do that anymore since that breaks backwards/forwards compatibility for views containing CTEs.

Yeah, I figured that. I had something along the lines of "the storage format is fixed now" in the back of my mind ;-)

I think the cleanest way of doing this code-wise is to create a new class that represents the "insertion-order preserving map"

I agree—And did just that. Thanks for the code pointers, those were really helpful.

@kryonix kryonix marked this pull request as ready for review May 2, 2024 12:31
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This looks great. One more comment:

This commit simplifies usage of the insertion order preserving
map massively. Instead of exposing the vector and map directly,
only the necessary functions are exposed through an appropriate
interface. This also hinders users to accidentally corrupt the map.
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 3, 2024 18:18
@kryonix kryonix marked this pull request as ready for review May 8, 2024 06:38
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 8, 2024 08:02
@kryonix kryonix marked this pull request as ready for review May 8, 2024 08:02
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 8, 2024 08:09
@kryonix kryonix marked this pull request as ready for review May 8, 2024 08:09
@kryonix
Copy link
Contributor Author

kryonix commented May 8, 2024

I think this PR is ready for another round of reviewing. I've changed the insertion order map as proposed and provided an stl-style interface for it. This ensures that the vector and map stay synchronized, simplifies usage, and automatically does the right thing when used in for-each loops.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 14, 2024 07:32
@Mytherin Mytherin marked this pull request as ready for review May 14, 2024 07:32
@Mytherin
Copy link
Collaborator

Thanks! LGTM - just had to resolve some merge conflicts

@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 14, 2024 07:48
@Mytherin Mytherin marked this pull request as ready for review May 14, 2024 07:48
@duckdb-draftbot duckdb-draftbot marked this pull request as draft May 14, 2024 08:51
@Mytherin Mytherin marked this pull request as ready for review May 14, 2024 08:52
@Mytherin Mytherin merged commit b1b3c8e into duckdb:main May 14, 2024
41 checks passed
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request May 15, 2024
Merge pull request duckdb/duckdb#11874 from kryonix/cte_regression
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants