Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DYOD] Only use valid UCCs in optimization and limit cacheability of plans optimized using UCCs #2600

Draft
wants to merge 8 commits into
base: j-hellenberg/maintain_data_dependencies
Choose a base branch
from

Conversation

j-hellenberg
Copy link
Contributor

@j-hellenberg j-hellenberg commented Aug 2, 2023

Make sure to only use UCCs in optimization that are guaranteed_to_be_valid.

Also, we want to ensure that Hyrise only caches logical and physical query plans that cannot become invalid, because executing an invalid query plan might produce incorrect results. Query plans could become invalid if they have been optimized using UCCs that are not permanent, e.g., incidental UCCs discovered by the UccDiscoveryPlugin. The same can happen if a Functional Dependency (FD) was used that was derived from a non-permanent UCC. We ensure correct caching by having each optimization rule return whether the resulting Logical Query Plan can be cached.

Note that this PR currently targets our first PR branch, in order to not include the changes from there again here :) Please review the other PR before this one. We will change the PR to target master before merging.

@j-hellenberg j-hellenberg added the FullCI Run all CI tests (slow, but required for merge) label Aug 2, 2023
@j-hellenberg j-hellenberg force-pushed the j-hellenberg/only_use_valid_uccs_in_optimization branch from 8016545 to 1ef774f Compare August 2, 2023 16:16
@SanJSp SanJSp self-requested a review August 4, 2023 16:36
Copy link
Contributor

@SanJSp SanJSp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A looooot of changes, but good job!

Again only minor comments and all in all great changes! Feel free to consider my comments and if they're not relevant, skip them.

For the future: Maybe, when touching so many files, it makes sense to specify an order in which the reviewer can work through the files.

Happy coding!

@@ -6,7 +6,11 @@ namespace hyrise {

FunctionalDependency::FunctionalDependency(ExpressionUnorderedSet init_determinants,
ExpressionUnorderedSet init_dependents)
: determinants(std::move(init_determinants)), dependents(std::move(init_dependents)) {
: FunctionalDependency(std::move(init_determinants), std::move(init_dependents), true) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've recently seen this in somebody else's code and I thought it improves readability. Maybe you find it useful as well?

Suggested change
: FunctionalDependency(std::move(init_determinants), std::move(init_dependents), true) {}
: FunctionalDependency(std::move(init_determinants), std::move(init_dependents), /* permanent=*/ true) {}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wondered why this is being set to true. Maybe it makes sense to explain this in a comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've recently seen this in somebody else's code and I thought it improves readability. Maybe you find it useful as well?

While they improve readability, I personally dislike adding comments like this because they need to be maintained as well and can easily become out of sync. Also, in this case, the constructor that is called is just two lines below that, so one can easily look there to see how the parameter is named.

I also wondered why this is being set to true. Maybe it makes sense to explain this in a comment?

I added a comment for that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(see comment in header)

@@ -26,16 +26,25 @@ namespace hyrise {
*
* Currently, the determinant expressions are required to be non-nullable to be involved in FDs. Combining null values
* and FDs is not trivial. For more reference, see https://arxiv.org/abs/1404.4963.
*
* If the FD may become invalid in the future (because it is not based on a schema constraint, but on the data
* incidentally fulfilling the constraint at the moment), the FD is marked as being not permanent.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is the first mention of the term permanent, it maybe makes sense to explain it and its consequences quickly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment :)

src/lib/optimizer/optimizer.hpp Show resolved Hide resolved
Comment on lines +12 to +16
IsCacheable operator&(IsCacheable lhs, IsCacheable rhs) {
return static_cast<IsCacheable>(static_cast<bool>(lhs) & static_cast<bool>(rhs));
}

IsCacheable& operator&=(IsCacheable& lhs, const IsCacheable rhs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity: why is only IsCacheable parameter passed as reference?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure if I get the question right: lhs is passed by reference, because we actually want to modify this parameter with the operator. We don't want to modify rhs, so this parameter is passed by value.
But I guess, the return type of the function can be IsCacheable instead of IsCacheable&. Good catch!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the explanation regarding the lhs is correct, but returning a reference is just the normal way to implement that operator: returning the result of the assignment again instead of making a copy.

@@ -187,6 +197,8 @@ void DependentGroupByReductionRule::_apply_to_plan_without_subqueries(

return LQPVisitation::VisitInputs;
});

return rule_was_applied_using_non_permanent_ucc ? IsCacheable::No : IsCacheable::Yes;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this aligns with the coding guidelines. Personally, I'd prefer this expression, but maybe the teaching team can leave a comment here. The rule I'm referring to is this one:

Use braced control statements, even for single-line blocks. Moreover, unless the block is empty (e.g., while (!ready) {}), add line breaks. Instead of if (...) x(); (or if (...) { x(); }), write:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this coding guideline applies to ternaries. I think the point of this guideline is to not write something like

if (...)
   foo()

and later accidentally change it to

if (...)
   foo()
   bar()

without adding the necessary brackets, which is interpreted as

if (...) {
   foo()
}
bar()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absolutely right, conditional/ternary operator is perfectly fine.

src/test/lib/sql/sql_pipeline_statement_test.cpp Outdated Show resolved Hide resolved
src/test/lib/sql/sql_pipeline_statement_test.cpp Outdated Show resolved Hide resolved
src/test/lib/sql/sql_pipeline_statement_test.cpp Outdated Show resolved Hide resolved
@j-hellenberg j-hellenberg force-pushed the j-hellenberg/maintain_data_dependencies branch from d214251 to b3b721c Compare August 22, 2023 09:21
j-hellenberg and others added 8 commits August 22, 2023 11:46
…uaranteed to be currently valid

Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
…y be false if they were derived from a table key constraint which can become invalid

Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
…e resulting query plan is cacheable

Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
… we need to be able to call is_permanent on it

Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
…query plan and only cache cacheable query plans

Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
Co-authored-by: kalathen <kalathen@jyu.fi>
Co-authored-by: Margarete <margarete.dippel@student.hpi.de>
@j-hellenberg j-hellenberg force-pushed the j-hellenberg/only_use_valid_uccs_in_optimization branch from a5382c1 to be582b7 Compare August 22, 2023 12:52
Copy link
Member

@dey4ss dey4ss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like your code and most comments are just minor remarks or ideas on design choices (and some code guidelines though). Did not really find functionality issues or bugs, nice work.

DebugAssert(!expressions.empty(), "Invalid input. Set of expressions should not be empty.");
DebugAssert(has_output_expressions(expressions),
"The given expressions are not a subset of the LQP's output expressions.");

const auto& unique_column_combinations = this->unique_column_combinations();
if (unique_column_combinations.empty()) {
return false;
return std::nullopt;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return {};

* sanity checks.
*/
bool has_matching_ucc(const ExpressionUnorderedSet& expressions) const;
std::optional<UniqueColumnCombination> get_matching_ucc(const ExpressionUnorderedSet& expressions) const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guess that's a matter of taste, but we also have find_column_id with comparable semantics, so find_matching_ucc?

@@ -139,7 +139,7 @@ UniqueColumnCombinations AggregateNode::unique_column_combinations() const {

// Make sure that we do not add an already existing or a superset UCC.
if (unique_column_combinations.empty() ||
!contains_matching_unique_column_combination(unique_column_combinations, group_by_columns)) {
!get_matching_unique_column_combination(unique_column_combinations, group_by_columns).has_value()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need has_value() here?

*/
struct FunctionalDependency {
FunctionalDependency(ExpressionUnorderedSet init_determinants, ExpressionUnorderedSet init_dependents);
FunctionalDependency(ExpressionUnorderedSet init_determinants, ExpressionUnorderedSet init_dependents,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pick here, seen it multiple times: Just use a default parameter instead of two constructors

FunctionalDependency(ExpressionUnorderedSet init_determinants, ExpressionUnorderedSet init_dependents,
                     bool permanent = true);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I'm not 100% happy with the naming. An FD, e.g., also stems from an Aggregate (group by columns determine aggregates). So permanent might be misleading. Maybe sth like time_independent? Open for better ideas.


bool operator==(const FunctionalDependency& other) const;
bool operator!=(const FunctionalDependency& other) const;
size_t hash() const;

bool is_permanent() const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any means to create an accessor for a public member?

@@ -446,4 +446,11 @@ TEST_F(BetweenCompositionTest, HandleMultipleEqualExpressions) {
EXPECT_LQP_EQ(result_lqp, expected_lqp);
}

TEST_F(BetweenCompositionTest, CheckCacheability) {
const auto input_lqp = PredicateNode::make(equals_(_a_a, 100), PredicateNode::make(equals_(_a_b, 100), _node_a));
const auto lqp_result = StrategyBaseTest::apply_rule_with_cacheability_check(_rule, input_lqp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guess we don't need StrategyBaseTest:: prefix here?

const auto predicate_node = PredicateNode::make(equals_(lqp_column_(stored_table_node, ColumnID{0}), "zzz"));
predicate_node->set_left_input(stored_table_node);

const auto lqp_result = StrategyBaseTest::apply_rule_with_cacheability_check(_rule, predicate_node);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(see above)

stored_table_node);
const auto lqp_result = apply_rule_with_cacheability_check(rule, lqp);
const auto cacheable = lqp_result.cacheable;
EXPECT_EQ(cacheable, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EXPECT_TRUE

Comment on lines +380 to +382
const auto lqp_result = StrategyBaseTest::apply_rule_with_cacheability_check(rule, input_lqp);
const auto cacheable = lqp_result.cacheable;
EXPECT_EQ(cacheable, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(see above, skip this comment for the next few occurrences)

@@ -274,17 +274,18 @@ bool AbstractLQPNode::is_column_nullable(const ColumnID column_id) const {
return left_input()->is_column_nullable(column_id);
}

bool AbstractLQPNode::has_matching_ucc(const ExpressionUnorderedSet& expressions) const {
std::optional<UniqueColumnCombination> AbstractLQPNode::get_matching_ucc(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never really access the UCCs, right? We only care about their cacheability later, don't we? So maybe return std::pair<bool, IsCacheable> directly?

Comment on lines +110 to +112
if (!table->constraint_guaranteed_to_be_valid(table_key_constraint)) {
continue;
}
Copy link
Member

@dey4ss dey4ss Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You discard a possibly valid UCC in an edge case here: Imagine UCC was valid for CommitID 5. Our transaction runs on a snapshot w/ CommitID 7. Another transaction committed with CommitID 8 and inserted tuples that could possibly invalidate the UCC. However, the UCC is still valid for the snapshot that our transaction sees. We cannot derive this knowledge only using the max begin CID, so we should add a comment here.

@dey4ss
Copy link
Member

dey4ss commented Apr 30, 2024

Put this on hold until end of August. We plan to incorporate this feature, but there is currently no capacity for it. Will work on it when back from the internship.

@dey4ss dey4ss closed this Apr 30, 2024
@dey4ss dey4ss reopened this Apr 30, 2024
@dey4ss dey4ss marked this pull request as draft April 30, 2024 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FullCI Run all CI tests (slow, but required for merge)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants