Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement INTERSECT ALL, EXCEPT ALL set operations for the multi-stage query engine #13151

Merged
merged 2 commits into from
May 16, 2024

Conversation

yashmayya
Copy link
Contributor

  • This is a fix for Multi-stage intersect ignores all modifier #13126 and Multi-stage except ignores all modifier #13127
  • The SetOperator has been modified slightly to use a multiset instead of a regular set so that we can support the INTERSECT ALL and EXCEPT ALL set operations.
  • Alternatively, we could've extended the SetOperator class with a MultisetOperator class with the appropriate abstractions but since the hash multiset overhead isn't too high (essentially just an additional count maintained per element), this option was avoided in favor of the cleaner, uniform implementation.
  • Suggested labels - either feature or bugfix since the INTERSECT ALL, EXCEPT ALL set operations currently erroneously return the same results as INTERSECT, EXCEPT respectively.

Let's take an example to demonstrate the various set operation semantics and their differences. Suppose there's two tables A (with one column 'a') and B (with one column 'b') with the following rows:

A

a
--
1
2
2
3
3
3

B

b
--
2
3
3
4

Then, SELECT * FROM (SELECT a FROM A) INTERSECT (SELECT b FROM B) should return:

2
3

SELECT * FROM (SELECT a FROM A) INTERSECT ALL (SELECT b FROM B) should return:

2
3
3

SELECT * FROM (SELECT a FROM A) EXCEPT (SELECT b FROM B) should return:

1

and, SELECT * FROM (SELECT a FROM A) EXCEPT ALL (SELECT b FROM B) should return:

1
2
3

@codecov-commenter
Copy link

codecov-commenter commented May 14, 2024

Codecov Report

Attention: Patch coverage is 70.37037% with 8 lines in your changes are missing coverage. Please review.

Project coverage is 35.31%. Comparing base (59551e4) to head (02d54b4).
Report is 444 commits behind head on master.

Files Patch % Lines
.../pinot/query/runtime/plan/PhysicalPlanVisitor.java 33.33% 2 Missing and 2 partials ⚠️
...t/query/runtime/operator/IntersectAllOperator.java 71.42% 2 Missing ⚠️
...pinot/query/runtime/operator/MinusAllOperator.java 71.42% 2 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #13151       +/-   ##
=============================================
- Coverage     61.75%   35.31%   -26.44%     
+ Complexity      207        6      -201     
=============================================
  Files          2436     2442        +6     
  Lines        133233   134240     +1007     
  Branches      20636    20788      +152     
=============================================
- Hits          82274    47412    -34862     
- Misses        44911    83351    +38440     
+ Partials       6048     3477     -2571     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 35.27% <70.37%> (-26.43%) ⬇️
java-21 35.20% <70.37%> (-26.42%) ⬇️
skip-bytebuffers-false 35.30% <70.37%> (-26.44%) ⬇️
skip-bytebuffers-true 35.16% <70.37%> (+7.43%) ⬆️
temurin 35.31% <70.37%> (-26.44%) ⬇️
unittests 46.75% <70.37%> (-15.00%) ⬇️
unittests1 46.75% <70.37%> (-0.14%) ⬇️
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yashmayya yashmayya marked this pull request as ready for review May 14, 2024 10:45
*/
public class MinusAllOperator extends SetOperator {
private static final Logger LOGGER = LoggerFactory.getLogger(MinusAllOperator.class);
private static final String EXPLAIN_NAME = "MINUS ALL";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gortiz I just noticed that other operators don't use spaces in their explain name. Should this be MINUS_ALL or EXCEPT_ALL? Also, it doesn't seem like this is even used for the multi-stage query engine's explain plan output? If it's only for the v1 query engine then I guess it isn't too relevant anyway since these set operators can only be used in the intermediate stages of the v2 query engine right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm actually, I guess we don't really need separate operators for these. We could instead just pass a boolean flag indicating whether or not the all modifier is set to the existing operators.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is only used when executing explain implementation plan for. Can you verify if the flag is seen there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm no it's not being used there either in fact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The physical explain plan gets the explain name for the operator from the SetOpNode here (called from here). Do you think it makes sense to modify SetOpNode::explain to include whether or not the ALL modifier is set (something like INTERSECT_ALL / MINUS_ALL)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it makes sense. Anyway, it can be done in another PR

@@ -58,6 +58,7 @@ public String toExplainString() {

@Override
protected boolean handleRowMatched(Object[] row) {
return _rightRowSet.remove(new Record(row));
Record record = new Record(row);
return _rightRowSet.remove(record, _rightRowSet.count(record)) != 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use MultiSet.setCount here instead? That should mean to execute one less lookup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should mean to execute one less lookup

Good catch, thanks! Just adding a quick analysis on the operations in each approach here for posterity.

The older approach (_rightRowSet.remove(record, _rightRowSet.count(record)) != 0) has the following operations:

  1. The explicit call to count results in one lookup on the underlying map.
  2. The call to remove results in another lookup on the underlying map to get the existing count first in order to decide whether to remove the element from the map completely or simply decrement the count.
  3. Remove the element from the underlying map since we're removing the same number of occurrences as the existing count and return the existing count.

The _rightRowSet.setCount(record, 0) != 0 approach has the following operations:

  1. Remove the element from the underlying map since we're setting the count to 0, and return the resultant value (from the remove call itself) which represents the existing count.

So, I think your proposed approach actually saves us two lookups!

Comment on lines +68 to +73
Mockito.when(_leftOperator.nextBlock())
.thenReturn(OperatorTestUtil.block(schema, new Object[]{1}, new Object[]{2}, new Object[]{3}))
.thenReturn(TransferableBlockTestUtils.getEndOfStreamTransferableBlock(0));
Mockito.when(_rightOperator.nextBlock()).thenReturn(
OperatorTestUtil.block(schema, new Object[]{1}, new Object[]{2}, new Object[]{4}))
.thenReturn(TransferableBlockTestUtils.getEndOfStreamTransferableBlock(0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we could create an actual operator that receives a list of blocks and returns that? It would be great to do not use mocks and simplify these tests.

It is not something I think we should do in this PR, but I want just to open that discussion for the future.

Something like:

_leftOperator = new BlockListOperator.builder()
     .andThen(block1)
     .andThen(block2)
     .endWith(eosBlock(stats)) // or errorBlock(errors)

That is not specially better than mocks, but has the advantage of not returning null as soon as we call other method and it would mean there is a single place to apply changes in future refactors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but has the advantage of not returning null as soon as we call other method

We could use a spy or partial mock, but if we're able to use the actual operator without much setup overhead, this definitely makes more sense.

it would mean there is a single place to apply changes in future refactors

I didn't fully follow this part though.

It is not something I think we should do in this PR, but I want just to open that discussion for the future.

I can make this change in a follow-up PR for both these new tests as well as the existing tests for the other operators.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in general against using mocks. In my experience, they are a cheap solution at short term but problematic in the long run. For example, right now we are mocking operators in a lot of different places just to return a well known sequence of blocks. That can be easily done with a new operator instead of using a mock, but in the first time we needed, we just used mocks. Now that kind of mock is being replicated in a lot of different tests. Imagine that tomorrow we add a second method that is sometimes called in operators:

  • In the code that use mocks, we need to iterate over all the tests to verify that we add an answer for that new method. We need to iterate over these tests manually because the compiler won't help us here. We can run the tests and see which ones fail, but in case we have tests that try different mocks in the same method (something that is also an anti-pattern, but we have several cases like that) we would need to run tests several times.
  • In the code that use a test operator, we just need to implement that method once (in that class) and ideally the compiler help us here.

Copy link
Contributor

@gortiz gortiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the setCount change, it LGTM

Copy link
Contributor Author

@yashmayya yashmayya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @gortiz!

*/
public class MinusAllOperator extends SetOperator {
private static final Logger LOGGER = LoggerFactory.getLogger(MinusAllOperator.class);
private static final String EXPLAIN_NAME = "MINUS ALL";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm no it's not being used there either in fact.

@@ -58,6 +58,7 @@ public String toExplainString() {

@Override
protected boolean handleRowMatched(Object[] row) {
return _rightRowSet.remove(new Record(row));
Record record = new Record(row);
return _rightRowSet.remove(record, _rightRowSet.count(record)) != 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should mean to execute one less lookup

Good catch, thanks! Just adding a quick analysis on the operations in each approach here for posterity.

The older approach (_rightRowSet.remove(record, _rightRowSet.count(record)) != 0) has the following operations:

  1. The explicit call to count results in one lookup on the underlying map.
  2. The call to remove results in another lookup on the underlying map to get the existing count first in order to decide whether to remove the element from the map completely or simply decrement the count.
  3. Remove the element from the underlying map since we're removing the same number of occurrences as the existing count and return the existing count.

The _rightRowSet.setCount(record, 0) != 0 approach has the following operations:

  1. Remove the element from the underlying map since we're setting the count to 0, and return the resultant value (from the remove call itself) which represents the existing count.

So, I think your proposed approach actually saves us two lookups!

Comment on lines +68 to +73
Mockito.when(_leftOperator.nextBlock())
.thenReturn(OperatorTestUtil.block(schema, new Object[]{1}, new Object[]{2}, new Object[]{3}))
.thenReturn(TransferableBlockTestUtils.getEndOfStreamTransferableBlock(0));
Mockito.when(_rightOperator.nextBlock()).thenReturn(
OperatorTestUtil.block(schema, new Object[]{1}, new Object[]{2}, new Object[]{4}))
.thenReturn(TransferableBlockTestUtils.getEndOfStreamTransferableBlock(0));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but has the advantage of not returning null as soon as we call other method

We could use a spy or partial mock, but if we're able to use the actual operator without much setup overhead, this definitely makes more sense.

it would mean there is a single place to apply changes in future refactors

I didn't fully follow this part though.

It is not something I think we should do in this PR, but I want just to open that discussion for the future.

I can make this change in a follow-up PR for both these new tests as well as the existing tests for the other operators.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants