Implement INTERSECT ALL, EXCEPT ALL set operations for the multi-stage query engine #13151

yashmayya · 2024-05-14T09:59:10Z

This is a fix for Multi-stage intersect ignores all modifier #13126 and Multi-stage except ignores all modifier #13127
The SetOperator has been modified slightly to use a multiset instead of a regular set so that we can support the INTERSECT ALL and EXCEPT ALL set operations.
Alternatively, we could've extended the SetOperator class with a MultisetOperator class with the appropriate abstractions but since the hash multiset overhead isn't too high (essentially just an additional count maintained per element), this option was avoided in favor of the cleaner, uniform implementation.
Suggested labels - either feature or bugfix since the INTERSECT ALL, EXCEPT ALL set operations currently erroneously return the same results as INTERSECT, EXCEPT respectively.

Let's take an example to demonstrate the various set operation semantics and their differences. Suppose there's two tables A (with one column 'a') and B (with one column 'b') with the following rows:

A

B

b
--
2
3
3
4

Then, SELECT * FROM (SELECT a FROM A) INTERSECT (SELECT b FROM B) should return:

2
3

SELECT * FROM (SELECT a FROM A) INTERSECT ALL (SELECT b FROM B) should return:

2
3
3

SELECT * FROM (SELECT a FROM A) EXCEPT (SELECT b FROM B) should return:

and, SELECT * FROM (SELECT a FROM A) EXCEPT ALL (SELECT b FROM B) should return:

1
2
3

…e query engine

codecov-commenter · 2024-05-14T10:40:43Z

Codecov Report

Attention: Patch coverage is 70.37037% with 8 lines in your changes are missing coverage. Please review.

Project coverage is 35.31%. Comparing base (59551e4) to head (02d54b4).
Report is 444 commits behind head on master.

Files	Patch %	Lines
.../pinot/query/runtime/plan/PhysicalPlanVisitor.java	33.33%	2 Missing and 2 partials ⚠️
...t/query/runtime/operator/IntersectAllOperator.java	71.42%	2 Missing ⚠️
...pinot/query/runtime/operator/MinusAllOperator.java	71.42%	2 Missing ⚠️

Additional details and impacted files

@@              Coverage Diff              @@
##             master   #13151       +/-   ##
=============================================
- Coverage     61.75%   35.31%   -26.44%     
+ Complexity      207        6      -201     
=============================================
  Files          2436     2442        +6     
  Lines        133233   134240     +1007     
  Branches      20636    20788      +152     
=============================================
- Hits          82274    47412    -34862     
- Misses        44911    83351    +38440     
+ Partials       6048     3477     -2571

Flag	Coverage Δ
custom-integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration	`<0.01% <0.00%> (-0.01%)`	⬇️
integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration2	`0.00% <0.00%> (ø)`
java-11	`35.27% <70.37%> (-26.43%)`	⬇️
java-21	`35.20% <70.37%> (-26.42%)`	⬇️
skip-bytebuffers-false	`35.30% <70.37%> (-26.44%)`	⬇️
skip-bytebuffers-true	`35.16% <70.37%> (+7.43%)`	⬆️
temurin	`35.31% <70.37%> (-26.44%)`	⬇️
unittests	`46.75% <70.37%> (-15.00%)`	⬇️
unittests1	`46.75% <70.37%> (-0.14%)`	⬇️
unittests2	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

yashmayya · 2024-05-14T10:50:10Z

pinot-query-runtime/src/main/java/org/apache/pinot/query/runtime/operator/MinusAllOperator.java

+ */
+public class MinusAllOperator extends SetOperator {
+  private static final Logger LOGGER = LoggerFactory.getLogger(MinusAllOperator.class);
+  private static final String EXPLAIN_NAME = "MINUS ALL";


@gortiz I just noticed that other operators don't use spaces in their explain name. Should this be MINUS_ALL or EXCEPT_ALL? Also, it doesn't seem like this is even used for the multi-stage query engine's explain plan output? If it's only for the v1 query engine then I guess it isn't too relevant anyway since these set operators can only be used in the intermediate stages of the v2 query engine right?

Hm actually, I guess we don't really need separate operators for these. We could instead just pass a boolean flag indicating whether or not the all modifier is set to the existing operators.

I think this is only used when executing explain implementation plan for. Can you verify if the flag is seen there?

Hm no it's not being used there either in fact.

The physical explain plan gets the explain name for the operator from the SetOpNode here (called from here). Do you think it makes sense to modify SetOpNode::explain to include whether or not the ALL modifier is set (something like INTERSECT_ALL / MINUS_ALL)?

Yes, I think it makes sense. Anyway, it can be done in another PR

gortiz · 2024-05-14T17:02:25Z

...t-query-runtime/src/main/java/org/apache/pinot/query/runtime/operator/IntersectOperator.java

@@ -58,6 +58,7 @@ public String toExplainString() {

  @Override
  protected boolean handleRowMatched(Object[] row) {
-    return _rightRowSet.remove(new Record(row));
+    Record record = new Record(row);
+    return _rightRowSet.remove(record, _rightRowSet.count(record)) != 0;


Can we use MultiSet.setCount here instead? That should mean to execute one less lookup.

That should mean to execute one less lookup

Good catch, thanks! Just adding a quick analysis on the operations in each approach here for posterity.

The older approach (_rightRowSet.remove(record, _rightRowSet.count(record)) != 0) has the following operations:

The explicit call to count results in one lookup on the underlying map.

The call to remove results in another lookup on the underlying map to get the existing count first in order to decide whether to remove the element from the map completely or simply decrement the count.

Remove the element from the underlying map since we're removing the same number of occurrences as the existing count and return the existing count.

The _rightRowSet.setCount(record, 0) != 0 approach has the following operations:

Remove the element from the underlying map since we're setting the count to 0, and return the resultant value (from the remove call itself) which represents the existing count.

So, I think your proposed approach actually saves us two lookups!

gortiz · 2024-05-14T17:11:41Z

...-runtime/src/test/java/org/apache/pinot/query/runtime/operator/IntersectAllOperatorTest.java

+    Mockito.when(_leftOperator.nextBlock())
+        .thenReturn(OperatorTestUtil.block(schema, new Object[]{1}, new Object[]{2}, new Object[]{3}))
+        .thenReturn(TransferableBlockTestUtils.getEndOfStreamTransferableBlock(0));
+    Mockito.when(_rightOperator.nextBlock()).thenReturn(
+            OperatorTestUtil.block(schema, new Object[]{1}, new Object[]{2}, new Object[]{4}))
+        .thenReturn(TransferableBlockTestUtils.getEndOfStreamTransferableBlock(0));


Do you think we could create an actual operator that receives a list of blocks and returns that? It would be great to do not use mocks and simplify these tests.

It is not something I think we should do in this PR, but I want just to open that discussion for the future.

Something like:

_leftOperator = new BlockListOperator.builder() .andThen(block1) .andThen(block2) .endWith(eosBlock(stats)) // or errorBlock(errors)

That is not specially better than mocks, but has the advantage of not returning null as soon as we call other method and it would mean there is a single place to apply changes in future refactors.

but has the advantage of not returning null as soon as we call other method

We could use a spy or partial mock, but if we're able to use the actual operator without much setup overhead, this definitely makes more sense.

it would mean there is a single place to apply changes in future refactors

I didn't fully follow this part though.

It is not something I think we should do in this PR, but I want just to open that discussion for the future.

I can make this change in a follow-up PR for both these new tests as well as the existing tests for the other operators.

I'm in general against using mocks. In my experience, they are a cheap solution at short term but problematic in the long run. For example, right now we are mocking operators in a lot of different places just to return a well known sequence of blocks. That can be easily done with a new operator instead of using a mock, but in the first time we needed, we just used mocks. Now that kind of mock is being replicated in a lot of different tests. Imagine that tomorrow we add a second method that is sometimes called in operators:

In the code that use mocks, we need to iterate over all the tests to verify that we add an answer for that new method. We need to iterate over these tests manually because the compiler won't help us here. We can run the tests and see which ones fail, but in case we have tests that try different mocks in the same method (something that is also an anti-pattern, but we have several cases like that) we would need to run tests several times.

In the code that use a test operator, we just need to implement that method once (in that class) and ideally the compiler help us here.

gortiz

Apart from the setCount change, it LGTM

…tor; update explain name for new operators

yashmayya

Thanks for the review @gortiz!

yashmayya · 2024-05-15T04:42:20Z

pinot-query-runtime/src/main/java/org/apache/pinot/query/runtime/operator/MinusAllOperator.java

+ */
+public class MinusAllOperator extends SetOperator {
+  private static final Logger LOGGER = LoggerFactory.getLogger(MinusAllOperator.class);
+  private static final String EXPLAIN_NAME = "MINUS ALL";


Hm no it's not being used there either in fact.

yashmayya · 2024-05-15T04:47:06Z

...t-query-runtime/src/main/java/org/apache/pinot/query/runtime/operator/IntersectOperator.java

@@ -58,6 +58,7 @@ public String toExplainString() {

  @Override
  protected boolean handleRowMatched(Object[] row) {
-    return _rightRowSet.remove(new Record(row));
+    Record record = new Record(row);
+    return _rightRowSet.remove(record, _rightRowSet.count(record)) != 0;


That should mean to execute one less lookup

Good catch, thanks! Just adding a quick analysis on the operations in each approach here for posterity.

The older approach (_rightRowSet.remove(record, _rightRowSet.count(record)) != 0) has the following operations:

The explicit call to count results in one lookup on the underlying map.

The call to remove results in another lookup on the underlying map to get the existing count first in order to decide whether to remove the element from the map completely or simply decrement the count.

Remove the element from the underlying map since we're removing the same number of occurrences as the existing count and return the existing count.

The _rightRowSet.setCount(record, 0) != 0 approach has the following operations:

Remove the element from the underlying map since we're setting the count to 0, and return the resultant value (from the remove call itself) which represents the existing count.

So, I think your proposed approach actually saves us two lookups!

yashmayya · 2024-05-15T05:06:10Z

...-runtime/src/test/java/org/apache/pinot/query/runtime/operator/IntersectAllOperatorTest.java

+    Mockito.when(_leftOperator.nextBlock())
+        .thenReturn(OperatorTestUtil.block(schema, new Object[]{1}, new Object[]{2}, new Object[]{3}))
+        .thenReturn(TransferableBlockTestUtils.getEndOfStreamTransferableBlock(0));
+    Mockito.when(_rightOperator.nextBlock()).thenReturn(
+            OperatorTestUtil.block(schema, new Object[]{1}, new Object[]{2}, new Object[]{4}))
+        .thenReturn(TransferableBlockTestUtils.getEndOfStreamTransferableBlock(0));


but has the advantage of not returning null as soon as we call other method

We could use a spy or partial mock, but if we're able to use the actual operator without much setup overhead, this definitely makes more sense.

it would mean there is a single place to apply changes in future refactors

I didn't fully follow this part though.

It is not something I think we should do in this PR, but I want just to open that discussion for the future.

I can make this change in a follow-up PR for both these new tests as well as the existing tests for the other operators.

Implement INTERSECT ALL, EXCEPT ALL set operations for the multi-stag…

2feb5ad

…e query engine

yashmayya force-pushed the intersect-except-all branch from faa01ff to 2feb5ad Compare May 14, 2024 10:02

yashmayya marked this pull request as ready for review May 14, 2024 10:45

yashmayya commented May 14, 2024

View reviewed changes

gortiz reviewed May 14, 2024

View reviewed changes

Use Multiset::setCount to reduce number of lookups for IntersectOpera…

02d54b4

…tor; update explain name for new operators

yashmayya commented May 15, 2024

View reviewed changes

gortiz approved these changes May 16, 2024

View reviewed changes

gortiz merged commit 892ab7f into apache:master May 16, 2024
20 checks passed

yashmayya mentioned this pull request May 16, 2024

Include ALL modifier for set operators in multi-stage query engine physical explain plan #13166

Merged

This was referenced May 16, 2024

Multi-stage intersect ignores all modifier #13126

Closed

Multi-stage except ignores all modifier #13127

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement INTERSECT ALL, EXCEPT ALL set operations for the multi-stage query engine #13151

Implement INTERSECT ALL, EXCEPT ALL set operations for the multi-stage query engine #13151

yashmayya commented May 14, 2024

codecov-commenter commented May 14, 2024 •

edited

yashmayya May 14, 2024

yashmayya May 14, 2024

gortiz May 14, 2024

yashmayya May 15, 2024

yashmayya May 15, 2024

gortiz May 16, 2024

gortiz May 14, 2024

yashmayya May 15, 2024

gortiz May 14, 2024

yashmayya May 15, 2024

gortiz May 16, 2024

gortiz left a comment

yashmayya left a comment

yashmayya May 15, 2024

yashmayya May 15, 2024

yashmayya May 15, 2024

Implement INTERSECT ALL, EXCEPT ALL set operations for the multi-stage query engine #13151

Implement INTERSECT ALL, EXCEPT ALL set operations for the multi-stage query engine #13151

Conversation

yashmayya commented May 14, 2024

A

B

codecov-commenter commented May 14, 2024 • edited

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gortiz left a comment

Choose a reason for hiding this comment

yashmayya left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-commenter commented May 14, 2024 •

edited