-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement INTERSECT ALL, EXCEPT ALL set operations for the multi-stage query engine #13151
Conversation
faa01ff
to
2feb5ad
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #13151 +/- ##
=============================================
- Coverage 61.75% 35.31% -26.44%
+ Complexity 207 6 -201
=============================================
Files 2436 2442 +6
Lines 133233 134240 +1007
Branches 20636 20788 +152
=============================================
- Hits 82274 47412 -34862
- Misses 44911 83351 +38440
+ Partials 6048 3477 -2571
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
*/ | ||
public class MinusAllOperator extends SetOperator { | ||
private static final Logger LOGGER = LoggerFactory.getLogger(MinusAllOperator.class); | ||
private static final String EXPLAIN_NAME = "MINUS ALL"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gortiz I just noticed that other operators don't use spaces in their explain name. Should this be MINUS_ALL
or EXCEPT_ALL
? Also, it doesn't seem like this is even used for the multi-stage query engine's explain plan output? If it's only for the v1 query engine then I guess it isn't too relevant anyway since these set operators can only be used in the intermediate stages of the v2 query engine right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm actually, I guess we don't really need separate operators for these. We could instead just pass a boolean flag indicating whether or not the all
modifier is set to the existing operators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is only used when executing explain implementation plan for
. Can you verify if the flag is seen there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm no it's not being used there either in fact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it makes sense. Anyway, it can be done in another PR
@@ -58,6 +58,7 @@ public String toExplainString() { | |||
|
|||
@Override | |||
protected boolean handleRowMatched(Object[] row) { | |||
return _rightRowSet.remove(new Record(row)); | |||
Record record = new Record(row); | |||
return _rightRowSet.remove(record, _rightRowSet.count(record)) != 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use MultiSet.setCount
here instead? That should mean to execute one less lookup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should mean to execute one less lookup
Good catch, thanks! Just adding a quick analysis on the operations in each approach here for posterity.
The older approach (_rightRowSet.remove(record, _rightRowSet.count(record)) != 0
) has the following operations:
- The explicit call to
count
results in one lookup on the underlying map. - The call to
remove
results in another lookup on the underlying map to get the existing count first in order to decide whether to remove the element from the map completely or simply decrement the count. - Remove the element from the underlying map since we're removing the same number of occurrences as the existing count and return the existing count.
The _rightRowSet.setCount(record, 0) != 0
approach has the following operations:
- Remove the element from the underlying map since we're setting the count to 0, and return the resultant value (from the remove call itself) which represents the existing count.
So, I think your proposed approach actually saves us two lookups!
Mockito.when(_leftOperator.nextBlock()) | ||
.thenReturn(OperatorTestUtil.block(schema, new Object[]{1}, new Object[]{2}, new Object[]{3})) | ||
.thenReturn(TransferableBlockTestUtils.getEndOfStreamTransferableBlock(0)); | ||
Mockito.when(_rightOperator.nextBlock()).thenReturn( | ||
OperatorTestUtil.block(schema, new Object[]{1}, new Object[]{2}, new Object[]{4})) | ||
.thenReturn(TransferableBlockTestUtils.getEndOfStreamTransferableBlock(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we could create an actual operator that receives a list of blocks and returns that? It would be great to do not use mocks and simplify these tests.
It is not something I think we should do in this PR, but I want just to open that discussion for the future.
Something like:
_leftOperator = new BlockListOperator.builder()
.andThen(block1)
.andThen(block2)
.endWith(eosBlock(stats)) // or errorBlock(errors)
That is not specially better than mocks, but has the advantage of not returning null as soon as we call other method and it would mean there is a single place to apply changes in future refactors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but has the advantage of not returning null as soon as we call other method
We could use a spy or partial mock, but if we're able to use the actual operator without much setup overhead, this definitely makes more sense.
it would mean there is a single place to apply changes in future refactors
I didn't fully follow this part though.
It is not something I think we should do in this PR, but I want just to open that discussion for the future.
I can make this change in a follow-up PR for both these new tests as well as the existing tests for the other operators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in general against using mocks. In my experience, they are a cheap solution at short term but problematic in the long run. For example, right now we are mocking operators in a lot of different places just to return a well known sequence of blocks. That can be easily done with a new operator instead of using a mock, but in the first time we needed, we just used mocks. Now that kind of mock is being replicated in a lot of different tests. Imagine that tomorrow we add a second method that is sometimes called in operators:
- In the code that use mocks, we need to iterate over all the tests to verify that we add an answer for that new method. We need to iterate over these tests manually because the compiler won't help us here. We can run the tests and see which ones fail, but in case we have tests that try different mocks in the same method (something that is also an anti-pattern, but we have several cases like that) we would need to run tests several times.
- In the code that use a test operator, we just need to implement that method once (in that class) and ideally the compiler help us here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the setCount
change, it LGTM
…tor; update explain name for new operators
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @gortiz!
*/ | ||
public class MinusAllOperator extends SetOperator { | ||
private static final Logger LOGGER = LoggerFactory.getLogger(MinusAllOperator.class); | ||
private static final String EXPLAIN_NAME = "MINUS ALL"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm no it's not being used there either in fact.
@@ -58,6 +58,7 @@ public String toExplainString() { | |||
|
|||
@Override | |||
protected boolean handleRowMatched(Object[] row) { | |||
return _rightRowSet.remove(new Record(row)); | |||
Record record = new Record(row); | |||
return _rightRowSet.remove(record, _rightRowSet.count(record)) != 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should mean to execute one less lookup
Good catch, thanks! Just adding a quick analysis on the operations in each approach here for posterity.
The older approach (_rightRowSet.remove(record, _rightRowSet.count(record)) != 0
) has the following operations:
- The explicit call to
count
results in one lookup on the underlying map. - The call to
remove
results in another lookup on the underlying map to get the existing count first in order to decide whether to remove the element from the map completely or simply decrement the count. - Remove the element from the underlying map since we're removing the same number of occurrences as the existing count and return the existing count.
The _rightRowSet.setCount(record, 0) != 0
approach has the following operations:
- Remove the element from the underlying map since we're setting the count to 0, and return the resultant value (from the remove call itself) which represents the existing count.
So, I think your proposed approach actually saves us two lookups!
Mockito.when(_leftOperator.nextBlock()) | ||
.thenReturn(OperatorTestUtil.block(schema, new Object[]{1}, new Object[]{2}, new Object[]{3})) | ||
.thenReturn(TransferableBlockTestUtils.getEndOfStreamTransferableBlock(0)); | ||
Mockito.when(_rightOperator.nextBlock()).thenReturn( | ||
OperatorTestUtil.block(schema, new Object[]{1}, new Object[]{2}, new Object[]{4})) | ||
.thenReturn(TransferableBlockTestUtils.getEndOfStreamTransferableBlock(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but has the advantage of not returning null as soon as we call other method
We could use a spy or partial mock, but if we're able to use the actual operator without much setup overhead, this definitely makes more sense.
it would mean there is a single place to apply changes in future refactors
I didn't fully follow this part though.
It is not something I think we should do in this PR, but I want just to open that discussion for the future.
I can make this change in a follow-up PR for both these new tests as well as the existing tests for the other operators.
all
modifier #13126 and Multi-stage except ignoresall
modifier #13127SetOperator
has been modified slightly to use a multiset instead of a regular set so that we can support theINTERSECT ALL
andEXCEPT ALL
set operations.SetOperator
class with aMultisetOperator
class with the appropriate abstractions but since the hash multiset overhead isn't too high (essentially just an additional count maintained per element), this option was avoided in favor of the cleaner, uniform implementation.feature
orbugfix
since theINTERSECT ALL
,EXCEPT ALL
set operations currently erroneously return the same results asINTERSECT
,EXCEPT
respectively.Let's take an example to demonstrate the various set operation semantics and their differences. Suppose there's two tables
A
(with one column'a'
) andB
(with one column'b'
) with the following rows:A
B
Then,
SELECT * FROM (SELECT a FROM A) INTERSECT (SELECT b FROM B)
should return:SELECT * FROM (SELECT a FROM A) INTERSECT ALL (SELECT b FROM B)
should return:SELECT * FROM (SELECT a FROM A) EXCEPT (SELECT b FROM B)
should return:and,
SELECT * FROM (SELECT a FROM A) EXCEPT ALL (SELECT b FROM B)
should return: