New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Native] - Query with correlated subquery in the exists subclause in where condition is returning incorrect results #22585
Comments
Tried to run the query manually in single coordinator + multiple workers environment but it didn't reproduce there but the problem reproduces in the test environment every time. |
I am able to repro the isuse locally:
|
@karteekmurthys @mknegi @aditi-pandit seems like we should create an issue in Velox? |
@tdcmeehan I agree with that. I need to come up with a minimal repro so I can open this issue in Velox. |
@mknegi : These are 2 separate observations. Karteek is saying multi-drivers per node. Whereas your case has multiple workers in the cluster. Can you try multiple workers with one/multiple drivers per task ? If the issue is only in the multiple worker/multiple driver case, then we have a cleaner isolation. |
Query with correlated subquery in the exists subclause in where condition is returning incorrect results. Here is the testcase that fails due to this problem:
Expected Behavior
java.lang.AssertionError: For query:
SELECT count(*) FROM orders o WHERE EXISTS(SELECT 1 FROM orders i WHERE o.orderkey < i.orderkey AND i.orderkey % 1000 = 0)
not equal
Actual rows (1 of 1 extra rows shown, 1 rows in total):
[31762]
Expected rows (1 of 1 missing rows shown, 1 rows in total):
[14999]
If you see how the query in this scenario is constructed using the correlated subquery it should have only reported count(*) as 14999 because the orders table only have 15000 records in total. So, this query reporting so many extra rows shows that the semantics of correlated subquery aren’t somehow working in this case and that should generally be a defect.
I ran the select * from orders instead of select count(*) in the testcase and it seems there are lots of duplicate rows in the query output. This may explain why the row count was so much more than 14999 in this case.
The text was updated successfully, but these errors were encountered: