New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[multistage] Subqueries return different results depending on where clause #12949
Comments
Can you share the whole logical plan for these 2 queries? |
Here the two completet query plans:
Without the filter:
Enclosed in "**" the 2 lines that differs between the 2 plans. Please note: the "**" is not part of the plan, it should be bold which does not work in code.
Thanks |
@gortiz Can you help take a look at this query? |
This is quite a complicated query, but is it possible that with the extra filter, the following filter returns false thus less sessions returned?
I don't see how table c is joined here |
to reduce the complexity of the execution plan I removed the subselect in the select. This subselect is not necessary but was there only for debugging.
The query plan is:
The overall goal is to compute the Bytes sent and received by a device in a defined timeframe from freeradius accounting data. We are still in the POC phase to see if Pinot is the rigth tool for it |
This might have the same root cause and may be easier to debug.
returns 185096 records
returns 466790 records.
pretty simple and stright forward.
has 3 additional joins |
I was able to replicate a similar plan in MultiStageQuickStart using: select
playerID,
numberOfGamesAsBatter,
playerName,
runs,
intentionalWalks,
CAST(hits as BIGINT) as Zeit
from baseballStats a
join dimBaseballTeams b on a.teamID = b.teamID
where not exists ( select 1 from baseballStats c where c.teamID = a.teamID and c.yearID < a.yearID)
order by Zeit desc Which generates
But at lest with the numbers we have in this quickstart, the rows returned by this version are smaller than the number returned without the where. I'll take a look next week trying to understand the rules that have been applied to generate that plan. Meanwhile, can you try to use LAST_WITH_TIME instead? IICU your query, that should be semantically equivalent to:
|
Thanks for the query.
But it is not the same as the second query with the "not exists" clause. Please correct me if I am wrong. The overall goal is to find matching Start-Stop Records though they have the same "acctuniquesessionid". The key is the timestamp column "ts". Simply joining both tables will return 4 records.
And the next step is to kick out the pair where Start from the first session is combined with the Stop of the second session.
Which returns the desired pairs: Removing the filter for acctuniquessionid still shows the right records but the number of records returned differ by the factor 2.5 depending on the where clause |
We are pretty blind here given we don't know in which stage the extra rows are being generated. @RalfJL could you compile and run your case with the code I have in #12704? That PR is quite long but as you can see in the description, we are adding there a new stats system that can be used to list information per operation. It would be super useful to understand where the problem is being generated. |
Sure I will do that. |
We plan to have this PR merged soon (maybe next week) which would mean you will have a docker image ready to use. But in order to accelerate the discoveries, you can fetch my code and compile yourself, but luckily we already include an script to do so. Specifically, in any branch, you can execute: cd docker/images/pinot
./docker-build.sh 1.2.0-multi-stage-stats multi-stage-stats https://github.com/gortiz/pinot.git AFAIR the only dependency is to be able to build docker images (and have internet access). Once that command finishes (it will take some mins) you should have a docker image called |
Hmmm, |
That looks strange, AFAIK we didn't change that part of the code since 1.1.0. But good to know, maybe we can catch where the problem was fixed thanks to that.
It should be in the returned JSON. If you execute the query using the Pinot UI (aka the controller web interface) you should be able to read that by clicking on Show JSON format. Then the table below will be changed with a (probably very large) json that will contain the data of your response and other metadata like BTW, I've just merged my branch, so the next version published docker image tagged with latest will contain my changes. IIRC they are published every day (or maybe even more than once every day) |
Very strange! I can't reproduce it. Neither with the current stable 1.1.0 nor with the repository 1.2.0 from gortiz. I suggest to close this thread with reason "not reproducable". Because I am not done yet with the application we are trying to build, I will be on it for the next 3 or 4 weeks. And if I find that problem again I can reopen this thread. By the way, I confirm that the execution plans are almost the same. But the changes look to me not to be relevant. |
were you able to use the |
I found the stageStats json element but currently I do not really have an idea of what to look for or pay attention to. |
Different result set on successive calls of the same SQL. This might be the same root cause as above.
and the number of rows returned differ between 10 and 24. So sometimes it returns 10 immediatley after that 17 rows, immediatley after it 24 rows than mabye 10 again or maybe 14, ...
compared to
The database is currently static. No rows added or deleted. |
Currently I am investigating if Pinot can help us to calculate inbytes and outbytes from Radius accounting data.
The setup is:
sysbench (produces artificial accounting data) -> freeradius -> kafka -> pinot
The randomness of the artificial data is low so we end up with several "sessions" having the same Acct-Unique-Session-ID.
e.g. query:
returns:
Obviously there are two different sessions with the same acctuniquesessionid in the data.
In real life we can not guarantee that Session ID's are always unique, so this might also happen in real life data.
The solution is to filter out all 'Start' - 'Stop' combinations where there is a 'Start' or a 'Stop' in between.
And here comes the problem:
query:
Returns:
The two sessions as expected.
Removing the filter a.acctuniquesessionid = 'da0f86138ec36b2364889f048bf2ac82' shows a very different result
query:
result (shortened by thousands of other sessions):
So here obviously both subqueries return wrong results. There is a 'Start' and a 'Stop' between timestamp 1712759668126 and 1712050245468 but it is not detected anymore when the filter for the acctuniquessionid is removed
Am I missing here something or am I hitting a bug in multistage engine?
I am using Pinot 1.1.0 as docker compose image (single instance of all server processes)
The query plan for both queries is the same beside the Logical Filter
first query:
LogicalFilter(condition=[AND(=($14, _UTF-8'da0f86138ec36b2364889f048bf2ac82'), =($12, _UTF-8'Stop'))])
second query:
LogicalFilter(condition=[=($12, _UTF-8'Stop')])
PLEASE NOTE: Some filters are redundant like "b.acctuniquesessionid = c.acctuniquesessionid and a.acctuniquesessionid = c.acctuniquesessionid" and are a test if the result changes. But it doesn't
The text was updated successfully, but these errors were encountered: