Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong results for filtered aggregates when run through SQL query, this was working on 25.0.0 but found not working on latest release 29.0.0, so has broken somewhere in between #16178

Open
stamboli opened this issue Mar 20, 2024 · 7 comments

Comments

@stamboli
Copy link

SampleSaleData.json
SampleSaleData.csv

Affected Version

29.0.0

Description

  • Steps to reproduce the problem
    Upload the csv being attached, druid spec for it is also attached for reference

Simplest query that can reproduce issues is as follows

SELECT
COUNT(DISTINCT (CASE WHEN (("SampleSaleData"."__time" >= '2022-01-12T00:00:00.000Z') AND ("SampleSaleData"."__time" < '2022-01-13T00:00:00.000Z')) THEN "City" END)) AS "P2-DistinctCities",
COUNT(DISTINCT (CASE WHEN (("SampleSaleData"."__time" >= '2022-01-05T00:00:00.000Z') AND ("SampleSaleData"."__time" < '2022-01-06T00:00:00.000Z')) THEN "City" END)) AS "P1-DistinctCities"
FROM
SampleSaleData "SampleSaleData"

This query which used to work earlier and give results as 4, 8 is now returning 0,0

  • Any debugging that you have already done
    -If this query is run keeping one aggregation at a time, either first one or second one only this still gives correct result, only when both are present it is giving wrong result
@stamboli stamboli changed the title Wrong results for filtered aggregates, this was working on 25.0.0 but found not working on latest release 29.0.0, so has broken somewhere in between Wrong results for filtered aggregates when run through SQL query, this was working on 25.0.0 but found not working on latest release 29.0.0, so has broken somewhere in between Mar 20, 2024
@abhishekagarwal87
Copy link
Contributor

was the approximate distinct count turned off when you ran this query?

@stamboli
Copy link
Author

stamboli commented Mar 20, 2024 via email

@stamboli
Copy link
Author

environment.txt

@abhishekagarwal87
Copy link
Contributor

can you set druid.sql.planner.useGroupingSetForExactDistinct to true and see if that fixes the issue? This bug might be same as what's being discussed here - apache/calcite#3735 (comment)

@abhishekagarwal87
Copy link
Contributor

Though I am surprised how did this query even work in 25.0.0 without you setting druid.sql.planner.useGroupingSetForExactDistinct. It would have failed outright.

@stamboli
Copy link
Author

No luck with this setting too :(
Surprisingly as explained above even without this flag or so single aggregation at a time works

@stamboli
Copy link
Author

Looking at your test case I formed query based on it which works.
SELECT
COUNT(DISTINCT "City") FILTER (WHERE ("SampleSaleData"."__time" >= '2022-01-12T00:00:00.000Z') AND ("SampleSaleData"."__time" < '2022-01-13T00:00:00.000Z')) AS "P2-DistinctCities",
COUNT(DISTINCT "City") FILTER (WHERE ("SampleSaleData"."__time" >= '2022-01-05T00:00:00.000Z') AND ("SampleSaleData"."__time" < '2022-01-06T00:00:00.000Z')) AS "P2-DistinctCities"
FROM
SampleSaleData "SampleSaleData"

But this query, very specific to druid. The solution we are building need to work with multiple DB this query is not working with MySQL as well as Snowflake, the queries are built dynamically, so a very specific druid SQL needs to be built. Until now CASE based query was worked with other traditional standard DBs used to work with druid as well.
So overall now this is failure is specifically related to CASE statements with multiple such aggregations

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants