This change removed support for post-lookup aggregates that use DISTINCT (e.g. COUNT(DISTINCT …)) because our implementation was incorrect.
Consider the following example:
CREATE TABLE t (x int, y int);
INSERT INTO t (x, y) VALUES (1, 1), (2, 1);
which gives us the table
x | y
-----
1 | 1
2 | 1
The query SELECT COUNT(DISTINCT y) FROM t WHERE x > 0 should return 1, since there is only one distinct value for y across x = 1 and x = 2; however, Readyset returns 2. The graph for this query looks something like this:
Base --> Distinct[y over values of x] --> Count[x] --> Reader
The distinct node contains one row for each value of x, and the count node contains a count of 1 for each of these values of x. What is not reflected in the count node is that the counts for each value of x actually include overlapping values of y (i.e. y = 1 is reflected across both values of x). When the reader node is queried for x > 0, it sums the counts across all the values of x in that range, which means we end up double-counting y = 1.
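The double-counting can be reproduced outside Readyset with a small simulation. This is only a sketch of the behavior described above, not Readyset's actual dataflow code: the names `distinct_per_x`, `count_per_x`, `summed`, and `expected` are hypothetical stand-ins for the distinct node, the count node, the reader's post-lookup sum, and the correct answer.

```python
from collections import defaultdict

# Rows of the example table t, as (x, y) pairs.
rows = [(1, 1), (2, 1)]

# Hypothetical stand-in for the Distinct node: the distinct y values
# are tracked separately for each value of x.
distinct_per_x = defaultdict(set)
for x, y in rows:
    distinct_per_x[x].add(y)

# Hypothetical stand-in for the Count node: one count per value of x.
count_per_x = {x: len(ys) for x, ys in distinct_per_x.items()}

# Post-lookup aggregation for WHERE x > 0: sum the per-key counts.
# y = 1 appears under both x = 1 and x = 2, so it is counted twice.
summed = sum(count for x, count in count_per_x.items() if x > 0)

# What COUNT(DISTINCT y) should return: de-duplicate y across all matching keys.
expected = len({y for x, y in rows if x > 0})

print(summed, expected)
```

The per-key counts are each correct in isolation; the error only appears once they are summed across keys that share a value of y.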
We could probably resolve this by compiling queries with distinct aggregates and range keys to look something like this:
Base --> Distinct --> Reader
and then computing the count at read time. That would allow us to de-duplicate rows across multiple keys.
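A minimal sketch of the read-time approach, assuming the reader holds the distinct (x, y) rows and applies the count itself. The function `count_distinct_at_read` and its `key_matches` parameter are hypothetical names used for illustration and do not correspond to any existing Readyset API.

```python
def count_distinct_at_read(distinct_rows, key_matches):
    """Count distinct y values across every key x that satisfies the range predicate."""
    seen = set()
    for x, y in distinct_rows:
        if key_matches(x):
            # A set collapses values of y that repeat under different keys.
            seen.add(y)
    return len(seen)


# Hypothetical reader state: the distinct (x, y) rows from the example table.
distinct_rows = {(1, 1), (2, 1)}

# Equivalent of SELECT COUNT(DISTINCT y) FROM t WHERE x > 0.
result = count_distinct_at_read(distinct_rows, lambda x: x > 0)
print(result)
```

The trade-off in this sketch is that the count is recomputed on every read instead of being maintained incrementally, so read cost grows with the number of distinct rows in the queried range.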
We should also investigate other potential strategies.
Change in user-visible behavior
Yes
Requires documentation change
Yes