Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Window functions do not work in query with aggregate #259

Open
frtennis1 opened this issue Jan 1, 2024 · 0 comments · May be fixed by #260
Open

Window functions do not work in query with aggregate #259

frtennis1 opened this issue Jan 1, 2024 · 0 comments · May be fixed by #260

Comments

@frtennis1
Copy link
Contributor

This is half bug half feature request. The following snippet does not currently work,

x = np.random.normal(scale=50, size=100)
y = x + np.random.normal(scale=5, size=len(x))
df = pd.DataFrame({'x_val': x, 'y': y})

MosaicWidget(
    {
        "plot": [
            {
                "mark": "ruleX",
                "y1": {"agg": "avg(y) - stddev(y)/sqrt(count(y))"},
                "y2": {"agg": "avg(y) + stddev(y)/sqrt(count(y))"},
                "x": {"expr": "ntile(10) over (order by x_val)"},
                "marker": "tick",
                "data": {"from": "df"},
            }
        ]
    },
    data={'df': df},
)

That's because the generated sql is problematic,

> SELECT avg(y) - stddev(y)/sqrt(count(y)) AS "y1", avg(y) + stddev(y)/sqrt(count(y)) AS "y2", ntile(10) over (order by x_val) AS "x" FROM "df" AS "source" GROUP BY "x"
Binder Error: GROUP BY clause cannot contain window functions!

It'd be pretty nice if the above chart snippet could work because the intent seems unambiguous to me.

A suggestion for making it work would be to modify the query generation to move the non-aggregate clause computation into a "WITH" statement where there are no restrictions on window functions, i.e.

WITH "__mosaicTemp" AS (SELECT *, ntile(10) over (order by x_val) AS "x" FROM "df") SELECT avg(y) - stddev(y)/sqrt(count(y)) AS "y1", avg(y) + stddev(y)/sqrt(count(y)) AS "y2", "x" FROM "__mosaicTemp" GROUP BY "x"

which I think should not be a perf cost in cases where it's not needed because duckdb will optimize it away (?).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant