Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sampling extension support to PRQL #4425

Open
hsight opened this issue Apr 28, 2024 · 3 comments
Open

Add sampling extension support to PRQL #4425

hsight opened this issue Apr 28, 2024 · 3 comments

Comments

@hsight
Copy link

hsight commented Apr 28, 2024

What's up?

It would be great to handle table sampling in PRQL.

Examples:

  • DuckDB has this
    • A term usually before ORDER BY/LIMIT
  • Clickhouse has this
    • A term after a table FROM definition
  • MySQL/Postgres do not seem to have an option.
@kgutwin
Copy link

kgutwin commented May 1, 2024

I really like this idea, it could come in handy!

I wondered if it would be possible to approximate this using a PRQL function; that might allow you to try it out before it's implemented natively within PRQL. I put this function together which works! (sorta... see below)

let sample = func
  nrows <float>
  tbl <relation>
  -> <relation> (
  from s"SELECT * FROM {tbl} USING SAMPLE {nrows}"
)

from invoices
sample 20
select { customer_id }

This outputs the following SQL:

WITH table_0 AS (
  SELECT
    *
  FROM
    invoices USING SAMPLE 20
)
SELECT
  customer_id
FROM
  table_0

-- Generated by PRQL compiler version:0.11.3 (https://prql-lang.org)

However, I start to get strange errors if I modify the pipeline preceding the sample call. If I move the select { customer_id } before the sample 20, I get the error:

from invoices
select { customer_id }
sample 20

Error: 
   ╭─[:9:1]
   │
 9 │ select { customer_id }
   │ ───────────┬──────────  
   │            ╰──────────── unexpected ``(Select ...)``
   │ 
   │ Help: this is probably a 'bad type' error (we are working on that)
───╯

Perhaps I'm using the S-string wrong somehow... does anyone have suggestions?

@max-sixty
Copy link
Member

@kgutwin great use of s-strings!

The one change to fix that error is:

let sample = func
-  nrows <float>
+  nrows <int>
  tbl <relation>
  -> <relation> (
  from s"SELECT * FROM {tbl} USING SAMPLE {nrows}"
)

I agree sample is useful. There are a lot of options for sample in DuckDB, which makes it a bit harder to add (though still tractable...). Possibly we could add it to the docs as a nice example of s-strings initially.

Would be very open to a PR for either!

@aljazerzen
Copy link
Member

We need to do a survey of supported features in these databases and come up with an interface that will allow (eventually) supporting all of those features.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants