Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SqlMagic.autolimit doesn't add a limit clause to the query #1006

Open
takikadiri opened this issue Mar 29, 2024 · 3 comments
Open

SqlMagic.autolimit doesn't add a limit clause to the query #1006

takikadiri opened this issue Mar 29, 2024 · 3 comments

Comments

@takikadiri
Copy link

What happens?

We're using jupysql to query a trino cluster through jupyter notebook. I want to autolimit the query results that we query to the cluster, as it's intended to be used for an interactive analytics workload.

I expect the SqlMagic.autolimit to add a LIMIT clause to the query, but it doesn't, i can see the original query in the Trino Admin UI.

To Reproduce

  • Create a trino sqlalchemy connection. trino_connection = create_engine("trino://trino_cluster....")
  • set an autolimit. %config SqlMagic.autolimit=500
  • Use the connection: %sql trino_connection
  • Run a query: %sql SELECT * FROM BIG_TABLE
  • Get the result sets or See the query in the trino UI. You'll get the SELECT * FROM BIG_TABLE without the LIMIT Clause

OS:

Linux

JupySQL Version:

0.10.10

Full Name:

Takieddine KADIRI

Affiliation:

Société Générale Assurances

@edublancas
Copy link

edublancas commented Mar 29, 2024

yeah, the effect of autolimit is not to add a LIMIT to the statement but to send the query and only fetch the top X results, are you seeing performance issues?

you can see the relevant code here.

the main reason not to add LIMIT, is that some DBs don't support it so we'd have to figure out a way to make it portable. I guess it wouldn't be too hard using sqlglot, but given that we support many databases, adding stuff like this ends up breaking features for some users.

perhaps a more reasonable solution would be to add LIMIT only to the dbs that we're sure they support it, but even detecting which db the user is connected to is challenging and we've encountered edge cases.

bottom line is that we've decided not to modify user-submitted queries to prevent these issues. if you have time, feel free to open a PR, perhaps there is another approach that I'm overlooking

@takikadiri
Copy link
Author

takikadiri commented Mar 30, 2024

Ah i see, The doc mislead me. Thank you for the quick feedback.

For the interactive analytics workload, we're also using Apache Superset, that manage to systematically LIMIT the query. That's why i was looking for this in jupysql

Maybe i could use jupysql for a slighly different use case, where the desired behavior is to collect the entire query results for further analysis with pandas or for doing joins with another data sources. I'll tests the user bahaviors and the cluster performance with jupysql and autopandas=True and let you know if the LIMIT clause is really needed here.

@edublancas
Copy link

sounds good, thanks for your feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants