
Add support for Server Side Cursors (a.k.a. stream results) #407

Open
jlynchMicron opened this issue Feb 1, 2022 · 4 comments
Labels
api: bigquery — Issues related to the googleapis/python-bigquery-sqlalchemy API.
type: feature request — ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@jlynchMicron

jlynchMicron commented Feb 1, 2022

Is your feature request related to a problem? Please describe.
Please support the SQLAlchemy stream_results cursor feature to break up large queries that would otherwise overwhelm a system's memory.

Describe the solution you'd like
An implementation similar to this post: https://pythonspeed.com/articles/pandas-sql-chunking/

import pandas as pd
from sqlalchemy import create_engine

def process_sql_using_pandas():
    engine = create_engine(
        "postgresql://postgres:pass@localhost/example"
    )
    conn = engine.connect().execution_options(
        stream_results=True)

    for chunk_dataframe in pd.read_sql(
            "SELECT * FROM users", conn, chunksize=1000):
        print(f"Got dataframe w/{len(chunk_dataframe)} rows")
        # ... do something with dataframe ...

if __name__ == '__main__':
    process_sql_using_pandas()

Describe alternatives you've considered
Currently I have to perform an upload after every SELECT DISTINCT query I run to ensure that my next query "chunk" will not fetch the same rows again.
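As an illustration of that manual-chunking workaround (not code from this issue), here is a sketch of keyset pagination using the stdlib sqlite3 DB-API: each query resumes after the last id seen, so no row is fetched twice. The table name, id column, and chunk size are all hypothetical.

```python
import sqlite3

def iter_chunks(conn, chunk_size=1000):
    """Yield rows in chunks via keyset pagination: each query resumes
    after the last id seen, so no row is fetched twice."""
    last_id = -1
    while True:
        rows = conn.execute(
            "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        last_id = rows[-1][0]  # resume after the highest id in this chunk

# Demo with an in-memory database holding 2500 rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(2500)])
chunks = list(iter_chunks(conn, chunk_size=1000))
print([len(c) for c in chunks])  # → [1000, 1000, 500]
```

Unlike OFFSET-based paging, this stays fast on large tables because each query is an indexed range scan, and it is immune to duplicates as long as the id column is unique and ordered.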

Related Issue in Pandas: pandas-dev/pandas#35689

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API. label Feb 1, 2022
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Feb 2, 2022
@meredithslota meredithslota added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed triage me I really want to be triaged. labels Feb 3, 2022
@jlynchMicron
Author

jlynchMicron commented Feb 9, 2022

It appears that this feature actually works and is implemented by this function:

Someone please correct me if I'm wrong. I believe I have it working in a project I am using, but I do not have a good way to verify that my machine is not pulling down the entire query result first and then just iterating through it.

@tswast
Collaborator

tswast commented Feb 28, 2022

The DB-API works as you suggest, pulling down only a page at a time as needed, but I'm unsure how this interacts with the SQLAlchemy connector. Can keep this open for further investigation.
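To illustrate the page-at-a-time DB-API behavior described above, here is a generic sketch using the stdlib sqlite3 driver. This is an assumption-laden illustration, not BigQuery-specific code: the PEP 249 cursor.fetchmany interface shown here is the same one the BigQuery DB-API exposes, but the table and page size below are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO users VALUES (?)", [(i,) for i in range(250)])

cur = conn.cursor()
cur.execute("SELECT id FROM users ORDER BY id")

page_sizes = []
while True:
    # fetchmany pulls only `size` rows per call, keeping memory bounded
    page = cur.fetchmany(size=100)
    if not page:
        break
    page_sizes.append(len(page))

print(page_sizes)  # → [100, 100, 50]
```

The open question in this issue is whether SQLAlchemy's stream_results execution option actually routes through this paged path for the BigQuery dialect, or whether the full result set is materialized first.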

@balajivenkatesh

@jlynchMicron Would you be able to share an example of using create_cursor to read in chunks?

@jlynchMicron
Author

@balajivenkatesh This is roughly what I think you need to do; create_cursor happens under the hood in this pandas example:

import pandas as pd
import sqlalchemy as sa

with engine.connect() as conn:
    # Request a server-side cursor so rows stream in pages
    conn = conn.execution_options(stream_results=True)
    df: pd.DataFrame
    for df in pd.read_sql(sql, conn, chunksize=query_chunksize):
        ...  # process each chunk here
