Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for array and struct literals #67

Closed
wants to merge 2 commits into from

Conversation

JacobHayes
Copy link
Contributor

@JacobHayes JacobHayes commented Sep 26, 2020

This allows building arrays and structs up in sa.select statements from other columns. The array is just ripped right from the postgres dialect with a custom visit_array. The struct have their own type and column expression. I didn't touch the existing RECORD type (which is just JSON) but instead made a custom STRUCT type that supports lookups to access subfields - not sure how to reconcile this with the existing RECORD type. Is it as easy as just changing _type_map's value to STRUCT instead of JSON?

These have been useful for pivots (ie: array of struct("{X}" as name, {col} as value)).

NOTE: due to struct fields requiring labels, things break when not using literal_binds due to #39, but work fine after #47. I have an integration branch here if it is useful in the meantime. #47 was merged

@JacobHayes
Copy link
Contributor Author

JacobHayes commented Sep 26, 2020

Simple select example:

import sqlalchemy as sa
from pybigquery import array, struct
from pybigquery.sqlalchemy_bigquery import BigQueryDialect

engine = sa.create_engine("bigquery://<your connstr here>")
engine.execute(
    """
    create table tmp.test_struct as (
        select 'a' as id, struct(1 as x__count, 2 as y__count, 3 as z__count) as dimensions
    )
    """
)

table = sa.Table("tmp.test_struct", sa.MetaData(bind=engine), autoload=True)
# This would obviously be cleaner if the RECORD built these natively.
dimensions = struct([], table.c.dimensions.name)
query = sa.select([table.c.id, (dimensions["x__count"] + 5).label("x")])

print(query.compile(dialect=BigQueryDialect(), compile_kwargs={"literal_binds": True}))

Example doing a pivot:

import sqlalchemy as sa
from pybigquery import array, struct
from pybigquery.sqlalchemy_bigquery import BigQueryDialect

engine = sa.create_engine("bigquery://<your connstr here>")
engine.execute(
    """
    create table tmp.test as (
        select 1 as x__count, 2 as y__count, 3 as z__count
    )
    """
)

table = sa.Table("tmp.test", sa.MetaData(bind=engine), autoload=True)
pivot_unnest = sa.func.unnest(
    array(
        [
            struct([sa.literal(dim).label("dimension"), col.label("count")])
            for dim, col in {
                "x": table.c.x__count,
                "y": table.c.y__count,
                "z": table.c.z__count,
            }.items()
        ]
    )
).alias("pivot")
pivot = struct([sa.column("dimension"), sa.column("count")], field=pivot_unnest.name)
query = sa.select(
    [pivot["dimension"].label("dimension"), pivot["count"].label("count")]
).select_from(table.join(pivot_unnest, sa.literal(True)))

print(query.compile(dialect=BigQueryDialect(), compile_kwargs={"literal_binds": True}))

It's a little bit awkward since they're not table objects (so no .c and instead do getitem style lookup). It seems like we may be able to mock that though with a little more Comparator work.

A few other things that'd be nice:

  • STRUCT typing (ie: track name + types) - we should get this from both the BQ reflection and any dynamic struct(...) calls.
  • validate field access against that metadata and determine return value datatype (including typed sub-structs)

@tswast tswast self-requested a review January 7, 2021 15:32
@tswast
Copy link
Collaborator

tswast commented Jan 7, 2021

It might make sense to support ARRAY first, as there's already a data type for it in SQLAlchemy. https://docs.sqlalchemy.org/en/13/core/type_basics.html#sql-standard-and-multiple-vendor-types

Have you seen any discussion on adding STRUCT/RECORD types to SQLAlchemy for other backends? I'd want to make sure we do something that aligns with the direction of the rest of the community.

@JacobHayes
Copy link
Contributor Author

JacobHayes commented Jan 26, 2021

@tswast makes sense - the ARRAY datatype is handled with #64, correct? I don't see an array literal (sa.select instead of sa.Table) except the postgres one, which might be because "currently only the PostgreSQL backend has support for SQL arrays in SQLAlchemy". I'm happy to pull out the visit_array and array import (or would you rather copy the class array?) if you'd prefer.

Re a STRUCT/RECORD datatype - I didn't find anything on a cursory look at the sqla Google Group, they might be even more rare than ARRAY datatypes. 😁

Perhaps I can start a discussion in the Google Groups about these things.

@jimfulton
Copy link
Contributor

@JacobHayes wrt array literal, why not just literal([2, 6])

I guess this gets more exciting with structs, and especially arrays of structs.

@JacobHayes
Copy link
Contributor Author

@jimfulton oh yeah, hadn't thought of literal. That does seem to work for simple values, but not for column references:

import sqlalchemy as sa
from pybigquery import array, struct
from pybigquery.sqlalchemy_bigquery import BigQueryDialect

engine = sa.create_engine("bigquery://")
engine.execute(
    """
    create or replace table tmp.test_struct as (
        select 'a' as id, struct(1 as x__count, 2 as y__count, 3 as z__count) as dimensions
    )
    """
)

table = sa.Table("tmp.test_struct", sa.MetaData(bind=engine), autoload=True)

print(list(engine.execute(sa.select([sa.literal(["a"])]))))
# [(['a'],)]
engine.execute(sa.select([sa.literal([table.c.id])]))
# ProgrammingError: (google.cloud.bigquery.dbapi.exceptions.ProgrammingError) Encountered unexpected first array element of parameter param_1, cannot determine array elements type.
# [SQL: SELECT %(param_1)s AS `anon_1`]
# [parameters: {'param_1': [Column('id', String(), table=<tmp.test_struct>)]}]
# (Background on this error at: http://sqlalche.me/e/13/f405)

print(list(engine.execute(sa.select([array(["a"])]))))
# [(['a'],)]
print(list(engine.execute(sa.select([array([table.c.id])]))))
# [(['a'],)]

@jimfulton
Copy link
Contributor

Thanks. It's really helpful to have specific examples!

@jimfulton
Copy link
Contributor

FTR, IMO a big (surprising) part of the value proposition of SQLAlchemy is as an abstraction layer over disparate database engines. For that reason, I'd like to enable as much as possible without introducing dialect-specific APIs.

@JacobHayes
Copy link
Contributor Author

Absolutely agree there! The array literal side at least matches postgres, but structs seem a bit more unique (though not so much for data warehouses). Perhaps upstream can be convinced to move the array constructor from the postgres dialect to common and (maybe harder) add a struct?

Additionally, a proper STRUCT column type would be great so we can start generating DDL for more complex tables (JSON won't do as it doesn't define the subfields).

@jimfulton
Copy link
Contributor

@JacobHayes please checkout https://github.com/googleapis/python-bigquery-sqlalchemy/pull/318/files#diff-ee8781d1be11dbad05f4936052659eed168b4ebeaef166b9e2684f74293e30f8R94-R183

and let me know if I satisfied what you were trying to do.

I didn't really follow your pivot example, but I'm hoping I covered it in test_unnest_and_struct_access_233.

@jimfulton
Copy link
Contributor

Superseded by #318

@jimfulton jimfulton closed this Sep 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants