feat: STRUCT and ARRAY support #318

jimfulton · 2021-08-30T23:24:32Z

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
Ensure the tests and linter pass
Code coverage does not decrease (if any source code was changed)
Appropriate docs were updated (if necessary)

Fixes #293
Fixes #314
Fixed #233
Fixes #37
🦕

…emy into struct

jimfulton · 2021-08-31T19:26:06Z

FTR, WRT Superset, once I finally got it working :), it behaves the same with and without these changes.

BTW, we have logic that tries to unpack sub-structs, I think so that there would eventually be scalars for Superset to work with.

If you have an array of structs, we still create columns for the fields of the struct in the array. This causes Superset to error, because it has no way to get at structs in an array. We should probably not unpack structs in arrays.

snippet-bot · 2021-09-01T15:52:22Z

Here is the summary of changes.

You are about to add 10 region tags.

docs/struct.rst:14, tag bigquery_sqlalchemy_create_table_with_struct
docs/struct.rst:34, tag bigquery_sqlalchemy_insert_struct
docs/struct.rst:42, tag bigquery_sqlalchemy_query_struct
docs/struct.rst:50, tag bigquery_sqlalchemy_query_getitem
docs/struct.rst:58, tag bigquery_sqlalchemy_query_STRUCT
samples/snippets/STRUCT.py:23, tag bigquery_sqlalchemy_create_table_with_struct
samples/snippets/STRUCT.py:44, tag bigquery_sqlalchemy_insert_struct
samples/snippets/STRUCT.py:74, tag bigquery_sqlalchemy_query_struct
samples/snippets/STRUCT.py:79, tag bigquery_sqlalchemy_query_STRUCT
samples/snippets/STRUCT.py:84, tag bigquery_sqlalchemy_query_getitem

This comment is generated by snippet-bot.
If you find problems with this result, please file an issue at:
https://github.com/googleapis/repo-automation-bots/issues.
To update this comment, add snippet-bot:force-run label or use the checkbox below:

Refresh this comment

jimfulton · 2021-09-02T13:48:58Z

Some notes for reviewers:

The heart of the change is _struct.py, which isn't large, but also isn't obvious. :( I cribbed from the built-in JSON and ARRAY types. When reviewing, it's probably helpful to look at those. The "Comparator" framework is confusing, in large part because the name doesn't make sense.

Having said that, the core logic is in _setup_getitem, which is a hook used by the base class of STRUCT's comparator.

The __getattr__ method just delegates to (the inherited) __getitem__.

This PR also has 2 other small changes:
- Machinery for mapping BQ types to SQLAlchemy types has been factored into a separate _types module, both to avoid cluttering base more and to partially avoid circular imports. (There's still a circular import issue that isn't fixable without a bigger refactoring that I deemed unwarranted.)
- Implementation of ARRAY indexing, which was previously missing. A number of my tests (see test__struct.py in both unit and system tests) use an example that has nested structs and arrays.
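
To make the `__getattr__` to `__getitem__` delegation concrete, here is a minimal pure-Python sketch. It is not the code in _struct.py: `StructAccessor`, its `path` attribute, and the nested-dict field description are all hypothetical stand-ins, and no SQLAlchemy machinery is involved. It only illustrates case-insensitive field lookup with attribute access forwarding to item access.

```python
class StructAccessor:
    """Hypothetical stand-in for a STRUCT comparator (illustration only)."""

    def __init__(self, path, fields):
        self.path = path
        # Field names are matched case-insensitively, so store them lowercased.
        self._fields = {name.lower(): sub for name, sub in fields.items()}

    def __getitem__(self, name):
        if not isinstance(name, str):
            raise TypeError(
                f"STRUCT fields can only be accessed with string field names, not {name!r}."
            )
        try:
            subfields = self._fields[name.lower()]
        except KeyError:
            raise KeyError(f"no field named {name!r} in {self.path}") from None
        return StructAccessor(f"{self.path}.{name}", subfields)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            # Guard against infinite recursion if _fields is not set yet.
            raise AttributeError(name)
        if name.lower() in self._fields:
            return self[name]  # delegate to __getitem__
        raise AttributeError(name)


person = StructAccessor(
    "`t`.`person`",
    {"name": {}, "children": {"name": {}, "bdate": {}}},
)
print(person.NAME.path)
print(person["children"]["bdate"].path)
```

Unknown names raise `AttributeError` rather than `KeyError` from `__getattr__`, which keeps `hasattr` and similar introspection behaving normally.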

tswast

I haven't quite digested everything in this PR yet, but I figured I'd share the feedback I have so far.

tswast

I like it! Just a few things I think we should clarify before merging.

tswast · 2021-09-08T15:44:37Z

sqlalchemy_bigquery/_struct.py

+        global type_compiler
+
+        try:
+            process = type_compiler.process
+        except AttributeError:
+            type_compiler = base.dialect.type_compiler(base.dialect())
+            process = type_compiler.process


Could we put this in a _get_type_compiler / _get_process function? I don't see anywhere else we initialize type_compiler, but I'd be more comfortable having this logic closer to the # We have to delay getting the type compiler, because of circular imports. :( comment.

I refactored so this is combined and isolated in one place using a new, better named _get_subtype_col_spec function.

tswast · 2021-09-08T15:47:15Z

sqlalchemy_bigquery/_struct.py

+            type_compiler = base.dialect.type_compiler(base.dialect())
+            process = type_compiler.process
+
+        fields = ", ".join(f"{name} {process(type_)}" for name, type_ in self.__fields)


I assume process is able to handle nested arrays/structs?
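
For reference, handling nested arrays and structs amounts to rendering the type recursively. The sketch below is illustrative only and does not use SQLAlchemy's type compiler: `render_type` and the tuple-based type description are hypothetical, chosen just to show the recursion that produces a BigQuery type string.

```python
def render_type(type_):
    """Render a nested type description as a BigQuery type string.

    A type is either a plain string such as "STRING", ("ARRAY", item_type),
    or ("STRUCT", [(field_name, field_type), ...]). This encoding is an
    assumption made for the example.
    """
    if isinstance(type_, str):
        return type_
    kind, arg = type_
    if kind == "ARRAY":
        return f"ARRAY<{render_type(arg)}>"
    if kind == "STRUCT":
        fields = ", ".join(f"{name} {render_type(sub)}" for name, sub in arg)
        return f"STRUCT<{fields}>"
    raise ValueError(f"unknown type kind: {kind!r}")


person = (
    "STRUCT",
    [
        ("name", "STRING"),
        ("children", ("ARRAY", ("STRUCT", [("name", "STRING"), ("bdate", "DATE")]))),
    ],
)
print(render_type(person))
# STRUCT<name STRING, children ARRAY<STRUCT<name STRING, bdate DATE>>>
```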

tswast · 2021-09-08T15:53:15Z

sqlalchemy_bigquery/_struct.py

+                    f"STRUCT fields can only be accessed with strings field names,"
+                    f" not {name}."
+                )
+            subtype = self.expr.type._STRUCT__byname.get(name.lower())


Where does _STRUCT__byname come from? I'm assuming somewhere from SQLAlchemy, but I'm not getting any results when searching for byname.

Oh, I think I figured it out: https://docs.python.org/3/tutorial/classes.html#private-variables

Any identifier of the form __spam (at least two leading underscores, at most one trailing underscore) is textually replaced with _classname__spam, where classname is the current class name with leading underscore(s) stripped.

Can we comment about this? I assume we have to do it because we know self.expr.type is a STRUCT, but it's not self.

I refactored this to make name mangling more explicit and consistent, so I don't think comments are needed anymore. See if you agree. :)

I mainly use "private" variables, which aren't :), to avoid namespace conflicts when subclassing across responsibility boundaries. Arguably, explicit naming is better.
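
A small standalone sketch of the name-mangling behavior discussed in this thread. The classes are simplified stand-ins that reuse the names from the diff; they are not the library's implementation. The point is that `__byname` written inside `STRUCT` becomes `_STRUCT__byname`, so code in any other class has to spell out the mangled name.

```python
class STRUCT:
    def __init__(self, **fields):
        # Inside this class body, __byname is rewritten to _STRUCT__byname.
        self.__byname = {name.lower(): type_ for name, type_ in fields.items()}


class Comparator:
    def __init__(self, type_):
        self.type = type_

    def lookup(self, name):
        # Written here, __byname would be mangled with *this* class's name,
        # so the STRUCT-mangled name has to be used explicitly.
        return self.type._STRUCT__byname.get(name.lower())

    def broken_lookup(self, name):
        # Mangled to self.type._Comparator__byname, which does not exist.
        return self.type.__byname.get(name.lower())


comparator = Comparator(STRUCT(Name="STRING", Age="INT64"))
print(comparator.lookup("NAME"))

try:
    comparator.broken_lookup("NAME")
except AttributeError as error:
    print("AttributeError:", error)
```

Mangling uses the innermost enclosing class, so the same applies when the comparator is defined as a nested class inside the type.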

tswast · 2021-09-08T16:34:17Z

sqlalchemy_bigquery/_struct.py

+            return operator, index, subtype
+
+        def __getattr__(self, name):
+            if name.lower() in self.expr.type._STRUCT__byname:


I'm a bit confused why self.__byname doesn't work in this case.

Edit: I see now that it's part of the Comparator class. Still probably worth a similar comment to the one I recommend in _setup_getitem

See my response on name mangling

tswast · 2021-09-08T17:04:00Z

sqlalchemy_bigquery/_types.py

+                for f in field.fields
+            ]
+            results += _get_transitive_schema_fields(sub_fields, cur_fields)
+            cur_fields.pop()


Since we pop these off, does that mean we don't get the top-level struct field, just the leaf fields? I suspect this might hide some ARRAY columns if a parent node has mode REPEATED, but is not included.

Edit: I see the top field is added to results on line 83. Might be worth a comment as to why we pop here.

I got rid of cur_fields. It wasn't needed (anymore).

tswast · 2021-09-08T17:07:50Z

sqlalchemy_bigquery/_types.py

+    for field in fields:
+        results += [field]
+        if field.field_type == "RECORD":
+            cur_fields.append(field)


I don't quite understand what cur_fields is doing. Is there a better name we can pick for this? Maybe it's referring to ancestors?

Haha, it's not doing anything.

This is based on code I inherited (python-bigquery-sqlalchemy/pybigquery/sqlalchemy_bigquery.py, lines 503 to 523 in db64424):

    def _get_columns_helper(self, columns, cur_columns):
        """
        Recurse into record type and return all the nested field names.
        As contributed by @sumedhsakdeo on issue #17
        """
        results = []
        for col in columns:
            results += [
                SchemaField(
                    name=".".join(col.name for col in cur_columns + [col]),
                    field_type=col.field_type,
                    mode=col.mode,
                    description=col.description,
                    fields=col.fields,
                )
            ]
            if col.field_type == "RECORD":
                cur_columns.append(col)
                results += self._get_columns_helper(col.fields, cur_columns)
                cur_columns.pop()
        return results

I've refactored it quite a bit and failed to notice that this wasn't needed any more. Fixed.
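
As a rough sketch of the idea after dropping `cur_fields`: if each sub-field is renamed with its parent's dotted name before recursing, no ancestor stack is needed. The `Field` dataclass and `get_transitive_fields` below are hypothetical stand-ins for `SchemaField` and the helper in _types.py, so the example runs without google-cloud-bigquery. Checking both RECORD and STRUCT follows the commit note in this PR.

```python
from dataclasses import dataclass, replace
from typing import Tuple

STRUCT_FIELD_TYPES = ("RECORD", "STRUCT")


@dataclass(frozen=True)
class Field:
    """Minimal stand-in for a schema field (assumption for this example)."""

    name: str
    field_type: str
    mode: str = "NULLABLE"
    fields: Tuple["Field", ...] = ()


def get_transitive_fields(fields):
    """Return every field plus all nested sub-fields, with dotted names."""
    results = []
    for field in fields:
        results.append(field)
        if field.field_type in STRUCT_FIELD_TYPES:
            # Prefix each child with the parent's (already dotted) name.
            sub_fields = [
                replace(f, name=f"{field.name}.{f.name}") for f in field.fields
            ]
            results += get_transitive_fields(sub_fields)
    return results


schema = [
    Field(
        "person",
        "RECORD",
        fields=(
            Field("name", "STRING"),
            Field(
                "children",
                "RECORD",
                mode="REPEATED",
                fields=(Field("name", "STRING"), Field("bdate", "DATE")),
            ),
        ),
    )
]
names = [f.name for f in get_transitive_fields(schema)]
print(names)
```

Note that this sketch still descends into REPEATED records, which is the "structs in arrays" unpacking questioned earlier in the conversation.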

tswast · 2021-09-08T17:53:48Z

tests/unit/test__struct.py

+        (_col().NAME, "`t`.`person`.NAME"),
+        (_col().children, "`t`.`person`.children"),
+        (
+            _col().children[0].label("anon_1"),  # SQLAlchemy doesn't add the label


Should we file an issue for this to investigate later? If so, let's add TODO and link to the issue.

tswast · 2021-09-09T15:45:08Z

sqlalchemy_bigquery/_struct.py

+    global _get_subtype_col_spec
+
+    type_compiler = base.dialect.type_compiler(base.dialect())
+    _get_subtype_col_spec = type_compiler.process


Fancy! I didn't realize a function could replace itself. I like it.
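
For readers unfamiliar with the pattern, here is a minimal standalone sketch of a function that replaces itself on first call. The names `get_col_spec` and `_make_processor` are hypothetical, and the "processor" is a trivial stand-in for the deferred setup step (in the PR, building the type compiler once circular imports have resolved).

```python
calls = {"setup": 0}


def _make_processor():
    """Stand-in for setup that must be deferred until first use."""
    calls["setup"] += 1
    return lambda type_name: type_name.upper()


def get_col_spec(type_name):
    """On first call, build the real processor and rebind this name to it."""
    global get_col_spec
    get_col_spec = _make_processor()
    return get_col_spec(type_name)


print(get_col_spec("string"))  # first call runs the setup
print(get_col_spec("date"))    # later calls go straight to the processor
print(calls["setup"])
```

One caveat of this design: a reference taken before the first call (for example via `from module import get_col_spec`) keeps pointing at the original function, which would then redo the setup on every call. Looking the name up through the module at call time avoids that.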

🤖 I have created a release \*beep\* \*boop\*

---

## [1.2.0](https://www.github.com/googleapis/python-bigquery-sqlalchemy/compare/v1.1.0...v1.2.0) (2021-09-09)

### Features

* STRUCT and ARRAY support ([#318](https://www.github.com/googleapis/python-bigquery-sqlalchemy/issues/318)) ([6624b10](https://www.github.com/googleapis/python-bigquery-sqlalchemy/commit/6624b10ded73bbca6f40af73aaeaceb95c381b63))

### Bug Fixes

* the unnest function lost needed type information ([#298](https://www.github.com/googleapis/python-bigquery-sqlalchemy/issues/298)) ([1233182](https://www.github.com/googleapis/python-bigquery-sqlalchemy/commit/123318269876e7f76c7f0f2daa5f5b365026cd3f))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

feat: STRUCT and ARRAY support

52cee8c

product-auto-label bot added the "api: bigquery" label Aug 30, 2021

google-cla bot added the "cla: yes" label Aug 30, 2021

jimfulton added 6 commits August 30, 2021 17:24

Merge branch 'main' into struct

a0b02f7

Fixed test that expected JSON rather than STRUCT

6bacc0d

Merge branch 'struct' of github.com:jimfulton/python-bigquery-sqlalch…

1ec0f88

…emy into struct

Added system test I neglected to check in before :(

74aab64

blacken

c5653e2

Merge branch 'main' into struct

a7f0b41

jimfulton added 8 commits August 31, 2021 16:17

Don't strip <ARRAY > from parameter types

9df1804

Otherwise, the BQ doesn't handle arrays of structs.

Added system tests to verift PR 67 and issue 233

0df1701

Merge branch 'struct' of github.com:jimfulton/python-bigquery-sqlalch…

7aad07f

…emy into struct

blacken

f10a571

Renamed test file to conform to samples test-file naming conventions

ec31040

Require google-cloud-bigquery 2.25.2 to get struct field-name undersc…

accf762

…ore fix

Added STRUCT documentation

ef5f891

fix bigquery version

cce9dbb

jimfulton added 7 commits September 1, 2021 10:37

Merge branch 'main' into struct

290d955

get blacken to leave sample code alone.

b697df6

I want it narrow to avoid horizonal scrolling

Check in missing file :(

6a278b9

Merge branch 'struct' of github.com:jimfulton/python-bigquery-sqlalch…

bc62a56

…emy into struct

need sqla 1.4 for unnest

84426bd

fixed typo

587a0f7

Merge branch 'main' into struct

e6f4adf

jimfulton marked this pull request as ready for review September 2, 2021 13:30

jimfulton requested review from a team as code owners September 2, 2021 13:30

jimfulton requested a review from tmatsuo September 2, 2021 13:30

jimfulton added 3 commits September 2, 2021 13:31

Merge branch 'main' into struct

ffb5aa9

Merge branch 'main' into struct

47fa14f

Merge branch 'main' into struct

402bbbe

tswast self-requested a review September 7, 2021 20:03

tswast requested changes Sep 7, 2021

jimfulton and others added 5 commits September 7, 2021 14:54

Update sqlalchemy_bigquery/_struct.py

5bf07b4

Co-authored-by: Tim Swast <swast@google.com>

added STRUCT docstring

e937167

Add doc link

8661f5b

Merge branch 'struct' of github.com:jimfulton/python-bigquery-sqlalch…

b550aa1

…emy into struct

Added some comments

af68a54

tswast requested changes Sep 8, 2021

jimfulton and others added 7 commits September 8, 2021 13:01

Localize logic for getting subtye column specifications

da43fd2

explain semi-private name mangling

f04cac2

Make name magling more explicit

5af05bb

explain why we have different implementations of _field_index for SQL…

09866c6

…Alchemy 1.3 and 1/4

get rid of cur_fields, we're not using it anymore.

054c227

Also, check for both RECORD and STRUCT fild types, in case the API ever starts returning STRUCT.

Add a todo to find out why Sqlalchemy doesn't generate an alias when …

1a79305

…accessing array items

user repr rather than str to shpow an object in an error message

5e2ae32

Co-authored-by: Tim Swast <swast@google.com>

tswast approved these changes Sep 9, 2021

jimfulton merged commit 6624b10 into googleapis:main Sep 9, 2021

jimfulton deleted the struct branch September 9, 2021 15:50

jimfulton mentioned this pull request Sep 9, 2021

Add support for array and struct literals #67

Closed
