Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unnest Function In BigQuery SQLAlchemy Plugin Joins Twice When Selecting Other Columns in CTE or Subquery #368

Closed
mgtcardenas opened this issue Oct 28, 2021 · 11 comments · Fixed by #924
Labels
api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API. priority: p3 Desirable enhancement or fix. May not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@mgtcardenas
Copy link

From a customer support request:

Issue summary: This issue is closely related to Unnest function in BigQuery SQLAlchemy plugin joins twice when selecting other columns #353. After the previous issue was resolved, queries using the UNNEST function that are not in Common Table Expressions (CTE) or subqueries work as expected. See the following minimal example:

from sqlalchemy import create_engine, MetaData, select, func as F, Table, Column, Integer, ARRAY

metadata = MetaData()
table = Table(
  "table1",
  metadata,
  Column("foo", Integer),
  Column("bars", ARRAY(Integer))
)

engine = create_engine("bigquery://", future=True)
q = select(table.c.foo, F.unnest(table.c.bars).column_valued("bar"))
print(q.compile(engine))

This outputs the following, which is expected:

SELECT `table1`.`foo`, `bar` 
FROM `table1`, unnest(`table1`.`bars`) AS `bar`

However, if we use the table as a subquery or CTE, the issue manifests again. The following code (continuing from above):

q = q.cte("cte")
q = select(*q.columns)
print(q.compile(engine))

Produces the next output:

WITH `cte` AS 
(SELECT `table1`.`foo` AS `foo`, `bar` 
FROM `table1`, `table1` `table1_1`, unnest(`table1_1`.`bars`) AS `bar`)
 SELECT `cte`.`foo`, `cte`.`bar` 
FROM `cte`

Note that table1 is duplicated in the FROM clause again.

Expected behavior: It is expected to generate the following query instead (note the different FROM clause):

WITH `cte` AS 
(SELECT `table1`.`foo`, `bar` 
FROM `table1`, unnest(`table1`.`bars`) AS `bar`)
 SELECT `cte`.`foo`, `cte`.`bar` 
FROM `cte`
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API. label Oct 28, 2021
@tswast tswast added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Oct 28, 2021
@tswast
Copy link
Collaborator

tswast commented Oct 28, 2021

@tseaver Would you mind looking into this? Seems we need a follow-up to #361 to account for sub-queries and table-valued expressions.

@tseaver
Copy link
Contributor

tseaver commented Oct 28, 2021

PR #369 adds a fix for the CTE case. I tried to write a failing test using subquery, but didn't manage to come up with a coherent one.

I have a feeling that this is going to be whack-a-mole salad (e.g., nesting CTEs is likely still broken).

tswast pushed a commit that referenced this issue Oct 29, 2021
@tswast
Copy link
Collaborator

tswast commented Oct 29, 2021

I took a closer looks at this logic. The multiple "FROM" logic is technically a cross-join https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#comma_cross_join

That is probably what we want, since it "flattens" the array, but I wonder if we can make that cross-join more explicit to avoid issues like this? Perhaps we make unnest produce some kind of table-like object and force users to join themselves?

@kwohlfahrt
Copy link

Hi, I originally submitted the request, let me know if I can do anything to help.

The linked PR fixed the issue as reported:

q = select(table.c.foo, F.unnest(table.c.bars).column_valued("bar"))
q = q.cte("cte")
q = select(*q.columns)
print(q.compile(engine))
WITH `cte` AS 
(SELECT `table1`.`foo` AS `foo`, `bar` 
FROM `table1`, unnest(`table1`.`bars`) AS `bar`)
 SELECT `cte`.`foo`, `cte`.`bar` 
FROM `cte`

but...

I have a feeling that this is going to be whack-a-mole salad (e.g., nesting CTEs is likely still broken).

q = q.cte("cte2")
q = select(*q.columns)
print(q.compile(engine))
WITH `cte` AS 
(SELECT `table1`.`foo` AS `foo`, `bar` 
FROM `table1`, `table1` `table1_1`, unnest(`table1_1`.`bars`) AS `bar`), 
`cte2` AS 
(SELECT `cte`.`foo` AS `foo`, `cte`.`bar` AS `bar` 
FROM `cte`)
 SELECT `cte2`.`foo`, `cte2`.`bar` 
FROM `cte2`

yup.

Perhaps we make unnest produce some kind of table-like object and force users to join themselves?

It's definitely possible that I'm using this wrong, and if there is a more explicit way to express this I'm happy to use that. FWIW the Postgres backend seems to be OK with this though:

pg_engine = create_engine("postgresql://", future=True)
print(q.compile(pg_engine))
WITH cte AS 
(SELECT table1.foo AS foo, bar 
FROM table1, unnest(table1.bars) AS bar), 
cte2 AS 
(SELECT cte.foo AS foo, cte.bar AS bar 
FROM cte)
 SELECT cte2.foo, cte2.bar 
FROM cte2

@tswast
Copy link
Collaborator

tswast commented Nov 1, 2021

@kwohlfahrt thanks for the examples and the note regarding Postgres.

@mgtcardenas
Copy link
Author

Hello team! This issue has been open for some time now. I just wanted to make a comment to bring some visibility to the issue. @kwohlfahrt you mentioned that the linked PR fixed the issue as reported but you have another use case where you are uncertain whether you are using the product wrong. @tseaver how could we resolve this situation?

@tseaver tseaver removed their assignment Dec 29, 2021
@yoshi-automation yoshi-automation added 🚨 This issue needs some love. and removed 🚨 This issue needs some love. labels Mar 29, 2022
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Apr 26, 2022
@meredithslota meredithslota added priority: p3 Desirable enhancement or fix. May not be included in next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. labels Apr 28, 2022
@yoshi-automation yoshi-automation removed the 🚨 This issue needs some love. label Apr 28, 2022
@lazjake
Copy link

lazjake commented Jun 27, 2022

👋 any progress on this one? sadly running into the nested cte issue 😢

@set92
Copy link

set92 commented Nov 18, 2022

Since this look completely stagnant, someone knows some workaround it? Is possible to write the UNNEST(table.col) AS x as text in the FROM?

I tried using the solution of @kwohlfahrt but didn't work because I need to know the values inside the struct, so I had to use table_valued().
sel = select(func.UNNEST(table.c[self.output_attribute_id]).table_valued(*self.input_attribute_ids)).cte("cte") _compile_query(sel)

But it returns

SELECT `table_1`.`oo_single`, `anon_1`.`pp_single`, `anon_1`.`pp_total` 

FROM
  `table` `table_1`,
  UNNEST(`table_1`.`pp_metrics`) AS `anon_1`

which basically duplicates the data.

@jlynchMicron
Copy link

I am getting this issue when trying to unnest an array of structs, similar to this issue: #376

@evakill
Copy link

evakill commented Oct 24, 2023

@lazjake oh hi - still an issue 🫠

@snapiri
Copy link
Contributor

snapiri commented Nov 26, 2023

IMHO the original problematic code was fixed by #839, though there is still a scenario that is not handled:

When creating a statement using JOIN + UNNEST and then using it as a subquery for a new SELECT statement, it causes the table used in the UNNEST to be specified twice in the FROM statement, thus creating implicit JOIN between that table to itself.

Example code:

from sqlalchemy import ARRAY, Column, create_engine, func, Integer, MetaData, Table, column, select

db_metadata = MetaData()
engine = create_engine("bigquery://", future=True)

column_defs1 = [Column("id", Integer, nullable=True)]
column_defs2 = [Column("ids", ARRAY(Integer), nullable=True), Column("dummy", Integer, nullable=True)]

table1 = Table("table1", db_metadata, *column_defs1)
table2 = Table("table2", db_metadata, *column_defs2)

unnested_col_name = "unnested_ids"
unnested_ids = func.unnest(table2.c.ids).alias(unnested_col_name)
unnested_id_col = column(unnested_col_name)

q = select(table1.c.id, table2.c.dummy).select_from(unnested_ids.join(table1, table1.c.id == unnested_id_col))
print('\nRaw query\n---------')
print(str(q))
print('\nCompiled query\n--------------')
print(str(q.compile(engine)))

# print('\nNew query\n---------')
# # THIS CAUSES THE ISSUE
q = select("*").select_from(q.subquery())
print('\nRaw query\n---------')
print(str(q))
print('\nCompiled query\n--------------')
compiled = q.compile(engine)
print(str(compiled))

Which prints the following:

Raw query
---------
SELECT table1.id, table2.dummy
FROM table2, unnest(table2.ids) AS unnested_ids JOIN table1 ON table1.id = unnested_ids

Compiled query
---------
SELECT `table1`.`id`, `table2`.`dummy`
FROM `table2`, unnest(`table2`.`ids`) AS `unnested_ids` JOIN `table1` ON `table1`.`id` = `unnested_ids`

Raw query
---------
SELECT *
FROM (SELECT table1.id AS id, table2.dummy AS dummy
FROM table2, unnest(table2.ids) AS unnested_ids JOIN table1 ON table1.id = unnested_ids) AS anon_1

Compiled query
---------
SELECT *
FROM (SELECT `table1`.`id` AS `id`, `table2`.`dummy` AS `dummy`
FROM `table2`, `table2` `table2_1`, unnest(`table2_1`.`ids`) AS `unnested_ids` JOIN `table1` ON `table1`.`id` = `unnested_ids`) AS `anon_1`1`

Thanks to @yuval-bavli for helping with pinpointing the scenario

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-sqlalchemy API. priority: p3 Desirable enhancement or fix. May not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

12 participants