Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error querying a nested struct using UNNEST #308

Open
KevinRizzoTO opened this issue Apr 26, 2024 · 1 comment
Open

Error querying a nested struct using UNNEST #308

KevinRizzoTO opened this issue Apr 26, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@KevinRizzoTO
Copy link

KevinRizzoTO commented Apr 26, 2024

What happened?

I've been using this project locally to mock data from the Google Analytics BigQuery export schema. I've noticied one issue when using the following query:

bq --api http://0.0.0.0:9050 --project_id=test-project query "SELECT e.value.int_value FROM test-project.analytics_0000001.events_20240424 AS tp, UNNEST(tp.event_params) as e"
BigQuery error in query operation: Error processing job 'test-project:bqjob_r49c9de07d8bfce5_0000018f1ad943c3_1': failed to scan rows: failed to convert struct from array &{[0xc001a80e80 0xc001a80ec0 0xc001a80f00 0xc001a80f40]}

The problem seems to stem from how the event_params column is setup. It's an array of nested structs, outlined below in the python example script.

What did you expect to happen?

Expected the query to return a value, instead of an error.

How can we reproduce it (as minimally and precisely as possible)?

Use the following python script to reproduce (assuming you have a project called test-project and a dataset called analytics_0000001 already setup):

from google.cloud import bigquery
from google.api_core.client_options import ClientOptions

client = bigquery.Client(
    project="test-project",
    client_options=(
        ClientOptions(api_endpoint="http://localhost:9050")
    )
)

table_id = "test-project.analytics_0000001.events_1"
table = bigquery.Table(table_id)
table.schema = [
    bigquery.SchemaField(
        "event_params",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("key", "STRING", mode="NULLABLE"),
            bigquery.SchemaField(
                "value",
                "RECORD",
                mode="NULLABLE",
                fields=[
                    bigquery.SchemaField(
                        "string_value", "STRING", mode="NULLABLE"
                    ),
                    bigquery.SchemaField(
                        "int_value", "INTEGER", mode="NULLABLE"
                    ),
                    bigquery.SchemaField(
                        "float_value", "FLOAT", mode="NULLABLE"
                    ),
                    bigquery.SchemaField(
                        "double_value", "FLOAT", mode="NULLABLE"
                    ),
                ],
            ),
        ],
    ),
]
client.create_table(table, exists_ok=True)

# insert test row
rows_to_insert = [
    {
        "event_params": [
            {
                "key": "param1",
                "value": {"string_value": "value1"},
            },
            {
                "key": "param2",
                "value": {"int_value": 123},
            },
        ],
    }
]
errors = client.insert_rows_json(table_id, rows_to_insert)

# query for row using unnest
query = f"""
SELECT
    event_param.key AS param_key,
    event_param.value.string_value AS param_value_string,
    event_param.value.int_value AS param_value_int,
FROM
    `{table_id}`,
    UNNEST(event_params) AS event_param
"""
query_job = client.query(query)
results = query_job.result() # error occurs here
for row in results:
    print(row)

Anything else we need to know?

Switching the query above to this no longer has an error:

# query for row using unnest
query = f"""
SELECT
    event_param.key AS param_key,
    event_param.value AS param_value
FROM
    `{table_id}`,
    UNNEST(event_params) AS event_param
"""
query_job = client.query(query)
results = query_job.result() # error occurs here
for row in results:
    print(row)

Something about adding the extra nested value (event_param.value.string_value) results in the error.

@KevinRizzoTO KevinRizzoTO added the bug Something isn't working label Apr 26, 2024
@KevinRizzoTO
Copy link
Author

I managed to track the error log to the go-zetasqlite library. It seems like, for some reason, the struct in the event_param.value record is being stored as an array. However, when trying to serialize the value back into a struct this was never supported. I've put a fork together with a change here that unblocks my current workflow. I'm not really sure if this is the correct fix, or maybe the real issue is how this server is passing data to the zetasqlite driver.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant