Can't use Pandas to upload a REPEATED field (e.g. list of strings) #913

Closed
emma-brainlabs opened this issue Aug 25, 2021 · 2 comments · Fixed by #925
Labels
api: bigquery Issues related to the googleapis/python-bigquery API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@emma-brainlabs

I am trying to upload a list of strings stored in a pandas DataFrame to a BigQuery table with a REPEATED field. When running this code:

import pandas as pd
from google.cloud import bigquery
from google.oauth2 import service_account

df = pd.DataFrame([{"repeated": ["hi", "hello"], "not_repeated": "a_string"}])

table = bigquery.Table(
    "project.dataset_name.table_name",
    schema=[
        bigquery.SchemaField("repeated", "string", "REPEATED"),
        bigquery.SchemaField("not_repeated", "string", "NULLABLE"),
    ],
)

bigquery_client = bigquery.Client(
    credentials=service_account.Credentials.from_service_account_file(
        "service-account-credentials.json"
    )
)
bigquery_client.insert_rows_from_dataframe(table, df)

I get this error:

Traceback (most recent call last):
  File "test.py", line 20, in <module>
    bigquery_client.insert_rows_from_dataframe(table, df)
  File "/Users/emmacombes/.local/share/virtualenvs/bq-stats-sAw4GWcD/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 3433, in insert_rows_from_dataframe
    result = self.insert_rows(table, rows_chunk, selected_fields, **kwargs)
  File "/Users/emmacombes/.local/share/virtualenvs/bq-stats-sAw4GWcD/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 3381, in insert_rows
    json_rows = [_record_field_to_json(schema, row) for row in rows]
  File "/Users/emmacombes/.local/share/virtualenvs/bq-stats-sAw4GWcD/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 3381, in <listcomp>
    json_rows = [_record_field_to_json(schema, row) for row in rows]
  File "/Users/emmacombes/.local/share/virtualenvs/bq-stats-sAw4GWcD/lib/python3.7/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 800, in dataframe_to_json_generator
    if pandas.isna(value):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

This stops execution, so nothing is uploaded to BigQuery. I can confirm that if I run the same code without the list column (i.e. df = pd.DataFrame([{"not_repeated": "a_string"}])), the error does not occur.

I think this can be traced back to the line if pandas.isna(value):, which was recently changed in a previous PR (use pandas function to check for NaN #750) to fix an earlier issue (dataframe_to_json_generator doesn't support pandas.NA type #729). Evaluating pandas.isna(value) on a list returns an array of bools, which the if statement cannot interpret as a single truth value.
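The behaviour described above can be reproduced without BigQuery at all. The snippet below is a minimal sketch using only pandas and NumPy; the variable names are illustrative and not taken from the library.

```python
import numpy as np
import pandas as pd

# A cell value as it would appear in the "repeated" column.
value = ["hi", "hello"]

# For list-like input, pandas.isna works element-wise and returns an array.
result = pd.isna(value)
print(type(result), result)

# Using that array as an `if` condition raises the ValueError from the traceback.
message = None
try:
    if result:
        pass
except ValueError as exc:
    message = str(exc)
print(message)

# For a scalar such as a plain string, pandas.isna returns a single boolean.
scalar_result = pd.isna("a_string")
print(scalar_result)
```

This matches the observation that the error only appears when the DataFrame contains a list-valued column.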

I can confirm that the code works with an older version of this library, from before this change was made.

Environment details

  • OS type and version: macOS Big Sur 11.5.2
  • Python version: Python 3.7.5
  • pip version: pip 19.2.3
  • google-cloud-bigquery version: 2.24.0
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Aug 25, 2021
@plamut plamut added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Aug 26, 2021
@plamut
Contributor

plamut commented Aug 26, 2021

Thanks for reporting this and doing some initial research. It does indeed sound like a regression.

@plamut plamut self-assigned this Aug 26, 2021
@plamut
Contributor

plamut commented Aug 27, 2021

Update: I can confirm the issue. The code sample works in v2.20, but breaks in v2.21.0+. I'll investigate why the tests didn't catch that.
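For illustration, here is a sketch of the kind of check that would cover this case. It is not the library's code or the actual fix: `is_missing_scalar` and `dataframe_rows` are hypothetical helpers, and the assumption is that only scalar missing values (None, NaN, pandas.NA) should be skipped while list-valued cells pass through unchanged.

```python
import pandas as pd


def is_missing_scalar(value):
    """Return True only for scalar missing values (hypothetical helper)."""
    # List-like cells (REPEATED values) are not scalars, so pandas.isna is
    # never evaluated on them and no ambiguous array reaches the `if`.
    return pd.api.types.is_scalar(value) and bool(pd.isna(value))


def dataframe_rows(df):
    """Yield one dict per row, omitting missing scalar values (hypothetical)."""
    columns = list(df.columns)
    for row in df.itertuples(index=False, name=None):
        yield {
            column: value
            for column, value in zip(columns, row)
            if not is_missing_scalar(value)
        }


df = pd.DataFrame(
    [
        {"repeated": ["hi", "hello"], "not_repeated": "a_string"},
        {"repeated": ["x"], "not_repeated": None},
    ]
)
rows = list(dataframe_rows(df))
print(rows)
```

A test built on a DataFrame like this one, with a list column next to a nullable scalar column, exercises both the REPEATED case from this issue and the missing-value case from #729.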

This PR broke it: #750.

2 participants