pyarrow.lib.ArrowTypeError: Expected bytes, got a 'dict' object #452

Open
ivanpugachtd opened this issue Dec 28, 2021 · 5 comments
Labels
api: bigquery Issues related to the googleapis/python-bigquery-pandas API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@ivanpugachtd

ivanpugachtd commented Dec 28, 2021

Hi!
Loading a DataFrame with pandas-gbq (which uses pyarrow) fails when a column holds list (array) or dictionary (JSON) values, even though the BigQuery documentation says that structured types such as arrays and JSON are supported:

import pandas as pd
import pandas_gbq

df = pd.DataFrame(
    {
        "my_string": ["a", "b", "c"],
        "my_int64": [1, 2, 3],
        "my_float64": [4.0, 5.0, 6.0],
        "my_bool1": [True, False, True],
        "my_bool2": [False, True, False],
        "my_struct": [{"test": "str1"}, {"test": "str2"}, {"test": "str3"}],
    }
)
# gbq_params holds the to_gbq keyword arguments; its contents are not shown in this report.
pandas_gbq.to_gbq(df, **gbq_params)

As a result, the call fails with the following stack trace:

  in bq_to_arrow_array
    return pyarrow.Array.from_pandas(series, type=arrow_type)
  File "pyarrow/array.pxi", line 913, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 311, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'dict' object

Can anyone help with this, please?

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Dec 28, 2021
@yoshi-automation yoshi-automation added triage me I really want to be triaged. 🚨 This issue needs some love. labels Dec 29, 2021
@tswast
Collaborator

tswast commented Jan 4, 2022

Is this writing to an existing table? Could you share the schema of the destination table?

@tswast tswast added type: question Request for information or clarification. Not an issue. and removed 🚨 This issue needs some love. triage me I really want to be triaged. labels Jan 4, 2022
@tswast tswast self-assigned this Jan 4, 2022
@ivanpugachtd
Author

Hi, @tswast
I am trying to upload data to the new table, more precisely I tried both
versions:
pyarrow==6.0.1
pandas-gbq==0.16.0

In fact, I was able to upload the data only when I applied json.dumps() to the columns that contain list or dict values.
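For anyone who wants to try the json.dumps() approach mentioned above, here is a minimal sketch. The helper names to_json_if_nested and serialize_nested_columns are made up for illustration and are not part of pandas-gbq; the sketch assumes the destination column is meant to be a STRING holding JSON text.

```python
import json


def to_json_if_nested(value):
    """Return a JSON string for dict/list values; pass other values through unchanged."""
    if isinstance(value, (dict, list)):
        return json.dumps(value)
    return value


def serialize_nested_columns(df, columns):
    """Return a copy of a pandas DataFrame with the named columns JSON-encoded.

    Hypothetical helper: `df` is assumed to be a pandas DataFrame and
    `columns` a list of column names that hold dict or list values.
    """
    out = df.copy()
    for col in columns:
        out[col] = out[col].map(to_json_if_nested)
    return out


# Intended usage (not executed here, since it needs pandas and credentials):
#   df = serialize_nested_columns(df, ["my_struct"])
#   pandas_gbq.to_gbq(df, **gbq_params)
```

Unlike a plain string cast, json.dumps() writes double-quoted keys and values, so the stored text is strict JSON.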

@grzesir

grzesir commented Jan 18, 2022

Any updates on this? I am getting the same error. The strange thing is that the code works well locally and on Compute Engine, but fails on Cloud Run (even though the same service account is used in both).

@tswast tswast added the priority: p3 Desirable enhancement or fix. May not be included in next release. label Jan 19, 2022
@tswast
Collaborator

tswast commented Jan 19, 2022

I am trying to upload data to the new table, more precisely I tried both

Ah, that probably explains it. Currently, pandas-gbq attempts to determine a schema locally based on the dtypes it detects. It likely gets this wrong for the struct/array data.

I believe we can avoid this problem with #339 where instead of pandas-gbq creating the table, we create the table as part of the load job.
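Until that lands, one way to spot the columns at risk before calling to_gbq is to look at the values themselves rather than the dtypes. This is only a rough sketch: nested_columns is a hypothetical helper, not a pandas-gbq function, and it assumes the data is passed as a mapping of column name to list of values (for a DataFrame, df.to_dict(orient="list") gives that shape).

```python
def nested_columns(columns):
    """Return the names of columns that contain at least one dict or list value.

    `columns` is assumed to be a mapping of column name -> list of values.
    """
    return [
        name
        for name, values in columns.items()
        if any(isinstance(v, (dict, list)) for v in values)
    ]


# Same shape as the DataFrame in the original report.
example = {
    "my_string": ["a", "b", "c"],
    "my_int64": [1, 2, 3],
    "my_struct": [{"test": "str1"}, {"test": "str2"}, {"test": "str3"}],
}
at_risk = nested_columns(example)
```

Columns returned by this check are the ones to serialize or cast before the upload.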

@nabor-slalom-greenparksports

Has there been any progress on updating this issue? I am seeing the same error message.

Could we elaborate on:

I believe we can avoid this problem with #339 where instead of pandas-gbq creating the table, we create the table as part of the load job.

I ask because I am seeing the same issue even when the table already exists and I pass if_exists='replace':

pandas_gbq.to_gbq(dataframe, table_id, project_id=project_id, if_exists='replace')

The workaround that let me load the table successfully was casting the DataFrame column to the string data type.

As an example GCP Cloud Function:

import pandas as pd
import pandas_gbq

def gbq_write(request):

  # TODO: Set project_id to your Google Cloud Platform project ID.
  project_id = "project-id"

  # TODO: Set table_id to the full destination table ID (including the dataset ID).
  table_id = 'dataset.table'

  df = pd.DataFrame(
      {
          "my_string": ["a", "b", "c"],
          "my_int64": [1, 2, 3],
          "my_float64": [4.0, 5.0, 6.0],
          "my_bool1": [True, False, True],
          "my_dates": pd.date_range("now", periods=3),
          "my_struct": [{"test":"str1"},{"test":"str2"},{"test":"str3"}],
      }
  )

  pandas_gbq.to_gbq(df, table_id, project_id=project_id, if_exists='replace')

  return 'Successfully Written'

This produces the error mentioned in this thread:

pyarrow.lib.ArrowTypeError: Expected bytes, got a 'dict' object

With requirements.txt as

pandas==1.4.1
pandas-gbq==0.17.4

To apply the column cast, I added a single line and ended up with:

import pandas as pd
import pandas_gbq

def gbq_write(request):

  # TODO: Set project_id to your Google Cloud Platform project ID.
  project_id = "project-id"

  # TODO: Set table_id to the full destination table ID (including the dataset ID).
  table_id = 'dataset.table'

  df = pd.DataFrame(
      {
          "my_string": ["a", "b", "c"],
          "my_int64": [1, 2, 3],
          "my_float64": [4.0, 5.0, 6.0],
          "my_bool1": [True, False, True],
          "my_dates": pd.date_range("now", periods=3),
          "my_struct": [{"test":"str1"},{"test":"str2"},{"test":"str3"}],
      }
  )

  # Column conversion added to load table
  df['my_struct'] = df['my_struct'].astype("string")

  pandas_gbq.to_gbq(df, table_id, project_id=project_id, if_exists='replace')

  return 'Successfully Written'

With this change, the table loads successfully into BigQuery with the following schema:

Field name | Type
my_string  | STRING
my_int64   | INTEGER
my_float64 | FLOAT
my_bool1   | BOOLEAN
my_dates   | TIMESTAMP
my_struct  | STRING

If you need my_struct to be an actual struct, consider:

SELECT
  *
  # retrieve value from struct
  , json_value(my_struct, '$.test') AS test
  # recreate struct using value for each row
  , struct(json_value(my_struct, '$.test') AS test) AS my_created_struct
FROM `project-id.dataset.table`
ORDER BY my_int64

Row | my_string | my_int64 | my_float64 | my_bool1 | my_dates                       | my_struct        | test | my_created_struct.test
1   | a         | 1        | 4.0        | true     | 2022-03-24 04:14:28.267319 UTC | {'test': 'str1'} | str1 | str1
2   | b         | 2        | 5.0        | false    | 2022-03-25 04:14:28.267319 UTC | {'test': 'str2'} | str2 | str2
3   | c         | 3        | 6.0        | true     | 2022-03-26 04:14:28.267319 UTC | {'test': 'str3'} | str3 | str3
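One thing to be aware of: the my_struct values in the result above use single quotes, which is what Python's str() of a dict looks like, and that is not strict JSON. The small stdlib-only sketch below shows the difference and one way to turn such text back into a dict on the Python side; it says nothing about how BigQuery itself parses the stored value.

```python
import ast
import json

# Text as it appears in the my_struct column after the string cast.
stored = str({"test": "str1"})

# A strict JSON parser rejects the single-quoted form.
try:
    json.loads(stored)
    strict_json_ok = True
except json.JSONDecodeError:
    strict_json_ok = False

# ast.literal_eval safely parses Python literals, so it can recover the dict.
recovered = ast.literal_eval(stored)

# Re-encoding with json.dumps gives double-quoted, strict JSON text.
as_json = json.dumps(recovered)
```

If the column will be read by other JSON tooling later, encoding with json.dumps() before the upload avoids this mismatch.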

@meredithslota meredithslota added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed type: question Request for information or clarification. Not an issue. priority: p3 Desirable enhancement or fix. May not be included in next release. labels Feb 7, 2023
@tswast tswast removed their assignment Feb 7, 2023
6 participants