
dt.time objects upload with incorrect time or fail when using bigquery.Client().load_table_from_dataframe #382

Closed
willbowditch opened this issue Nov 11, 2020 · 3 comments
Labels: api: bigquery (Issues related to the googleapis/python-bigquery API), type: feature request ("Nice-to-have" improvement, new feature or different behavior or design)

willbowditch commented Nov 11, 2020:
Uploading datetime.time objects via bigquery.Client().load_table_from_dataframe either fails or stores incorrect times.

A single time (dt.time(12, 0)) uploads without error, but the value stored in BigQuery is wrong (10:03:46.765952). The larger example below fails outright with an out-of-range error. Conversion to Parquet and back via pyarrow works fine, however.

My guess is that the time values written to Parquet are in microseconds (us) but BigQuery expects nanoseconds (ns)?

Thanks.

Environment details

  • OS type and version: OS X 10.15.7
  • Python version: 3.6.10
  • pip version: 20.2.2
  • google-cloud-bigquery version: 2.3.1

Code example

import datetime as dt

import pandas as pd

df = pd.DataFrame(
    dict(
        start_time=[
            dt.time(0, 0),
            dt.time(0, 15),
            dt.time(0, 30),
            dt.time(0, 45),
            dt.time(1, 0),
            dt.time(1, 30),
            dt.time(10, 0),
            dt.time(10, 30),
            dt.time(11, 0),
            dt.time(11, 30),
            dt.time(12, 0),
            dt.time(12, 30),
            dt.time(12, 5),
            dt.time(13, 0),
            dt.time(13, 15),
            dt.time(13, 30),
            dt.time(14, 0),
            dt.time(14, 30),
            dt.time(14, 5),
            dt.time(15, 0),
            dt.time(15, 30),
            dt.time(16, 0),
            dt.time(16, 30),
            dt.time(17, 0),
            dt.time(17, 30),
            dt.time(18, 0),
            dt.time(18, 30),
            dt.time(19, 0),
            dt.time(19, 30),
            dt.time(2, 0),
            dt.time(2, 30),
            dt.time(20, 0),
            dt.time(20, 30),
            dt.time(21, 0),
            dt.time(21, 30),
            dt.time(22, 0),
            dt.time(22, 30),
            dt.time(23, 0),
            dt.time(23, 30),
            dt.time(3, 0),
            dt.time(3, 30),
            dt.time(3, 50),
            dt.time(4, 0),
            dt.time(4, 30),
            dt.time(5, 0),
            dt.time(5, 30),
            dt.time(6, 0),
            dt.time(6, 30),
            dt.time(7, 0),
            dt.time(7, 15),
            dt.time(7, 30),
            dt.time(7, 45),
            dt.time(8, 0),
            dt.time(8, 30),
            dt.time(9, 0),
            dt.time(9, 30),
        ]
    )
)

# Sense check that conversion to parquet and back works...
import pyarrow.parquet as pq
import pyarrow as pa

arrow_table = pa.Table.from_pandas(df)
arrow_table

pq.write_table(arrow_table, "test.parquet")
back_to_pandas = pq.read_table("test.parquet").to_pandas()
back_to_pandas["start_time"].apply(type)
back_to_pandas

from pandas.testing import assert_frame_equal
assert_frame_equal(back_to_pandas, df)  # <-- this is fine

from google.cloud import bigquery
bq = bigquery.Client()
job = bq.load_table_from_dataframe(df, "wb_dev_us.time_test")
job.result()  # <-- this errors

Stack trace

---------------------------------------------------------------------------
BadRequest                                Traceback (most recent call last)
~/projects/demand-model2/uk_demand_model/extract/mdd_extract.py in 
      466 bq = bigquery.Client()
      467 job = bq.load_table_from_dataframe(df, "wb_dev_us.time_test_2")
----> 468 job.result()

~/.local/share/virtualenvs/demand-model2-OBCa31Zf/lib/python3.6/site-packages/google/cloud/bigquery/job/base.py in result(self, retry, timeout)
    629 
    630         kwargs = {} if retry is DEFAULT_RETRY else {"retry": retry}
--> 631         return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
    632 
    633     def cancelled(self):

~/.local/share/virtualenvs/demand-model2-OBCa31Zf/lib/python3.6/site-packages/google/api_core/future/polling.py in result(self, timeout, retry)
    132             # pylint: disable=raising-bad-type
    133             # Pylint doesn't recognize that this is valid in this case.
--> 134             raise self._exception
    135 
    136         return self._result

BadRequest: 400 Error while reading data, error message: Invalid time value 30600000000 for column 'start_time': generic::out_of_range: Cannot return an invalid time value of 30600000000 microseconds relative to the Unix epoch. The range of valid time values is [00:00:00, 23:59:59.999999]
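As a sanity check on the error value (a sketch, not part of the original report): 30600000000 microseconds since midnight is exactly 08:30:00, which is dt.time(8, 30) from the example frame. So the rejected value looks like a microseconds-since-midnight encoding, as Parquet's TIME (micros) logical type uses, that the backend fails to accept as a TIME:

```python
import datetime as dt

def time_to_micros(t: dt.time) -> int:
    """Encode a time of day as microseconds since midnight (the
    encoding used by Parquet's TIME logical type with micros unit)."""
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1_000_000 + t.microsecond

# The value BigQuery rejected in the stack trace above:
print(time_to_micros(dt.time(8, 30)))  # 30600000000
```

Note the value sits comfortably inside BigQuery's stated valid range of [00:00:00, 23:59:59.999999] when read as a time of day, which is what makes the error message confusing.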
tswast (Contributor) commented Nov 11, 2020:

Unfortunately, TIME is not a data type supported by BigQuery's Parquet loader: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#type_conversions

There is a public issue tracking this on the backend: https://issuetracker.google.com/169230812

tswast (Contributor) commented Dec 21, 2020:

With #399, I believe time columns are supported when the source format is set to CSV in the job configuration.

willbowditch (Author) commented:
Thanks @tswast, I can confirm the example above now works with release 2.6.2.

For anyone finding this issue in the future, the configuration needed to load datetime.time columns is:

from google.cloud import bigquery

bq = bigquery.Client()
# Serialize the DataFrame as CSV instead of Parquet so TIME columns load correctly
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV)
job = bq.load_table_from_dataframe(df, "wb_dev_us.time_test", job_config=job_config)
job.result()
