
load_table_from_dataframe does not error out when nan in a required column - Million dollar bug #1692

Closed · gbmarc1 opened this issue Oct 18, 2023 · 6 comments · Fixed by #1735

Labels: api: bigquery (Issues related to the googleapis/python-bigquery API) · priority: p3 (Desirable enhancement or fix; may not be included in next release) · type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns)

Comments

@gbmarc1 commented Oct 18, 2023

When loading a pandas DataFrame that contains a NaN in a REQUIRED column into BigQuery, the upload succeeds, but the resulting table is not representative of the DataFrame: the values in the column containing the NaN are misordered, and the user is not made aware of it.

Environment details

  • OS type and version:
  • Python version: Python 3.11.4
  • pip version: pip 23.2.1
  • google-cloud-bigquery version:

    Name: google-cloud-bigquery
    Version: 3.12.0
    Summary: Google BigQuery API client library
    Home-page: https://github.com/googleapis/python-bigquery
    Author: Google LLC
    Author-email: googleapis-packages@google.com
    License: Apache 2.0
    Requires: google-api-core, google-cloud-core, google-resumable-media, grpcio, packaging, proto-plus, protobuf, python-dateutil, requests
    Required-by: google-cloud-aiplatform, pandas-gbq

Steps to reproduce

  1. Run the code below

Code example

from google.cloud import bigquery
import pandas as pd
import numpy as np

df = pd.DataFrame(
    [
        ["hello", "string"],
        ["hello2", np.nan],  # NaN in a column declared REQUIRED below
        ["hello3", "valid"],
        ["hello4", "valid2"],
    ],
    columns=["image_uri", "phash"],
)

client = bigquery.Client(project="project")
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("image_uri", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("phash", "STRING", mode="REQUIRED"),
    ]
)
job = client.load_table_from_dataframe(
    df, "foo.foo_bar", job_config=job_config
)
job.result()

df_read = pd.read_gbq("foo.foo_bar", project_id="project")

[Two screenshots: the source DataFrame and the resulting BigQuery table, showing the misordered phash column.]
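Until the fix landed in #1735, a defensive client-side check could catch this before upload. Below is a minimal sketch (not part of the library; require_no_nulls is a hypothetical helper) that validates REQUIRED columns against the DataFrame before calling load_table_from_dataframe:

    import pandas as pd
    from google.cloud import bigquery

    def require_no_nulls(df: pd.DataFrame, schema) -> None:
        """Raise before upload if any REQUIRED column contains NaN/None."""
        for field in schema:
            if field.mode.upper() == "REQUIRED" and df[field.name].isna().any():
                raise ValueError(f"REQUIRED column {field.name!r} contains null values")

Calling require_no_nulls(df, job_config.schema) just before client.load_table_from_dataframe(...) turns the silent misalignment into an immediate, local error.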

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Oct 18, 2023
@Gaurang033 (Contributor) commented

It seems like there is an issue with the pyarrow library. The following code reproduces the same bug without BigQuery:

    import pyarrow as pa
    import pyarrow.parquet
    import pandas as pd

    # Both fields are declared non-nullable, yet the 'name' array contains a null.
    fields = [pa.field('id', pa.string(), nullable=False),
              pa.field('name', pa.string(), nullable=False)]
    array = [pa.array(['10', '11', '12', '13']),
             pa.array(['AAA', None, 'BBB', 'CCC'])]
    table = pa.Table.from_arrays(array, schema=pa.schema(fields))  # no error raised
    pyarrow.parquet.write_table(table, 'test_arrow.parquet', compression='SNAPPY', use_compliant_nested_type=True)
    df = pd.read_parquet('test_arrow.parquet', engine='pyarrow')
    print(df)
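The schema and the data simply disagree, and pyarrow does not object. Two standard pyarrow attributes make the mismatch on the table built above explicit:

    # Continuing from the snippet above: the schema says non-nullable,
    # yet the column reports a null.
    print(table.schema.field('name').nullable)  # False
    print(table.column('name').null_count)      # 1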

Possible solution

Until the issue is fixed in pyarrow, disable nullability for the schema in the pyarrow table. This will surface an unhandled exception from the backend, though.

_pandas_helpers.py

def bq_to_arrow_field(bq_field, array_type=None):
    """Return the Arrow field, corresponding to a given BigQuery column.

    Returns:
        None: if the Arrow type cannot be determined.
    """
    arrow_type = bq_to_arrow_data_type(bq_field)
    if arrow_type is not None:
        if array_type is not None:
            arrow_type = array_type  # For GEOGRAPHY, at least initially
        # is_nullable = bq_field.mode.upper() == "NULLABLE"
        is_nullable = True
        return pyarrow.field(bq_field.name, arrow_type, nullable=is_nullable)

Exception raised once nullability is removed:

        if self._exception is not None:
            # pylint: disable=raising-bad-type
            # Pylint doesn't recognize that this is valid in this case.
>           raise self._exception
E           google.api_core.exceptions.BadRequest: 400 Required field phash cannot be null

@Gaurang033 (Contributor) commented

Another way to fix this is to verify whether the data contains nulls when a column's mode is REQUIRED; I created a pull request for that. (A sketch of such a check follows below.)
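For illustration only, a minimal standalone version of that kind of check (check_required_nulls is a hypothetical name; this assumes nothing about the actual PR's implementation) could look like:

    import pyarrow as pa

    def check_required_nulls(table: pa.Table, required_columns) -> None:
        """Raise if any column that BigQuery declares REQUIRED contains nulls."""
        for name in required_columns:
            if table.column(name).null_count > 0:
                raise ValueError(f"Required column {name!r} contains null values")

Run on the arrow table just before it is serialized for upload, this turns the silent corruption into a loud, local error.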

@Linchin (Contributor) commented Nov 3, 2023

@Gaurang033 Thank you so much for your investigation and for proposing a solution! It looks like pyarrow does not check whether the arrays are consistent with the schema when Table.from_arrays() is called, because table = pa.Table.from_arrays(array, schema=pa.schema(fields)) (in your reproduction example) yields the following without an error:

   id  name
0  10   AAA
1  11  None
2  12   BBB
3  13   CCC

It feels like major negligence on the pyarrow side. I wonder whether it's a bug or done by design.

At the same time, we have an umbrella issue #1646, which covers a bunch of issues related to the load_table_from_*() methods. We may need a more comprehensive look into them all, in case this patch breaks other user paths.

@Gaurang033 (Contributor) commented

@Linchin Totally agree. It seems like a bug in the pyarrow library, since I can reproduce it without using BigQuery. If you think we need this patch until pyarrow fixes it, let's merge it. I fixed the failing test case: when an unknown data type was provided, the arrow field came back as null, so the method I added was throwing an exception.

@chalmerlowe (Contributor) commented

I added a similar comment on PR #1698:

I worry about spending too much time and energy trying to create a workaround for what we all agree is a problem in pyarrow. This feels like it creates greater complexity in our code, increased fragility, and a higher maintenance burden in the long run.

Am I missing something?

I am inclined to close this issue as out of scope for us.

@tswast reopened this Nov 22, 2023
@tswast (Contributor) commented Nov 22, 2023

Given that arrow is intended for fast conversions, I think the correct fix for us is to stop using REQUIRED/NULLABLE from BigQuery to determine our local arrow schema. Instead, always set nullable=True locally.
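As a sketch only of that direction (the actual change landed in #1735, which may differ), the function quoted earlier might look like this, reusing the module's existing bq_to_arrow_data_type helper and dropping the mode check entirely:

    import pyarrow

    def bq_to_arrow_field(bq_field, array_type=None):
        """Sketch: always build the local arrow field as nullable."""
        arrow_type = bq_to_arrow_data_type(bq_field)  # existing module helper
        if arrow_type is not None:
            if array_type is not None:
                arrow_type = array_type  # For GEOGRAPHY, at least initially
            # Don't mirror REQUIRED into the local schema; the backend still
            # enforces it and returns a clear 400 error if nulls are present.
            return pyarrow.field(bq_field.name, arrow_type, nullable=True)
        return None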

gcf-merge-on-green bot pushed a commit that referenced this issue Nov 22, 2023:

fix: `load_table_from_dataframe` now assumes there may be local null values (#1735)

Even if the remote schema is REQUIRED.
Fixes #1692 🦕
@tswast self-assigned this Nov 22, 2023
kiraksi pushed a commit to kiraksi/python-bigquery that referenced this issue Nov 27, 2023:

fix: `load_table_from_dataframe` now assumes there may be local null values (googleapis#1735)
kiraksi added a commit that referenced this issue Nov 28, 2023:

feat: removed pkg_resources from all test files and moved importlib into pandas extra (#1726)

(Squashed commit; among other changes it carries "fix: `load_table_from_dataframe` now assumes there may be local null values (#1735)", which fixes #1692.)