
Allow load_table_from_dataframe to Ignore Extra Schema Fields #1812

Open
ArnoldHueteG opened this issue Feb 7, 2024 · 5 comments

Description:

Environment details

OS: macOS Sonoma 14.1.1
Python version: 3.10
google-cloud-bigquery version: 3.17.1

Steps to reproduce

1. Create a BigQuery schema with additional fields not present in the DataFrame.
2. Use load_table_from_dataframe with the defined schema to load data into BigQuery.

from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd

credentials = service_account.Credentials.from_service_account_file(
    'credentials.json')
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

json_data = [
    {
        "name": "Alice",
        "age": 30
    },
    {
        "name": "Bob",
        "age": 25
    }
]

df = pd.DataFrame(json_data)

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("age", "INTEGER"),
        bigquery.SchemaField("field_not_present", "INTEGER"),
    ],
    job_timeout_ms=5,
)
load_job = client.load_table_from_dataframe(
    df, "xenon-world-399922.oss.your_table",
    job_config=job_config
)
load_job.result()

Current behavior
Currently, when using load_table_from_dataframe from the Python BigQuery client, if the provided schema contains fields that are not present in the DataFrame, a ValueError is raised:
ValueError: bq_schema contains fields not present in dataframe: {'field_not_present'}.

Expected behavior

In contrast to the bq command-line tool, which does not enforce this match when loading JSON data, the Python client currently requires a strict match between the DataFrame columns and the provided schema. This can be limiting.

I propose that load_table_from_dataframe be enhanced to allow more flexible schema matching, similar to the command-line tool's behavior. Specifically, it should not raise an error if the schema contains additional fields not present in the DataFrame. This would allow for more versatile data loading scenarios where the DataFrame might not always have the complete set of fields defined in the BigQuery table schema. A comparison with the JSON load path is sketched below.
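
For comparison, here is a minimal sketch of loading the same rows through load_table_from_json, which submits newline-delimited JSON to the backend instead of converting a DataFrame. It reuses the client, schema, and table ID from the reproduction above; the expectation that the extra schema field simply ends up NULL is an assumption here, mirroring the command-line behavior described.

rows = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
]
json_job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("age", "INTEGER"),
        bigquery.SchemaField("field_not_present", "INTEGER"),
    ],
)
# No client-side comparison between the schema and the in-memory rows happens on
# this path (the ValueError above is specific to the dataframe path), so the job
# is submitted and the backend is expected to fill the missing field with NULL.
load_job = client.load_table_from_json(
    rows, "xenon-world-399922.oss.your_table", job_config=json_job_config
)
load_job.result()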

Use case
This feature would be particularly useful in scenarios where the DataFrame is dynamically generated and might not always contain the full set of fields as per the BigQuery schema. Allowing the function to ignore extra schema fields would enable more flexible and robust data loading operations.

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Feb 7, 2024
@Linchin Linchin added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. priority: p3 Desirable enhancement or fix. May not be included in next release. labels Feb 8, 2024

Linchin commented Feb 8, 2024

Thank you @ArnoldHueteG for this feature request, it would indeed allow for more flexible data loading. Could you clarify the use case here - is it for an existing table, and the dataframe to load may lack certain columns, although its schema is provided? More specifically, should the extra column be present already, and if not, should it be added?

ArnoldHueteG (Author)

The enhancement I am proposing is for the load_table_from_dataframe function to proceed with data loading even when certain schema columns are missing in the DataFrame, and automatically assign null values to these missing, nullable fields.
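
Until the library changes, a caller-side workaround is possible. Below is a minimal sketch, assuming the missing schema fields are NULLABLE: pad the DataFrame with any schema columns it lacks before calling load_table_from_dataframe, so the strict check passes and the padded columns load as NULL. The helper name pad_missing_columns is illustrative, not part of the library.

from typing import Sequence

import pandas as pd
from google.cloud import bigquery


def pad_missing_columns(
    df: pd.DataFrame, schema: Sequence[bigquery.SchemaField]
) -> pd.DataFrame:
    """Return a copy of df with any schema columns it lacks added as all-NULL."""
    padded = df.copy()
    for field in schema:
        if field.name not in padded.columns:
            # None values become NULL when the explicit schema drives the load.
            padded[field.name] = None
    return padded


# Usage with the reproduction above:
# load_job = client.load_table_from_dataframe(
#     pad_missing_columns(df, job_config.schema),
#     "xenon-world-399922.oss.your_table",
#     job_config=job_config,
# )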

@aaaaahaaaaa

is it for an existing table, and the dataframe to load may lack certain columns, although its schema is provided?

We do have this exact use case. It would indeed be great to have that flexibility.

@Linchin Linchin self-assigned this Mar 8, 2024

Linchin commented Mar 8, 2024

Sorry for the late reply. The reason we enforced that the dataframe must contain every column in the schema was to make it easier to catch typos in the schema. So essentially there are two conflicting corner cases we want to handle. Maybe we can make the error message a warning instead? WDYT @tswast?


tswast commented Mar 8, 2024

A warning for missing fields sounds like a good solution to me.
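
For illustration, a rough sketch of what a warning-based check could look like; the function name and placement are assumptions, not the library's actual internals.

import warnings
from typing import Iterable

import pandas as pd
from google.cloud import bigquery


def warn_on_missing_schema_fields(
    dataframe: pd.DataFrame, bq_schema: Iterable[bigquery.SchemaField]
) -> None:
    """Sketch: warn, rather than raise, about schema fields missing from the dataframe."""
    missing = {field.name for field in bq_schema} - set(dataframe.columns)
    if missing:
        warnings.warn(
            f"bq_schema contains fields not present in dataframe: {missing}; "
            "these columns will be loaded as NULL.",
            UserWarning,
        )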
