
to_gbq fails to create empty table with correct schema #376

Closed
ncclementi opened this issue Aug 12, 2021 · 8 comments
Assignees
Labels
api: bigquery Issues related to the googleapis/python-bigquery-pandas API. type: question Request for information or clarification. Not an issue.

Comments

@ncclementi

I want to create an empty table that has a specific schema, which in this specific case I'm inferring from a dask dataframe. The table gets created on BigQuery, however it shows with no schema.

Minimal reproducible example:

import pandas as pd
import pandas_gbq
import dask.dataframe as dd
import random

# generate some data
def gen_data(size=10):
    records = [
        {
            "name": random.choice(["fred", "wilma", "barney", "betty"]),
            "number": random.randint(0, 100),
            "idx": i,
        }
        for i in range(size)
    ]
    return records

test = gen_data(size=50)
df = pd.DataFrame(test)

ddf = dd.from_pandas(df, chunksize=10)

meta = ddf._meta.dtypes.to_dict()
schema = [{"name":k, "type": v} for k, v in meta.items()]

#the schema looks like this
# [{'name': 'name', 'type': dtype('O')},
# {'name': 'number', 'type': dtype('int64')},
#  {'name': 'idx', 'type': dtype('int64')}]

destination_table="my_dataset.test_table"
project_id="my_project"

pandas_gbq.to_gbq(dataframe=pd.DataFrame(), destination_table=destination_table, 
          project_id=project_id, if_exists="append", table_schema=schema)

But when I check the schema on the table created I get:

(Screenshot, 2021-08-12: the created table's schema shows no fields)

Environment details

  • Programming language: python 3.3.10
  • OS: MacOS Big Sur 11.4
  • Package version: pandas-gbq 0.15.0 (pypi)
  • pandas version: 1.3.1
  • Dask version: 2021.07.2
@ncclementi ncclementi added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Aug 12, 2021
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Aug 12, 2021
@tswast
Collaborator

tswast commented Sep 2, 2021

Thanks for the report! I agree this looks like a bug.

@jimfulton jimfulton self-assigned this Sep 2, 2021
@jimfulton
Contributor

There are a couple of issues here:

  1. You've provided a dataframe with 0 columns. This makes the table schema you're providing technically invalid. From the pandas-gbq documentation for the table_schema argument: "If table_schema is provided, it may contain all or a subset of DataFrame columns. If a subset is provided, the rest will be inferred from the DataFrame dtypes.". The table_schema you're providing is a superset. :) The documentation should say that extra fields in the schema are ignored, and it would probably be helpful if `to_gbq` warned about the extra fields. Your resulting schema is empty because the given dataframe has an empty schema.
  2. The types in your table_schema are invalid. They need to be BigQuery type strings, like "INT64". The documentation isn't very clear about this and could definitely be improved.

In summary, the dataframe you give must have a column for every column you want, and the types in table_schema must be BigQuery types (strings).
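For illustration, here is a minimal sketch of deriving a `table_schema` with BigQuery type strings from a dataframe's dtypes. The dtype-to-type mapping and the `build_table_schema` helper are assumptions for this example, not part of the pandas-gbq API (pandas-gbq does its own inference when `table_schema` is omitted):

```python
import pandas as pd

# Assumed mapping from pandas dtype names to BigQuery type strings
# (illustrative only; extend as needed for your columns).
DTYPE_TO_BQ = {
    "object": "STRING",
    "int64": "INT64",
    "float64": "FLOAT64",
    "bool": "BOOL",
    "datetime64[ns]": "TIMESTAMP",
}

def build_table_schema(df):
    """Build a to_gbq-style table_schema list with BigQuery type strings."""
    return [
        {"name": col, "type": DTYPE_TO_BQ.get(str(dtype), "STRING")}
        for col, dtype in df.dtypes.items()
    ]

df = pd.DataFrame({"name": ["fred"], "number": [1], "idx": [0]})
print(build_table_schema(df))
# [{'name': 'name', 'type': 'STRING'},
#  {'name': 'number', 'type': 'INT64'},
#  {'name': 'idx', 'type': 'INT64'}]
```

This produces strings like "STRING" and "INT64" instead of the `dtype('O')` / `dtype('int64')` objects in the original report.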

@jimfulton jimfulton added type: question Request for information or clarification. Not an issue. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Sep 2, 2021
@ncclementi
Author

ncclementi commented Sep 2, 2021

There are a couple of issues here:

  1. You've provided a dataframe with 0 columns. This makes the table schema you're providing technically invalid. From the pandas-gbq documentation for the table_schema argument: "If table_schema is provided, it may contain all or a subset of DataFrame columns. If a subset is provided, the rest will be inferred from the DataFrame dtypes.". The table_schema you're providing is a superset. :) The documentation should say that extra fields in the schema are ignored, and it would probably be helpful if `to_gbq` warned about the extra fields. Your resulting schema is empty because the given dataframe has an empty schema.

@jimfulton thanks for the reply but I'm a bit confused since I'm passing a schema with all my column names:

schema = [{"name":k, "type": v} for k, v in meta.items()]

When you say "the dataframe you give must have a column for every column you want", does this mean I have to provide the columns when passing the argument in

pandas_gbq.to_gbq(dataframe=pd.DataFrame(columns=[list of col names]), ...)

  2. The types in your table_schema are invalid. They need to be BigQuery type strings, like "INT64". The documentation isn't very clear about this and could definitely be improved.

Good to know, I will give this a try.

@jimfulton
Contributor

When you say "the dataframe you give must have a column for every column you want" ?

You're passing an empty dataframe, pd.DataFrame(), as the dataframe argument. The schema in the table created will have whatever columns the dataframe you pass in has.

Does this mean I have to provide the columns when passing the argument in

No, the dataframe you pass in must have the columns you want.

You could use df and filter out all of the rows.

Alternatively, if you had a list of column names, you could do something like

pd.DataFrame({name: [] for name in names})
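Both approaches can be sketched as follows (the column names are just the ones from the example above). Note that slicing an existing dataframe preserves its dtypes, while building from a list of names does not:

```python
import pandas as pd

df = pd.DataFrame({"name": ["fred"], "number": [1], "idx": [0]})

# Option 1: keep df's columns and dtypes but drop all rows.
empty_from_df = df.head(0)  # or df.iloc[0:0]

# Option 2: build an empty frame from a list of column names
# (columns get a default dtype, not the original ones).
names = ["name", "number", "idx"]
empty_from_names = pd.DataFrame({name: [] for name in names})

print(list(empty_from_df.columns))  # ['name', 'number', 'idx']
print(len(empty_from_df))           # 0
```

Option 1 pairs better with dtype-based schema inference, since the empty frame keeps the same dtypes as the full one.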

@jimfulton
Contributor

You want something like:

schema = [
    {'name': 'name', 'type': 'STRING'},
    {'name': 'number', 'type': 'INT64'},
    {'name': 'idx', 'type': 'INT64'},
]
pandas_gbq.to_gbq(
    pd.DataFrame({field['name']: [] for field in schema}),
    destination_table=destination_table, 
    project_id=project_id,
    if_exists="append",
    table_schema=schema)
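Tying this back to the original dask example, the conversion might look like the sketch below. The plain `meta` dict stands in for `ddf._meta.dtypes.to_dict()` from the original report, and the dtype-to-type mapping is an assumption for illustration:

```python
import pandas as pd

# Stand-in for meta = ddf._meta.dtypes.to_dict() from the report.
meta = {
    "name": pd.api.types.pandas_dtype("object"),
    "number": pd.api.types.pandas_dtype("int64"),
    "idx": pd.api.types.pandas_dtype("int64"),
}

# Assumed dtype -> BigQuery type-string mapping (illustrative only).
BQ = {"object": "STRING", "int64": "INT64", "float64": "FLOAT64"}

schema = [{"name": k, "type": BQ[str(v)]} for k, v in meta.items()]
empty = pd.DataFrame({field["name"]: [] for field in schema})

print(schema)
# The pair (empty, schema) is then what you would pass to
# pandas_gbq.to_gbq(empty, ..., table_schema=schema).
```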

@ncclementi
Author

@tswast I've confirmed that the suggestions made by @jimfulton work, and this is a documentation problem. Providing a column list and having the schema types be BigQuery type strings solves the problem.
Feel free to close this issue when you feel appropriate.

@tswast
Collaborator

tswast commented Sep 29, 2021

Thanks for the update. I'll close this once we clarify the docs.

@tswast tswast assigned tswast and unassigned jimfulton Sep 29, 2021
@tswast
Collaborator

tswast commented Nov 10, 2021

Closed by #383

The table_schema argument in the new docs location now includes the following:

If table_schema is provided, it may contain all or a subset of DataFrame columns. If a subset is provided, the rest will be inferred from the DataFrame dtypes. If table_schema contains columns not in the DataFrame, they’ll be ignored.

@tswast tswast closed this as completed Nov 10, 2021