
to_gbq fails to create empty table with correct schema #376

Closed
ncclementi opened this issue Aug 12, 2021 · 8 comments
Assignees
Labels
api: bigquery Issues related to the googleapis/python-bigquery-pandas API. type: question Request for information or clarification. Not an issue.

Comments

@ncclementi

I want to create an empty table that has a specific schema, which in this specific case I'm inferring from a dask dataframe. The table gets created on BigQuery, however it shows with no schema.

Minimal reproducible example:

import pandas as pd
import pandas_gbq
import dask.dataframe as dd
import random

# generate some data
def gen_data(size=10):
    records = [
        {
            "name": random.choice(["fred", "wilma", "barney", "betty"]),
            "number": random.randint(0, 100),
            "idx": i,
        }
        for i in range(size)
    ]
    return records

test = gen_data(size=50)
df = pd.DataFrame(test)

ddf = dd.from_pandas(df, chunksize=10)

meta = ddf._meta.dtypes.to_dict()
schema = [{"name":k, "type": v} for k, v in meta.items()]

#the schema looks like this
# [{'name': 'name', 'type': dtype('O')},
# {'name': 'number', 'type': dtype('int64')},
#  {'name': 'idx', 'type': dtype('int64')}]

destination_table="my_dataset.test_table"
project_id="my_project"

pandas_gbq.to_gbq(dataframe=pd.DataFrame(), destination_table=destination_table, 
          project_id=project_id, if_exists="append", table_schema=schema)

But when I check the schema on the table created I get:

(Screenshot, 2021-08-12: the created table's schema shows no fields)

Environment details

  • Programming language: python 3.3.10
  • OS: MacOS Big Sur 11.4
  • Package version: pandas-gbq 0.15.0 (pypi)
  • pandas version: 1.3.1
  • Dask version: 2021.07.2
@ncclementi ncclementi added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Aug 12, 2021
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Aug 12, 2021
@tswast
Collaborator

tswast commented Sep 2, 2021

Thanks for the report! I agree this looks like a bug.

@jimfulton jimfulton self-assigned this Sep 2, 2021
@jimfulton
Contributor

There are a couple of issues here:

  1. You've provided a dataframe with 0 columns. This makes the table schema you're providing technically invalid. From the pandas-gbq documentation for the table_schema argument: "If table_schema is provided, it may contain all or a subset of DataFrame columns. If a subset is provided, the rest will be inferred from the DataFrame dtypes.". The table_schema you're providing is a superset. :) The documentation should say that extra fields in the schema are ignored, and it would probably be helpful if `to_gbq` warned about the extra fields. Your resulting schema is empty because the given dataframe has an empty schema.
  2. The types in your table_schema are invalid. They need to be BigQuery type strings, like "INT64". The documentation isn't very clear about this and could definitely be improved.

In summary, the dataframe you give must have a column for every column you want, and the types in table_schema must be BigQuery types (strings).
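For illustration, here is a minimal sketch of deriving a `table_schema` with BigQuery type strings from a dataframe's dtypes. The dtype-to-type mapping and the `build_table_schema` helper are assumptions for this example, not part of the pandas-gbq API (pandas-gbq does its own inference when `table_schema` is omitted):

```python
import pandas as pd

# Assumed mapping from pandas dtype names to BigQuery type strings
# (illustrative only; extend as needed for your columns).
DTYPE_TO_BQ = {
    "object": "STRING",
    "int64": "INT64",
    "float64": "FLOAT64",
    "bool": "BOOL",
    "datetime64[ns]": "TIMESTAMP",
}

def build_table_schema(df):
    """Build a to_gbq-style table_schema list with BigQuery type strings."""
    return [
        {"name": col, "type": DTYPE_TO_BQ.get(str(dtype), "STRING")}
        for col, dtype in df.dtypes.items()
    ]

df = pd.DataFrame({"name": ["fred"], "number": [1], "idx": [0]})
print(build_table_schema(df))
# [{'name': 'name', 'type': 'STRING'},
#  {'name': 'number', 'type': 'INT64'},
#  {'name': 'idx', 'type': 'INT64'}]
```

This produces strings like "STRING" and "INT64" instead of the `dtype('O')` / `dtype('int64')` objects in the original report.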

@jimfulton jimfulton added type: question Request for information or clarification. Not an issue. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Sep 2, 2021
@ncclementi
Author

ncclementi commented Sep 2, 2021

There are a couple of issues here:

  1. You've provided a dataframe with 0 columns. This makes the table schema you're providing technically invalid. From the pandas-gbq documentation for the table_schema argument: "If table_schema is provided, it may contain all or a subset of DataFrame columns. If a subset is provided, the rest will be inferred from the DataFrame dtypes.". The table_schema you're providing is a superset. :) The documentation should say that extra fields in the schema are ignored, and it would probably be helpful if `to_gbq` warned about the extra fields. Your resulting schema is empty because the given dataframe has an empty schema.

@jimfulton thanks for the reply but I'm a bit confused since I'm passing a schema with all my column names:

schema = [{"name":k, "type": v} for k, v in meta.items()]

When you say "the dataframe you give must have a column for every column you want", does this mean I have to provide the columns when passing the argument in

pandas_gbq.to_gbq(dataframe=pd.DataFrame(columns=[list of col names]), ...)

  2. The types in your table_schema are invalid. They need to be BigQuery type strings, like "INT64". The documentation isn't very clear about this and could definitely be improved.

Good to know, I will give this a try.

@jimfulton
Contributor

When you say "the dataframe you give must have a column for every column you want" ?

You're passing an empty dataframe, pd.DataFrame(), as the dataframe argument. The schema in the table created will have whatever columns the dataframe you pass in has.

Does this mean I have to provide the columns when passing the argument in

No, the dataframe you pass in must have the columns you want.

You could use df and filter out all of the rows.

Alternatively, if you had a list of column names, you could do something like

pd.DataFrame({name: [] for name in names})
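Both approaches can be sketched as follows (the column names are just the ones from the example above). Note that slicing an existing dataframe preserves its dtypes, while building from a list of names does not:

```python
import pandas as pd

df = pd.DataFrame({"name": ["fred"], "number": [1], "idx": [0]})

# Option 1: keep df's columns and dtypes but drop all rows.
empty_from_df = df.head(0)  # or df.iloc[0:0]

# Option 2: build an empty frame from a list of column names
# (columns get a default dtype, not the original ones).
names = ["name", "number", "idx"]
empty_from_names = pd.DataFrame({name: [] for name in names})

print(list(empty_from_df.columns))  # ['name', 'number', 'idx']
print(len(empty_from_df))           # 0
```

Option 1 pairs better with dtype-based schema inference, since the empty frame keeps the same dtypes as the full one.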

@jimfulton
Contributor

You want something like:

schema = [
    {'name': 'name', 'type': 'STRING'},
    {'name': 'number', 'type': 'INT64'},
    {'name': 'idx', 'type': 'INT64'},
]
pandas_gbq.to_gbq(
    pd.DataFrame({field['name']: [] for field in schema}),
    destination_table=destination_table, 
    project_id=project_id,
    if_exists="append",
    table_schema=schema)
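Tying this back to the original dask example, the conversion might look like the sketch below. The plain `meta` dict stands in for `ddf._meta.dtypes.to_dict()` from the original report, and the dtype-to-type mapping is an assumption for illustration:

```python
import pandas as pd

# Stand-in for meta = ddf._meta.dtypes.to_dict() from the report.
meta = {
    "name": pd.api.types.pandas_dtype("object"),
    "number": pd.api.types.pandas_dtype("int64"),
    "idx": pd.api.types.pandas_dtype("int64"),
}

# Assumed dtype -> BigQuery type-string mapping (illustrative only).
BQ = {"object": "STRING", "int64": "INT64", "float64": "FLOAT64"}

schema = [{"name": k, "type": BQ[str(v)]} for k, v in meta.items()]
empty = pd.DataFrame({field["name"]: [] for field in schema})

print(schema)
# The pair (empty, schema) is then what you would pass to
# pandas_gbq.to_gbq(empty, ..., table_schema=schema).
```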

@ncclementi
Author

@tswast I've confirmed that the suggestions made by @jimfulton work, and this is a documentation problem. Providing a column list and having the schema types be BigQuery type strings solves the problem.
Feel free to close this issue when you feel appropriate.

@tswast
Collaborator

tswast commented Sep 29, 2021

Thanks for the update. I'll close this once we clarify the docs.

@tswast tswast assigned tswast and unassigned jimfulton Sep 29, 2021
@tswast
Collaborator

tswast commented Nov 10, 2021

Closed by #383

The table_schema argument in the new docs location now includes the following:

If table_schema is provided, it may contain all or a subset of DataFrame columns. If a subset is provided, the rest will be inferred from the DataFrame dtypes. If table_schema contains columns not in the DataFrame, they’ll be ignored.

@tswast tswast closed this as completed Nov 10, 2021