load_table_from_dataframe -- support alternative serialization formats #383

Closed · tswast opened this issue Nov 11, 2020 · 3 comments · Fixed by #399

Assignees: cguardia
Labels: api: bigquery (Issues related to the googleapis/python-bigquery API.) · type: feature request ('Nice-to-have' improvement, new feature or different behavior or design.)

Comments

tswast (Contributor) commented Nov 11, 2020

Currently, pandas-gbq doesn't use load_table_from_dataframe because pandas-gbq serializes to CSV, which has better support for TIME, DATE, and DATETIME than Parquet. See: #56, #382

Struct / Array support could also be improved (though this is partly due to issues in the pyarrow library). See #19.

I propose supporting the CSV serialization type.

product-auto-label bot added the api: bigquery label on Nov 11, 2020
tswast self-assigned this on Nov 11, 2020
tswast added the type: feature request label on Nov 11, 2020
tswast assigned cguardia and unassigned tswast on Nov 17, 2020
tswast (Contributor, Author) commented Nov 17, 2020

To go from DataFrame to CSV, we can use the same tempfile logic we already have for Parquet, but call to_csv instead, with a few parameters to ensure proper data types:

https://github.com/pydata/pandas-gbq/blob/ac2d2fe4ac0025109f8df3723e3f03a337face94/pandas_gbq/load.py#L16-L30
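
A minimal sketch of what that could look like, assuming a hypothetical _load_dataframe_as_csv helper that mirrors the existing Parquet tempfile flow in load_table_from_dataframe (the helper name and the to_csv parameters are illustrative, not final):

import os
import tempfile

from google.cloud import bigquery


def _load_dataframe_as_csv(client, dataframe, destination, job_config):
    # Hypothetical helper: serialize via DataFrame.to_csv into a temporary
    # file and hand it to load_table_from_file, mirroring the tempfile flow
    # already used for Parquet.
    job_config.source_format = bigquery.SourceFormat.CSV

    fd, path = tempfile.mkstemp(suffix="_job.csv")
    os.close(fd)
    try:
        dataframe.to_csv(
            path,
            index=False,
            # header=False assumes the destination table schema defines the
            # column order, so no header row is needed.
            header=False,
            # ISO 8601 timestamps keep DATETIME / TIMESTAMP columns parseable
            # by the BigQuery CSV loader.
            date_format="%Y-%m-%dT%H:%M:%S.%f",
        )
        with open(path, "rb") as csv_file:
            return client.load_table_from_file(
                csv_file, destination, job_config=job_config
            )
    finally:
        os.remove(path)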

tswast (Contributor, Author) commented Nov 17, 2020

This would also require adding CSV to the allowed serialization formats here:

if job_config.source_format != job.SourceFormat.PARQUET:
    raise ValueError(
        "Got unexpected source_format: '{}'. Currently, only PARQUET is supported".format(
            job_config.source_format
        )
    )
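
For illustration, one way the check could be relaxed (the tuple name and error wording here are assumptions, not necessarily what #399 ends up doing):

# Accept both CSV and Parquet as DataFrame serialization formats.
supported_formats = (job.SourceFormat.CSV, job.SourceFormat.PARQUET)

if job_config.source_format not in supported_formats:
    raise ValueError(
        "Got unexpected source_format: '{}'. Currently, only CSV and PARQUET are supported.".format(
            job_config.source_format
        )
    )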

cguardia (Contributor) commented
@tswast I have the initial work for this in #399. I'm looking for comments on which parameters are best to use in the to_csv call, and on whether this is going in the right direction.
