Support pyarrow.large_* as column type in dataframe upload/download #1706

Open
cvm-a opened this issue Oct 30, 2023 · 3 comments
Labels
api: bigquery (Issues related to the googleapis/python-bigquery API.)
type: feature request (‘Nice-to-have’ improvement, new feature or different behavior or design.)

Comments

cvm-a commented Oct 30, 2023

Is your feature request related to a problem? Please describe.

  1. (P1 because it is so simple) When uploading a dataframe with pyarrow.large_string() columns, a "Pyarrow could not determine the type of columns" warning is raised. This should be a trivial addition to _ARROW_SCALAR_IDS_TO_BQ.
  2. (P2 because it takes longer to fix) When downloading results, can we support (maybe even default to using) pyarrow.large_string (and the other pyarrow.large_* types) instead of pyarrow.string in QueryJob.to_arrow? pyarrow.string has a 2 GiB limit on the size of the data in the whole column (not just in a single element) that is guaranteed to work correctly. Larger query results may not break immediately, because the data is usually chunked into smaller pieces, but many dataframe operations on these columns (such as aggregations or even indexing) trigger an "ArrowInvalid: offset overflow" error. This is mainly caused by design decisions in Arrow itself ([C++][Python] Large strings cause ArrowInvalid: offset overflow while concatenating arrays apache/arrow#33049), but we can try to keep BigQuery users safe. The performance/memory overhead has usually been small, and 2 GiB is very easy to exceed. (See the sketch after this list for the offset-width difference.)
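A minimal pyarrow sketch (not part of the original report) illustrating the offset-width difference behind the 2 GiB ceiling; the array contents are just placeholders:

```python
import pyarrow as pa

# pa.string() stores value offsets as 32-bit integers, which is where the
# roughly 2 GiB-per-array limit on character data comes from;
# pa.large_string() uses 64-bit offsets and has no such limit.
small = pa.array(["spam", "eggs"], type=pa.string())
large = small.cast(pa.large_string())

print(small.type)  # string
print(large.type)  # large_string
```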

Describe the solution you'd like

  1. Add pyarrow.large_* keys to _ARROW_SCALAR_IDS_TO_BQ (a rough sketch follows this list).
  2. Add an option to QueryJob.to_arrow to return large_* types, or make them the default.
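A rough sketch of what (1) could look like, assuming _ARROW_SCALAR_IDS_TO_BQ maps pyarrow scalar type ids to BigQuery type names; the surrounding entries and dict layout shown here are illustrative, not copied from the library:

```python
import pyarrow as pa

# Illustrative mapping from pyarrow scalar type ids to BigQuery type names.
_ARROW_SCALAR_IDS_TO_BQ = {
    # ... existing entries, e.g.:
    pa.string().id: "STRING",
    pa.binary().id: "BYTES",
    # proposed large_* additions:
    pa.large_string().id: "STRING",
    pa.large_binary().id: "BYTES",
}
```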

Describe alternatives you've considered
For (2), I have converted the string columns to large_string myself immediately after loading, and it has not caused issues yet, but the Arrow API does not seem to guarantee that this will keep working (see the sketch below).
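A minimal sketch of that workaround, assuming the result table has already been fetched with QueryJob.to_arrow(); the helper name and the commented usage lines are illustrative:

```python
import pyarrow as pa

def widen_string_columns(table: pa.Table) -> pa.Table:
    """Cast every pa.string() column of an Arrow table to pa.large_string()."""
    fields = [
        pa.field(f.name, pa.large_string()) if f.type == pa.string() else f
        for f in table.schema
    ]
    return table.cast(pa.schema(fields))

# table = query_job.to_arrow()          # fetched elsewhere
# table = widen_string_columns(table)   # avoids int32 offset overflow later
```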
Additional context

product-auto-label bot added the api: bigquery label Oct 30, 2023
Linchin added the type: feature request label Oct 31, 2023
Gaurang033 (Contributor) commented

@Linchin is BigQuery able to handle more than 2 GB of data in a single cell?

cvm-a (Author) commented Dec 19, 2023 via email

Linchin (Contributor) commented Feb 3, 2024

There are maximum cell sizes for CSV and JSON, but there is no mention of limits for other formats (except that total file size must be under 15 TB). I suppose that means there's no limit?
