Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Geometry column #400

Open
andhuang-CLGX opened this issue Sep 29, 2021 · 4 comments
Open

Geometry column #400

andhuang-CLGX opened this issue Sep 29, 2021 · 4 comments
Labels
api: bigquery Issues related to the googleapis/python-bigquery-pandas API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@andhuang-CLGX
Copy link

andhuang-CLGX commented Sep 29, 2021

https://geopandas.org/docs/reference/api/geopandas.GeoDataFrame.html lists to_gbq, but I think it just inherits pandas-gbq; is there any plans to support a geometry column here or should this be in geopandas?

Right now I think the geometry column gets converted to string in the schema

@andhuang-CLGX andhuang-CLGX added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Sep 29, 2021
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Sep 29, 2021
@tswast tswast added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Sep 30, 2021
@tswast
Copy link
Collaborator

tswast commented Sep 30, 2021

Now that we have some Geometry support in the google-cloud-bigquery library, I think this is a reasonable request.

We may just need to pass geometry_as_object=True if available (necessary library versions are installed). https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.table.RowIterator.html#google.cloud.bigquery.table.RowIterator.to_dataframe

@brews
Copy link

brews commented Mar 4, 2022

I think I just hit a regression because of this. Loading a shapefile into BigQuery with geopandas worked back on August 5th, 2021 but now fails with

...
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas/core/frame.py", line 2054, in to_gbq
    gbq.to_gbq(
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas/io/gbq.py", line 212, in to_gbq
    pandas_gbq.to_gbq(
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas_gbq/gbq.py", line 1148, in to_gbq
    connector.load_data(
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas_gbq/gbq.py", line 565, in load_data
    chunks = load.load_chunks(
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas_gbq/load.py", line 237, in load_chunks
    load_parquet(
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/pandas_gbq/load.py", line 129, in load_parquet
    client.load_table_from_dataframe(
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/google/cloud/bigquery/client.py", line 2671, in load_table_from_dataframe
    _pandas_helpers.dataframe_to_parquet(
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 591, in dataframe_to_parquet
    arrow_table = dataframe_to_arrow(dataframe, bq_schema)
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 534, in dataframe_to_arrow
    bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
  File "/Users/User/miniconda3/envs/segment_wgts/lib/python3.9/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 292, in bq_to_arrow_array
    return pyarrow.Array.from_pandas(series, type=arrow_type)
  File "pyarrow/array.pxi", line 913, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 311, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Input object was not a NumPy array

some of the conda-forge packages in use:

...
geopandas                 0.10.2             pyhd8ed1ab_1    conda-forge
geopandas-base            0.10.2             pyha770c72_1    conda-forge
google-cloud-bigquery     2.34.1             pyh6c4a22f_0    conda-forge
google-cloud-bigquery-core 2.34.1             pyh6c4a22f_0    conda-forge
google-cloud-bigquery-storage 2.11.0             pyh6c4a22f_0    conda-forge
google-cloud-bigquery-storage-core 2.11.0             pyh6c4a22f_0    conda-forge
pandas-gbq                0.17.1             pyh6c4a22f_0    conda-forge
pyarrow                   6.0.1           py39h4d6536f_5_cpu    conda-forge
shapely                   1.8.0            py39hbfbc381_5    conda-forge
...

Looks like geometry is getting mapped to a string here, and so all the juicy GEOGRAPHY-detecting code in python-bigquery gets ignored.

Is fixing this a matter of getting this mapping to map geopandas types to GEOGRAPHY? It looks like python-bigquery is doing something similar to detect pandas types here but is using dtype.name instead of dtype.kind as a key — that might be a more robust approach?

I might be able to put together a PR if I can get guidance or confirmation on this from someone in the know.

@brews
Copy link

brews commented Mar 9, 2022

I got good help elsewhere and wanted to follow up with a solution/workaround.

My issue can be resolved by converting geopandas' geometry column to a WKT column in a vanilla Pandas DataFrame and then passing the vanilla DataFrame to BQ.

    working_shapefile = pd.DataFrame({
        col: (shapefile[col] if col != 'geometry' else [g.wkt for g in shapefile[col].values])
        for col in shapefile.columns.values
    })

Sorry for the spam. Hopefully, this will help someone in the future.

@Riezebos
Copy link

Riezebos commented Jun 2, 2022

Here is some extra input which might be helpful in case geometry support will be added.

The solution by @brews worked for a few rows in my case, I needed to manually specify the schema to get it to parse the geometry column as GEOGRAPHY. This is what I used:

import geopandas as gpd
from google.cloud import bigquery

client = bigquery.Client()

table_id = "dataset.tablename"

df = gpd.read_file("my_file")

#determine schema
type_dict = {
    'b' : 'BOOLEAN',
    'i' : 'INTEGER',
    'f' : 'FLOAT',
    'O' : 'STRING',
    'S' : 'STRING',
    'U' : 'STRING'
}
schema = [{'name' : col_name, 'type' : "GEOGRAPHY" if col_name == "geometry" else type_dict.get(col_type.kind, 'STRING')} for (col_name, col_type) in df.dtypes.iteritems()]

#https://cloud.google.com/bigquery/docs/pandas-gbq-migration#loading_a_pandas_dataframe_to_a_table
job_config = bigquery.LoadJobConfig(schema=schema)
job = client.load_table_from_dataframe(
    df.to_wkt(), #same output as github issue solution
    table_id, 
    job_config=job_config
)
job.result()

This worked for a small dataset but for a larger set I quickly started running into all kinds of errors where bq would not accept the polygons:

Invalid polygon loop: Edge 462 has duplicate vertex with edge 554
Invalid nesting: loop 1 should not contain loop 0

I found a potential solution using geojson instead of wkt: https://stackoverflow.com/questions/62233152/uploading-to-bigquery-gis-invalid-nesting-loop-1-should-not-contain-loop-0

After quite some experimentation I found a way to create a df that seemed acceptable to bq similar to the solution above:

df_json = pd.DataFrame({
    col: (df[col] if col != 'geometry' else df[col].map(lambda x: json.dumps(shapely.geometry.mapping(x))))
    for col in df
})

This led to another similar error as before:

Invalid geography value for column 'geometry', error: Polygon's first loop must be shell. It is nested in loop 2

I then found that bq has an option to fix this type of data using make_valid=>true: https://gis.stackexchange.com/a/376870
Now I have a fully working solution, it is relatively simple but it requires overwriting the table after creation:

import geopandas as gpd
from google.cloud import bigquery

client = bigquery.Client()

table_id = "dataset.tablename"
df = gpd.read_file("my_file")

df.to_wkt().to_gbq(table_id,if_exists="replace")

cols = ",".join("st_geogfromtext(geometry, make_valid => TRUE) as geometry" if col == "geometry" else col for col in df)

query = f"CREATE OR REPLACE TABLE {table_id} AS SELECT {cols} FROM {table_id}"
# print(query)
query_job = client.query(query)

I thought I'd share this solution in case others have similar issues.

It would be nice if make_valid=True could somehow be set in df.to_gbq or in bigquery.LoadJobConfig, but I have no idea whether the architecture allows this.

EDIT: I learned that shapely also has make_valid, this still didn't work with to_wkt because of the issue mentioned before but it does work with the geojson solution. This solution worked for all the data I tested as well:

import geopandas as gpd
import json
import shapely
from shapely.validation import make_valid
from google.cloud import bigquery

client = bigquery.Client()

table_id = "dataset.tablename"

df = gpd.read_file("my_file")

#determine schema
type_dict = {
    'b' : 'BOOLEAN',
    'i' : 'INTEGER',
    'f' : 'FLOAT',
    'O' : 'STRING',
    'S' : 'STRING',
    'U' : 'STRING'
}
schema = [{'name' : col_name, 'type' : "GEOGRAPHY" if col_name == "geometry" else type_dict.get(col_type.kind, 'STRING')} for (col_name, col_type) in df.dtypes.iteritems()]

df_json = pd.DataFrame({
    col: (df[col] if col != 'geometry' else df[col].map(lambda x: ujson.dumps(shapely.geometry.mapping(make_valid(x)))))
    for col in df
})

job_config = bigquery.LoadJobConfig(schema=schema)
job = client.load_table_from_dataframe(
    df_json,
    table_id, 
    job_config=job_config
)
job.result()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-pandas API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

No branches or pull requests

4 participants