dask geopandas to parquet does not seem to persist spatial partitions #260

Open
v2thegreat opened this issue Dec 6, 2023 · 1 comment

@v2thegreat
Contributor

Problem

When a Dask GeoDataFrame is written to Parquet with Dask GeoPandas, its spatial partitions do not appear to be persisted: after reading the dataset back, spatial_partitions is None.

Expected Behavior

The spatial partitions of the GeoDataFrame should be persisted in the resulting Parquet files. That is, the spatial properties of the GeoDataFrame, such as its geometry information, should survive the round trip, so that the reloaded data is much faster to query.

Steps to Reproduce

  1. Create a GeoDataFrame with Dask GeoPandas and spatially shuffle it.
  2. Write the GeoDataFrame to Parquet format using Dask.
  3. Read the Parquet files back into a Dask GeoDataFrame.
  4. Observe that spatial_partitions on the reloaded GeoDataFrame is None, so the spatial partitions do not seem to have been persisted.

Example Code

import dask_geopandas as dg
import geopandas as gpd

# Create a GeoDataFrame with Dask GeoPandas
sample_data = dg.from_geopandas(gpd.read_file('path/to/shapefile.shp'), npartitions=4)
sample_data = sample_data.spatial_shuffle(shuffle='tasks')
sample_data.spatial_partitions.explore() # visualize the spatial partitions here

# Write the GeoDataFrame to Parquet
sample_data.to_parquet('path/to/output', write_metadata_file=True)

sample_data_reloaded = dg.read_parquet('path/to/output', gather_spatial_partitions=True)
sample_data_reloaded.spatial_partitions # None

My end goal is to query the data quickly and read only the partitions whose bounds overlap the area I'm interested in, for example when using clip or cx.
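To make that goal concrete, here is a minimal, library-free sketch of the partition pruning idea: given one bounding box per partition, keep only the partitions whose box overlaps the query box. The function names (`bbox_intersects`, `partitions_to_read`) and the example bounds are hypothetical and are not part of dask-geopandas; in practice the per-partition bounds would come from `spatial_partitions` once it is populated.

```python
from typing import List, Sequence, Tuple

# A bounding box as (minx, miny, maxx, maxy). This layout is an assumption
# made for the sketch; it is not a dask-geopandas type.
BBox = Tuple[float, float, float, float]


def bbox_intersects(a: BBox, b: BBox) -> bool:
    """Return True if two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def partitions_to_read(partition_bounds: Sequence[BBox], query: BBox) -> List[int]:
    """Return the indices of partitions whose bounds overlap the query box."""
    return [
        i for i, bounds in enumerate(partition_bounds)
        if bbox_intersects(bounds, query)
    ]


# Hypothetical bounds for a dataset split into two spatial partitions.
partition_bounds = [
    (-88.0, 41.6, -87.7, 42.1),  # partition 0
    (-87.7, 41.6, -87.5, 42.1),  # partition 1
]

# Hypothetical area of interest, lying entirely inside partition 1.
query = (-87.65, 41.8, -87.6, 41.9)

selected = partitions_to_read(partition_bounds, query)
print(selected)
```

Without persisted spatial partitions there are no bounds to test against, so every partition has to be read before it can be filtered, which is why losing them on the Parquet round trip matters for query speed.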

@TomAugspurger
Contributor

I think this is working properly with the latest versions of these libraries:

dask 2024.2.1, pyarrow 15.0.0, geopandas 0.14.3, and dask-geopandas main. Can you try this script?

import geodatasets
import tempfile
import dask_geopandas
import geopandas

df = geopandas.read_file(geodatasets.get_url("geoda airbnb"))
ddf = dask_geopandas.from_geopandas(df, npartitions=2).spatial_shuffle()
ddf.to_parquet("/tmp/out.parquet")
dask_geopandas.read_parquet("/tmp/out.parquet").spatial_partitions

For me that outputs

0    POLYGON ((-87.59852 41.77113, -87.59852 42.023...
1    POLYGON ((-87.52414 41.64454, -87.52414 41.977...
dtype: geometry
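Since the result seems to depend on which versions are installed, it may help to compare the local environment against the versions listed above before rerunning the script. The following is a small sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names passed to it are assumed to match how the packages were installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


# Distribution names are assumptions; adjust them if your install differs.
report = installed_versions(["dask", "pyarrow", "geopandas", "dask-geopandas"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

Posting this output alongside the result of the script makes it easier to tell whether the missing spatial partitions come from an older release.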
