Improving spatial operations

In this doc I will be using a generic ~30k row CSV as a reference. All that is relevant is that it has a geometry column plus any number of additional columns containing floats. Aggregations such as sums and means are performed on these columns after a spatial operation has returned a subset of the data frame. See the example geometries CSV as well as the run.py (Python 3) file for a stripped-down example dataset and the operation logic.
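
For context, the load step reduces to roughly the following (a sketch only; the file and column names are illustrative, and run.py is the authoritative version):

    import geopandas as gpd
    import pandas as pd
    from shapely import wkt

    # Read the reference CSV, parse the WKT geometry column into shapely
    # objects, and promote the result to a GeoDataFrame.
    df = pd.read_csv('example_geometries.csv')
    df['geometry'] = df['geometry'].apply(wkt.loads)
    gdf = gpd.GeoDataFrame(df, geometry='geometry')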

Current situation

All geospatial operations in GeoPandas are costly; this includes, but is not limited to, unary unions, intersections, within booleans, and distance calculations. The underlying functions that GeoPandas calls are those from GEOS, which is itself a C++ port of the Java Topology Suite (JTS). All of this is to say, it is fast and reliable. Parts of PostGIS work with GEOS, too.

So what is causing the slowdown? I can think of two possibilities right now:

  1. The bottleneck here is Python itself.
  2. There are optimizations that should be implemented that I've failed to identify.

Example

In the included example Python script, I load a dataset containing geometries and a value column (named "value"). The goal is to generate a sum of all values within a certain buffer (in this case, 500 units) of the reference geometry. This operation is performed on each row's geometry.

Presently, the script performs two preprocessing steps intended to speed things up. First, it creates a spatial index from the GeoDataFrame based on its geometry column. Second, it precomputes a buffered geometry column.
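
In code, those two steps amount to something like this (a sketch, assuming the gdf GeoDataFrame from above and the 500-unit buffer distance used throughout):

    # GeoPandas builds an R-tree spatial index lazily via the .sindex
    # accessor; touching it here forces construction up front.
    sindex = gdf.sindex

    # Precompute the buffered geometry column once, rather than
    # re-buffering inside every per-row summarize() call.
    gdf['buffered'] = gdf.geometry.buffer(500)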

The summarize operation applied to each row performs a comparative buffer check: it buffers the reference geometry and finds all geometries that intersect that buffer, then does the inverse, taking the buffered versions of the other geometries and checking whether the unbuffered reference geometry intersects those. This methodology mirrors what PostGIS does for ST_DWithin (see the reference in their repo).
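
Condensed, the per-row check looks approximately like this (a sketch of the logic described above, not a verbatim copy of run.py; "value" is the column being summed):

    def summarize(row):
        # Buffer the reference geometry, then use the spatial index to
        # pre-filter candidates by bounding box before exact tests.
        buffered = row.geometry.buffer(500)
        candidates = gdf.iloc[list(sindex.intersection(buffered.bounds))]

        # Forward test: which candidate geometries intersect the
        # buffered reference geometry?
        hits = candidates[candidates.intersects(buffered)]

        # Inverse test: which precomputed candidate buffers intersect
        # the unbuffered reference geometry?
        hits = hits[hits['buffered'].intersects(row.geometry)]
        return hits['value'].sum()

    result = gdf.apply(summarize, axis=1)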

Results

Running run.py produces the following output (elapsed time in seconds before each colon):

    $ python3 run.py
    0.244s: loaded in reference csv
    1.787s: updated geom column to MultiPolygon type
    0.008s: converted dataframe to geodataframe
    3.345s: created spatial indices
    5.472s: Created buffered geometry column
    563.553s: completed summarize() application on geodataframe

              value
    0  10319.032630
    1     32.595417
    2      3.991619
    3     14.131560
    4     32.533223

Note: Times will naturally vary from run to run.

The apply operation takes a long time: about 9.4 minutes. That's nearly 0.02 seconds per row, which seems terribly slow. You can imagine how a series of spatial operations can quickly amount to cumbersome execution times.

Question

What, if any, additional optimizations could be used to make geometric operations faster in Python? For example, I've found additional preparation steps that could potentially be used, such as shapely.prepared.prep() (and asked about them on SO), but from what I can tell they provide no substantive improvement over what is already achieved via the spatial index.
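
For reference, the prepared-geometry variant looks roughly like this (a sketch; prep() caches internal structures on the buffered geometry, but the predicate calls then run in a plain Python loop):

    from shapely.prepared import prep

    def summarize_prepared(row):
        buffered = row.geometry.buffer(500)
        prepared = prep(buffered)  # one-time indexing of the buffer
        candidates = gdf.iloc[list(sindex.intersection(buffered.bounds))]
        # Prepared geometries only speed up scalar predicate calls, so
        # the exact test happens row by row rather than vectorized.
        mask = [prepared.intersects(geom) for geom in candidates.geometry]
        return candidates[mask]['value'].sum()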

A Final Note

Tom Augspurger makes a strong case for vectorization and the use of NumPy operations in lieu of the costly .apply() operation. I'm very open to exploring how to operationalize this concept, but I have concerns over its feasibility with larger datasets. Our example is similar to his "origin-destination" matrix calculation. With that methodology, we would need to create a sub-index of each of the 30k geometries for each "origin" index. Dealing with a matrix that is (n rows)^2 long becomes unreasonable for in-memory operations and calls into question how operational the vectorized methodology really is.
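
To put numbers on that concern (a quick back-of-the-envelope, not something run.py computes):

    # A full pairwise matrix over the ~30k geometries:
    n = 30_000
    pairs = n ** 2                 # 900,000,000 geometry pairs
    print(pairs * 8 / 1e9, 'GB')   # ~7.2 GB just for one float64 per pair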
