Some other projects for fast / parallel operations #1
Hello. Maybe first, in a few words, the background on why I wrote/am writing geofileops. It is actually a project in support of https://github.com/theroggy/orthoseg, which is a project to extract features (e.g. roads, buildings, ...) from e.g. aerial photos using deep neural networks and output them as vector data. The data extraction is done per image tile of roughly 500x500 meters, and afterwards the tiles need to be dissolved/unioned together. So I was looking for two things:
Because I didn't find any (open source) projects that met the above criteria, I started by writing some helper functions within orthoseg. They kept growing and growing though, so I moved the code to a separate project: geofileops. Some of the projects you listed I was aware of, some were new to me:
What file format do you typically use for those bigger files? (shapefiles, geopackage?)
Yes, previously it was indeed tied to the experimental geopandas branch; now it works with plain geopandas (although when trying dask-geopandas, I would still recommend installing pygeos so that geopandas uses it).
It's explicitly one of the goals to enable working on larger-than-memory data, yes (that's a typical use case for dask when used on a laptop, next to running the same code on a larger (distributed) cluster). But I did a quick experiment with [...]. For your use case, the writing part would also be needed, of course.
Geopackage. In my experience shapefile only works up to 2 GB... but I didn't explicitly test it for this case; I already used geopackage.
I use concurrent.futures to start multiple processes. Each process then:
The main thread (or process in this case) copies the separate temporary result files into one result file to avoid file-locking issues. Because, for larger files, the calculation is split into more batches than there are active worker processes, the main thread can typically keep up with copying to the final result file, as it starts moving data as soon as the first worker process is ready. I noticed that geopandas will support appending in the next version... so once this is released :-) I'll test whether a simple locking mechanism (using a BUSY file or something like that) to make sure only one worker process accesses the result file at a time also gives good results. That would simplify the code quite a bit. BTW, something I forgot in my previous answer: you don't need to thank me for using geopandas... thank you for developing it!
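A stdlib-only sketch of that pattern (the batch "computation" here is a hypothetical stand-in for the real geo operation; `process_batch` and `run_parallel` are illustrative names, not the actual geofileops code):

```python
import concurrent.futures
import os
import tempfile

def process_batch(batch_id, rows, tmp_dir):
    # Hypothetical worker: stands in for reading a batch and running the
    # real geo operation; writes its partial result to its own temp file.
    path = os.path.join(tmp_dir, f"batch_{batch_id}.txt")
    with open(path, "w") as f:
        for row in rows:
            f.write(f"{row * 2}\n")  # placeholder for the real computation
    return path

def run_parallel(batches, result_path, max_workers=2):
    # Main process: submit all batches, then append each temporary file to
    # the single result file as soon as its worker finishes. Only the main
    # process ever touches the result file, so no file locking is needed.
    with tempfile.TemporaryDirectory() as tmp_dir:
        with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(process_batch, i, rows, tmp_dir)
                       for i, rows in enumerate(batches)]
            with open(result_path, "w") as result:
                for fut in concurrent.futures.as_completed(futures):
                    with open(fut.result()) as part:
                        result.write(part.read())
```

The trade-off of this single-writer design: no locking is needed, at the cost of one extra copy of the data per batch.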
Hi Pieter,
Sorry for opening an issue a bit "out of nowhere" here, but I noticed your repo (through the geopandas issue you commented on), and thought to share a few things.
I don't want to seem like a jerk who "knows better"; I just genuinely thought you might be interested in those links.
First, really cool that you are building upon GeoPandas! ;) (or at least for parts of the repo)
Since you are focusing on "fast" operations and doing things in parallel, those projects and developments might be of interest to you:
PyGEOS: this is a new wrapper of GEOS, and is going to become "Shapely 2.0" (the long story can be read here: https://github.com/shapely/shapely-rfc/pull/1/files). This blogpost gives a bit of background: https://caspervdw.github.io/Introducing-Pygeos/, but basically it provides all the functionality of Shapely, but through faster, vectorized functions. And in the upcoming GeoPandas 0.8 release (for now you need to use master), this can already be used under the hood if you have pygeos installed (see https://geopandas.readthedocs.io/en/latest/install.html#using-the-optional-pygeos-dependency), and should automatically give a speed-up in spatial operations.
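As a small illustration of the vectorized style (shown here with Shapely 2.0, which ships the PyGEOS functions under the `shapely` namespace; at the time of this thread you would `import pygeos` and call the same functions):

```python
import numpy as np
import shapely  # Shapely 2.0 exposes the pygeos-style vectorized API

# One vectorized C-level call per operation, instead of a Python loop
# calling .buffer() on each individual Shapely 1.x object.
points = shapely.points(np.arange(1000), np.zeros(1000))
buffered = shapely.buffer(points, 1.0)
areas = shapely.area(buffered)  # each area approximates pi (unit-radius circle)
```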
For file reading, there is also https://github.com/brendan-ward/pyogrio, which is experimenting with a faster replacement of fiona (but this might be more experimental).
For running things in parallel, there is the experimental https://github.com/jsignell/dask-geopandas package to connect GeoPandas and Dask (a general parallel computing / task scheduling package in Python, specifically targeting data science use cases). The idea is that it divides the GeoDataFrame into chunks (partitions) and then operations are run in parallel on those partitions. But this is mostly done under the hood by dask, so for the user it gives a very similar interface to GeoPandas. For example, for a parallel buffer operation, code could look like:
and the buffer operation would be run in parallel (using multithreading by default, but you could also choose multiprocessing).
I saw that you were parallelizing some operations like buffer, and for those the dask-geopandas project might be interesting (it won't be able to help with making ogr interactions parallel, though). It's a very young project, but contributions are always welcome ;)