Some other projects for fast / parallel operations #1

Open
jorisvandenbossche opened this issue Jun 12, 2020 · 3 comments
Labels
question Further information is requested

Comments

@jorisvandenbossche

Dag Pieter,

Sorry for opening an issue a bit "out of nowhere" here, but I noticed your repo (through the geopandas issue you commented on) and thought to share a few things.
I don't want to come across as a jerk who "knows better"; I just genuinely thought you might be interested in those links.

First, really cool you are using GeoPandas to build upon! ;) (or at least for parts of the repo)

Since you are focusing on "fast" operations and doing things in parallel, the following projects and developments might be of interest to you:

  • PyGEOS: this is a new wrapper of GEOS that is going to become "Shapely 2.0" (the long story can be read here: https://github.com/shapely/shapely-rfc/pull/1/files). This blogpost gives a bit of background: https://caspervdw.github.io/Introducing-Pygeos/, but basically it provides all the functionality of Shapely through faster, vectorized functions (a small illustration follows after this list).
    And in the upcoming GeoPandas 0.8 release (for now you need to use master), this can already be used under the hood if you have pygeos installed (see https://geopandas.readthedocs.io/en/latest/install.html#using-the-optional-pygeos-dependency), and it should automatically give a speed-up in spatial operations.

  • For file reading, there is also https://github.com/brendan-ward/pyogrio, which is experimenting with a faster replacement for fiona (but it is still quite experimental).

  • For running things in parallel, there is the experimental https://github.com/jsignell/dask-geopandas package to connect GeoPandas and Dask (a general parallel computing / task scheduling package in Python, specifically targeting data science use cases). The idea is that it divides the GeoDataFrame into chunks (partitions) and then runs operations in parallel on those partitions. But this is mostly done under the hood by dask, so for the user it gives a very similar interface to GeoPandas. For example, for a parallel buffer operation, code could look like:

    import geopandas
    import dask_geopandas

    # read the full dataset and split it into 4 partitions
    df = geopandas.read_file('...')
    ddf = dask_geopandas.from_geopandas(df, npartitions=4)
    # the buffer is then applied per partition, in parallel, when the result is computed
    ddf.geometry = ddf.buffer(...)
    

    and the buffer operation would be run in parallel (using multithreading by default, but you could also choose to use multiprocessing).
    I saw that you were parallelizing some operations like buffer, and for those the dask-geopandas project might be interesting (it won't be able to help with making ogr interactions parallel, though). It's a very young project, but contributions are always welcome ;)
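
As a small illustration of the vectorized pygeos style from the first bullet (a rough sketch with made-up array sizes and buffer distance, not code taken from the pygeos docs):

    import numpy as np
    import pygeos

    # build an array of geometries in one call instead of a Python loop over Shapely objects
    points = pygeos.points(np.random.rand(1_000_000), np.random.rand(1_000_000))

    # vectorized operations then work on the whole array at once, in C
    buffered = pygeos.buffer(points, 0.01)
    areas = pygeos.area(buffered)

And when pygeos is installed next to GeoPandas 0.8, the backend can also be toggled explicitly with the geopandas.options.use_pygeos flag.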

@theroggy
Collaborator

Hello,
No problem at all, very interested!

Maybe a few words of background on why I wrote/am writing geofileops. It is actually a project in support of https://github.com/theroggy/orthoseg, a project to extract features (e.g. roads, buildings,...) from e.g. aerial photos and output them as vector data, using deep neural networks. The extraction is done per image tile of roughly 500x500 meters, and afterwards the results need to be dissolved/unioned together.
However, the result file that needs to be dissolved can be relatively large, the biggest one till now was 16 GB... even though my "area of interest" is relatively small (the Flemish region in Belgium).

So I was looking for two things:

  1. it should be possible to run geo operations on geo files that don't fit in memory
  2. it should use the available CPUs, otherwise it is way too slow (days of processing)

Because I didn't find any (open source) projects that met the above criteria, I started by writing some helper functions within orthoseg. They kept growing and growing though, so I moved the code to a separate project: geofileops.

Some of the projects you listed I was aware of, some were new to me:

  • Pygeos + integration in GeoPandas 0.8: I'm very aware of this. I have been following the efforts/ideas to speed up GeoPandas for a while, including e.g. the experiments with Cython, so I'm very interested in the speedup this will give.
  • pyogrio: I didn't know this one. It could indeed be very interesting if this could be merged into geopandas and/or fiona. However, for my personal use file I/O is not a (big) bottleneck at the moment, as I'm parallelizing the file I/O.
  • dask-geopandas: I did encounter this one in my scan of existing projects some time ago, but if I remember correctly, back then it was coupled with the cython experiments in geopandas. The examples I remember from back then also weren't as "clean" and straightforward as the one you put here, but I might be remembering that wrong. I'm not sure if solving the issue of processing geo files that are too large to process in memory is possible/a goal/... in this project?

@jorisvandenbossche
Author

However, the result file that needs to be dissolved can be relatively large, the biggest one till now was 16 GB

What file format do you typically use for those bigger files? (shapefiles, geopackage?)

dask-geopandas: ... back then it was coupled with the cython experiments in geopandas.

Yes, previously it was indeed tied to the experimental geopandas branch; now it works with plain geopandas (although I would say installing pygeos, so geopandas uses it under the hood, is still recommended when trying dask-geopandas).

I'm not sure if solving the issue of processing geo files that are too large to process in memory is possible/a goal/... in this project?

It's explicitly one of the goals to enable working on larger-than-memory data, yes (that's a typical use case for dask when used on a laptop, next to running the same code on a larger (distributed) cluster).
Although it is a goal, it's not yet directly/easily possible, since there is no built-in file reader yet that does that for you.

But I did a quick experiment with a read_file version that reads the file in chunks (I think rather similar to what you do for reading with geopandas): https://nbviewer.jupyter.org/gist/jorisvandenbossche/c94bfc9626e622fda7285ed88a4d771a
With that, one could already read files that are larger than memory and calculate something on them.
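
A minimal sketch of that chunked-reading idea (an illustration under the assumption of a rows=slice based reader, not necessarily what the linked notebook does; total_rows and chunk_size are made-up parameters that would come from the file's metadata):

    import dask
    import geopandas

    @dask.delayed
    def chunk_area(path, row_slice):
        # each task only loads its own slice of rows, so the full file never has to fit in memory at once
        chunk = geopandas.read_file(path, rows=row_slice)
        return chunk.area.sum()

    def total_area(path, total_rows, chunk_size=100_000):
        tasks = [
            chunk_area(path, slice(start, min(start + chunk_size, total_rows)))
            for start in range(0, total_rows, chunk_size)
        ]
        # dask.compute runs the delayed tasks in parallel (multithreaded by default);
        # the partial sums are then combined in the main process
        return sum(dask.compute(*tasks))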

For your use case, the writing part would also be needed, of course.
How do you deal with that in geofileops? As I don't think you can write to the same file from different processes with GDAL?

@theroggy
Collaborator

theroggy commented Jun 13, 2020

However, the result file that needs to be dissolved can be relatively large, the biggest one till now was 16 GB

What file format do you typically use for those bigger files? (shapefiles, geopackage?)

Geopackage. In my experience shapefile only works up to 2 GB... but I didn't explicitly test it for this case; I already used Geopackage.

I'm not sure if solving the issue of processing geo files that are too large to process in memory is possible/a goal/... in this project?

It's explicitly one of the goals to enable working on larger-than-memory data, yes (that's a typical use case for dask when used on a laptop, next to running the same code on a larger (distributed) cluster).
Although it is a goal, it's not yet directly/easily possible, since there is no built-in file reader yet that does that for you.

But I did a quick experiment with a read_file version that reads the file in chunks (I think rather similar to what you do for reading with geopandas): https://nbviewer.jupyter.org/gist/jorisvandenbossche/c94bfc9626e622fda7285ed88a4d771a
With that, one could already read files that are larger than memory and calculate something on them.

For your use case, the writing part would also be needed, of course.
How do you deal with that in geofileops? As I don't think you can write to the same file from different processes with GDAL?

I use concurrent.futures to start multiple processes. Each process then:

  1. reads the data it needs. For the geopandas cases indeed either via the "rows" parameter (buffer, simplify,...) or via the "bbox" parameter (for dissolve/union).
  2. calculates what it needs to calculate
  3. writes the result to a separate temporary file per process

The main thread (or process in this case) copies the separate temporary result files into one result file, to avoid file locking issues. Because, for larger files, the calculation is split into more batches than there are active worker processes, the main thread can typically keep up with copying the results into the final result file, as it already starts moving data once the first worker process is finished.
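
A hypothetical, simplified sketch of that batch pattern (not the actual geofileops code: the helper names, batch_size and distance parameters are made up, and the in-memory concat at the end is a simplification, since geofileops copies/appends the temporary files rather than concatenating in memory):

    import concurrent.futures
    import tempfile
    from pathlib import Path

    import geopandas
    import pandas as pd

    def buffer_batch(input_path, row_slice, distance, tmp_dir):
        # 1. read only the rows this batch needs, 2. calculate, 3. write to its own temporary file
        batch = geopandas.read_file(input_path, rows=row_slice)
        batch.geometry = batch.buffer(distance)
        out_path = Path(tmp_dir) / f"batch_{row_slice.start}.gpkg"
        batch.to_file(out_path, driver="GPKG")
        return out_path

    def buffer_parallel(input_path, output_path, distance, total_rows, batch_size=10_000, workers=4):
        with tempfile.TemporaryDirectory() as tmp_dir, \
                concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
            futures = [
                pool.submit(buffer_batch, input_path,
                            slice(start, min(start + batch_size, total_rows)), distance, tmp_dir)
                for start in range(0, total_rows, batch_size)
            ]
            # simplified merge: read the finished batches back and write one result file;
            # the scheme described above instead copies each temporary file into the result
            # file as soon as a worker finishes, so nothing large is held in memory
            parts = [geopandas.read_file(f.result())
                     for f in concurrent.futures.as_completed(futures)]
            result = geopandas.GeoDataFrame(pd.concat(parts, ignore_index=True))
            result.to_file(output_path, driver="GPKG")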

I noticed that geopandas will support appending in the next version... so once that is released :-) I'll test whether a simple locking mechanism, e.g. a BUSY file or something like that to make sure only one worker process accesses the result file at a time, also gives good results. That would simplify the code quite a bit.
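
A hypothetical sketch of that BUSY-file idea (assuming a geopandas version with append support, i.e. to_file(..., mode="a"); the function and parameter names are made up):

    import os
    import time

    def append_with_lock(batch_gdf, result_path, lock_path, poll_seconds=0.1):
        # os.O_EXCL makes the open fail if the lock file already exists,
        # so only one worker process can hold the lock at a time
        while True:
            try:
                fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                break
            except FileExistsError:
                time.sleep(poll_seconds)
        try:
            # append this worker's batch to the shared result file
            batch_gdf.to_file(result_path, driver="GPKG", mode="a")
        finally:
            os.close(fd)
            os.remove(lock_path)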

BTW: something I forgot in my previous answer: you don't need to thank me for using geopandas... thank you for developing it!

@theroggy theroggy added the question Further information is requested label May 20, 2021