
Environment details #5

Open
martinfleis opened this issue Mar 31, 2022 · 5 comments

Comments

@martinfleis

Hi, can you share a bit about the environment you used to test geopandas? I am especially interested in pygeos - is it installed? If not, I would recommend trying that as you will likely get a massive speedup.

Another suggestion would be to use dask-geopandas instead of vanilla geopandas, but I guess you already thought about that :). Since the timing also includes IO, you can also try replacing geopandas.read_file with pyogrio.read_dataframe.
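
A minimal sketch of the kind of swap I mean (file paths are placeholders):

```python
import pyogrio

# pyogrio reads the whole layer straight into a GeoDataFrame, bypassing the
# feature-by-feature fiona path that geopandas.read_file uses by default.
gdf = pyogrio.read_dataframe("input.gpkg")   # instead of geopandas.read_file("input.gpkg")

# ... processing ...

# The same applies on the way out.
pyogrio.write_dataframe(gdf, "output.gpkg")  # instead of gdf.to_file("output.gpkg", driver="GPKG")
```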

@theroggy
Contributor

> Hi, can you share a bit about the environment you used to test geopandas? I am especially interested in pygeos - is it installed? If not, I would recommend trying that as you will likely get a massive speedup.

Sure. The tests were run with pygeos installed...

> Another suggestion would be to use dask-geopandas instead of vanilla geopandas, but I guess you already thought about that :).

Yes! I saw a few hours ago that dask-geopandas 0.1.0 was released, so I already started on a dask-geopandas version :-).

> Since the timing also includes IO, you can also try replacing geopandas.read_file with pyogrio.read_dataframe.

I did some tests using pyogrio and it indeed makes a huge difference, especially for the writing part. I didn't use it in the benchmark because during my tests it didn't feel production-ready yet, and the pyogrio integration in geopandas isn't ready yet either.
However, because the difference is so big and really interesting to know about, I'll add a version of the geopandas benchmark that uses pyogrio for IO...

@theroggy
Contributor

theroggy commented Apr 1, 2022

I added a version of the geopandas benchmark that uses pyogrio for I/O... and as expected the time spent on the IO part was reduced significantly. Especially for the buffer operation this makes a huge difference, as the operation itself takes very little time...

  • read: from 43.5s -> 7.5s
  • buffer: stays 22s
  • write: from 256s -> 56.5s

For e.g. dissolve the impact is obviously smaller, as that operation needs a lot more processing time.

I also added a benchmark for dask-geopandas, but at the moment only for buffer. I have never used dask before, so I'll need to figure out how best to use it for the more interesting cases... I saw that the dask-geopandas manual has a specific section about dissolve... so I'll have a look at that...

Some remarks regarding the dask-geopandas buffer benchmark:

  • the buffer operation itself became even more negligible: 2.8s, so the 12 CPUs available are put to good use :-)
  • writing to Geopackage still takes the same 56.5s, so the lion's share of the 65s total. Writing to .parquet is a lot faster (7.5s) because the writing is parallelized into separate files per partition, so the same test with parquet as output takes ~18s for dask-geopandas since the files don't need to be merged into one file (see the sketch after this list).
  • the test doesn't support processing files too large to fit in memory yet, as the file is read/written in one go. The reading isn't parallel yet either, so there is still room for further performance improvement there as well...
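
For reference, the buffer test boils down to roughly the following sketch (paths, the partition count and the buffer distance are placeholders, and the single-file read is the non-parallel part mentioned above):

```python
import dask_geopandas
import pyogrio

# Read in one go (not parallel yet), then split into partitions.
gdf = pyogrio.read_dataframe("input.gpkg")
ddf = dask_geopandas.from_geopandas(gdf, npartitions=12)

# The buffer runs per partition, so all available cores are used.
ddf["geometry"] = ddf.geometry.buffer(10)

# Parquet output is written as one file per partition, so the write is parallel
# too; a single Geopackage output would still be one sequential write at the end.
ddf.to_parquet("buffered_parquet")
```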

@theroggy
Contributor

theroggy commented Apr 1, 2022

I added dissolve benchmarks for dask-geopandas now as well, but the results aren't great. Based on the documentation on how it works under the hood, it is probably also "normal" that it isn't faster than vanilla geopandas, as the unary_union operation is applied twice on all geometries, which is quite costly. Or... I did something stupid in the implementation; that's obviously also possible ;-).
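
For context, the dissolve benchmark is essentially a sketch like this (paths and the group column are placeholders, not necessarily the exact benchmark code):

```python
import dask_geopandas
import pyogrio

gdf = pyogrio.read_dataframe("input.gpkg")
ddf = dask_geopandas.from_geopandas(gdf, npartitions=12)

# Each partition is dissolved first and the partial results are then unioned
# again, which is where the double unary_union cost mentioned above comes from.
dissolved = ddf.dissolve(by="region").compute()
```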

Am I right that overlay operations aren't supported yet? In that case I can't implement the intersect benchmark yet...

@martinfleis
Author

martinfleis commented Apr 26, 2022

Yes, dissolve is not always faster; that is why the documentation includes an extra snippet with a faster option for in-memory data - https://dask-geopandas.readthedocs.io/en/stable/guide/dissolve.html#alternative-solution.

The dissolve implementation in dask-geopandas is designed to be scalable and distributable, so it can work out-of-core but at the cost of performance in some situations.

> Am I right that overlay operations aren't supported yet?

Correct, as in: overlay itself is not implemented, but the predicates and the operations themselves are.

@theroggy
Contributor

theroggy commented Apr 27, 2022

Dissolve is indeed a pain to get both fast and scalable. It took me a while in geofileops as well to get it (a bit) right.

For overlays, I saw that clip already exists... but when I try it on my test case it always crashes due to lack of memory?
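
What I'm trying is roughly this (paths and the partition count are placeholders, and I'm assuming dask_geopandas.clip mirrors geopandas.clip):

```python
import dask_geopandas
import pyogrio

# Input layer, partitioned; the mask layer is read fully into memory.
ddf = dask_geopandas.from_geopandas(pyogrio.read_dataframe("input.gpkg"), npartitions=12)
mask = pyogrio.read_dataframe("clip_layer.gpkg")

# Clip every partition against the mask, then collect the result.
clipped = dask_geopandas.clip(ddf, mask).compute()
```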
