
Environment details #5

Open
martinfleis opened this issue Mar 31, 2022 · 5 comments

Comments

@martinfleis

Hi, can you share a bit about the environment you used to test geopandas? I am especially interested in pygeos - is it installed? If not, I would recommend trying that as you will likely get a massive speedup.

Another suggestion would be to use dask-geopandas instead of vanilla geopandas, but I guess you already thought about that :). Since the timing also includes IO, you can also try replacing geopandas.read_file with pyogrio.read_dataframe.
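
A minimal sketch of the kind of swap I mean (file paths are placeholders):

```python
import pyogrio

# pyogrio reads the whole layer straight into a GeoDataFrame, bypassing the
# feature-by-feature fiona path that geopandas.read_file uses by default.
gdf = pyogrio.read_dataframe("input.gpkg")   # instead of geopandas.read_file("input.gpkg")

# ... processing ...

# The same applies on the way out.
pyogrio.write_dataframe(gdf, "output.gpkg")  # instead of gdf.to_file("output.gpkg", driver="GPKG")
```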

@theroggy
Contributor

> Hi, can you share a bit about the environment you used to test geopandas? I am especially interested in pygeos - is it installed? If not, I would recommend trying that as you will likely get a massive speedup.

Sure. The tests were run with pygeos installed...

> Another suggestion would be to use dask-geopandas instead of vanilla geopandas, but I guess you already thought about that :).

Yes! I saw a few hours ago that dask-geopandas 0.1.0 was released, so I already started on a dask-geopandas version :-).

> Since the timing also includes IO, you can also try replacing geopandas.read_file with pyogrio.read_dataframe.

I did some tests using pyogrio and it indeed makes a huge difference, especially for the writing part. I didn't use it in the benchmark because during my tests it didn't feel production-ready yet, and the pyogrio integration in geopandas isn't ready yet either.
However, because the difference is so big and really interesting to know about, I'll add a version of the geopandas benchmark that uses pyogrio for IO...

@theroggy
Contributor

theroggy commented Apr 1, 2022

I added a version of the geopandas benchmark that uses pyogrio for I/O... and as expected the time spent on the IO part was reduced significantly. Especially for the buffer operation this makes a huge difference, as the operation itself takes very little time...

  • read: from 43.5s -> 7.5s
  • buffer: stays 22s
  • write: from 256s -> 56.5s

For e.g. dissolve the impact is obviously smaller, as that operation needs a lot more processing time.

I also added a benchmark for dask-geopandas, but at the moment only for buffer. I have never used dask before, so I'll need to figure out how best to use it for the more interesting cases... I saw that the dask-geopandas manual has a specific section about dissolve... so I'll have a look at that...

Some remarks regarding the dask-geopandas buffer benchmark:

  • the buffer operation itself became even more negligible: 2.8s, so the 12 CPUs available are put to good use :-)
  • writing to Geopackage still takes the same 56.5s, so the lion's share of the 65s total. Writing to .parquet is a lot faster (7.5s) because the writing is parallelized into separate files per partition, so the same test with parquet as output takes ~18s for dask-geopandas since the files don't need to be merged into one file (see the sketch after this list).
  • the test doesn't support processing files too large to fit in memory yet, as the file is read/written in one go. The reading isn't parallel yet either, so there is still room for further performance improvement there as well...
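
For reference, the buffer test boils down to roughly the following sketch (paths, the partition count and the buffer distance are placeholders, and the single-file read is the non-parallel part mentioned above):

```python
import dask_geopandas
import pyogrio

# Read in one go (not parallel yet), then split into partitions.
gdf = pyogrio.read_dataframe("input.gpkg")
ddf = dask_geopandas.from_geopandas(gdf, npartitions=12)

# The buffer runs per partition, so all available cores are used.
ddf["geometry"] = ddf.geometry.buffer(10)

# Parquet output is written as one file per partition, so the write is parallel
# too; a single Geopackage output would still be one sequential write at the end.
ddf.to_parquet("buffered_parquet")
```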

@theroggy
Contributor

theroggy commented Apr 1, 2022

I added dissolve benchmarks for dask-geopandas now as well, but the results aren't great. Based on the documentation on how it works under the hood, it is probably also "normal" that it isn't faster than vanilla geopandas, as the unary_union operation is applied twice on all geometries, which is quite costly. Or... I did something stupid in the implementation; that's obviously also possible ;-).
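
For context, the dissolve benchmark is essentially a sketch like this (paths and the group column are placeholders, not necessarily the exact benchmark code):

```python
import dask_geopandas
import pyogrio

gdf = pyogrio.read_dataframe("input.gpkg")
ddf = dask_geopandas.from_geopandas(gdf, npartitions=12)

# Each partition is dissolved first and the partial results are then unioned
# again, which is where the double unary_union cost mentioned above comes from.
dissolved = ddf.dissolve(by="region").compute()
```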

Am I right that overlay operations aren't supported yet? In that case I can't implement the intersect benchmark yet...

@martinfleis
Author

martinfleis commented Apr 26, 2022

Yes, dissolve is not always faster; that is why the documentation includes an extra snippet with a faster option for in-memory data - https://dask-geopandas.readthedocs.io/en/stable/guide/dissolve.html#alternative-solution.

The dissolve implementation in dask-geopandas is designed to be scalable and distributable, so it can work out-of-core but at the cost of performance in some situations.

> Am I right that overlay operations aren't supported yet?

Correct, as in: overlay itself is not implemented, but the predicates and the operations themselves are.

@theroggy
Contributor

theroggy commented Apr 27, 2022

Dissolve is indeed a pain to get both fast and scalable. It took me a while in geofileops as well to get it (a bit) right.

For overlays, I saw that clip already exists... but when I try it on my test case it always crashes due to lack of memory?
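
What I'm trying is roughly this (paths and the partition count are placeholders, and I'm assuming dask_geopandas.clip mirrors geopandas.clip):

```python
import dask_geopandas
import pyogrio

# Input layer, partitioned; the mask layer is read fully into memory.
ddf = dask_geopandas.from_geopandas(pyogrio.read_dataframe("input.gpkg"), npartitions=12)
mask = pyogrio.read_dataframe("clip_layer.gpkg")

# Clip every partition against the mask, then collect the result.
clipped = dask_geopandas.clip(ddf, mask).compute()
```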
