load_table support for geospatial data #20

niviksha · 2020-09-21T17:56:01Z

We need to add support in pymapd for geo data formats via load_table - minimally, to be able to pass geopandas dataframes to the load function.

The main thing is to make this performant since we'd want to do this using the columnar load path, ideally.

The text was updated successfully, but these errors were encountered:

xmnlab · 2020-09-21T18:02:48Z

should we keep 'geopandas' as an optional dependency?

we added ´geopandas´ before and we need to move back and keep it just as an optional dependency because some problems with the packaging.

mflaxman10 · 2020-09-21T18:10:40Z

In terms of code path/efficiency, we'd like to use the new BE "wkb" binary import if possible, and in any case get the geo import into one of the columnar loaders (row-wise is just too slow for practical use in most cases). I presume from geopandas this implies converting shapely binary to wkb with shapely's .wkb method.

mflaxman10 · 2020-09-21T18:12:17Z

Is geopandas dependency still a packaging problem, even with the split? If so, then we might keep it optional. However if it works as its supposed to, it would be great to be able to rely on it.

mflaxman10 · 2020-09-21T19:53:13Z

Confirmed (9/21) that this code path does exist. So the issues are interface and performance.
The first interface issue is why I didn't discover this feature even when looking for it...the default load path converts geo features into OmniSci TEXT ENCODED DICT. In order to get the expected behavior of geo feature loading, manual DDL construction is required.

Second major issue is performance. Columnar geopandas shapely load is even slower than row-wise load of WKT:
Took 2.5 mins for 3m rows, versus 1.45 min for row-wise WKT, versus 12s for TEXT encoded to DICTs

The third issue is the strangest. The imported data schema looks as expected in Immerse (type POINT in my testing). But Immerse does not recognize the points in its point renderer. The column name does not appear in the selection list for latitude or longitude. However if a custom measure is made and pointed to the column name, Immerse renders the points as expected. Since Immerse doesn't report SRID, my only guess is that ST_POINT is created with the default SRID of 0 rather than being wrapped with ST_SETSRID.

mflaxman10 · 2020-09-22T04:15:24Z

So, not sure if to make this into an epic, but three issues above may require separate resolution:

investigate why/where default upload does not auto-detect as geo
investigate performance, and in particular if WKB support in core can be leveraged to improve speed
determine why loaded geometries don't detect as geo columns or render by default

mflaxman10 · 2020-09-22T16:10:53Z

One related comment on performance side. Geopandas group is currently working to incorporate pygeos support. This is partially available in their version 0.8 - which is vectorized and optionally using pygeos/libgeos.

mflaxman10 · 2020-10-22T00:39:39Z

Just some quick notes. @mikehinchey is testing a branch of pymapd which uses .wkb method instead of .wkt and reports that while it doesn't work with current release, does work on Master (because BE WKB not picked for last release). Correctness on basic geo types confirmed, performance testing now underway. May be able to get additional boost from load_table_arrow relative to load_table_binary, and both should in theory work.

Presuming we get at least some performance gain, we'll issue a PR for pymapd or pyomnisci (moving target!). Remaining concerns include render group creation for polygons, and single-threadedness of load_table methods.

Also, beyond this upgrade, we may want to confir with our Quansite colleagues to see if this same method could be used to speed up geo imports in Ibis.

xmnlab · 2020-10-22T14:57:16Z

@mflaxman10 thanks for the update! related to ibis, not sure, but I think that at least we will need this PR merged https://github.com/ibis-project/ibis/pull/2165/files ... feel free to add any extra review/comments there.

and let me know if we can help with anything else related to this issue! thanks!

mflaxman10 · 2020-10-22T22:36:05Z

OK, initial performance testing numbers are back. Looks like a good speedup although not earth-shattering. Approximately 1.7x improvement. This may improve further when geopandas revamps its internal infrastructure to support libgeos.

mikehinchey · 2020-10-22T23:08:54Z

A few notes:

I used the geopandas property wkb_hex because the thrift api only accepts a string. I wonder if improved performance could be gained if the thrift api accepted a byte array.
The 1.7x improvement was with method=columnar. I haven't tried method=arrow.
We would need some sort of switch to use WKT if pymapd or pyomnisci should continue to work for 5.4.x DB.

mflaxman10 · 2021-02-26T00:44:28Z

Ran into confusing error message when trying to load geopandas dataframe with columnar loader. Got ‘could not generate chunk key' Ultimately, this is trying to tell you "This method variant requires a pre-existing table, please generate one prior to loading"

Not sure if this should be a doc-fix, or if we can't just adapt the 'infer' methods in the case to work similarly to in base pandas dataframe case.

mikehinchey transferred this issue from heavyai/pymapd Oct 20, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

load_table support for geospatial data #20

load_table support for geospatial data #20

niviksha commented Sep 21, 2020

xmnlab commented Sep 21, 2020

mflaxman10 commented Sep 21, 2020

mflaxman10 commented Sep 21, 2020

mflaxman10 commented Sep 21, 2020 •

edited

mflaxman10 commented Sep 22, 2020

mflaxman10 commented Sep 22, 2020

mflaxman10 commented Oct 22, 2020

xmnlab commented Oct 22, 2020

mflaxman10 commented Oct 22, 2020

mikehinchey commented Oct 22, 2020

mflaxman10 commented Feb 26, 2021

load_table support for geospatial data #20

load_table support for geospatial data #20

Comments

niviksha commented Sep 21, 2020

xmnlab commented Sep 21, 2020

mflaxman10 commented Sep 21, 2020

mflaxman10 commented Sep 21, 2020

mflaxman10 commented Sep 21, 2020 • edited

mflaxman10 commented Sep 22, 2020

mflaxman10 commented Sep 22, 2020

mflaxman10 commented Oct 22, 2020

xmnlab commented Oct 22, 2020

mflaxman10 commented Oct 22, 2020

mikehinchey commented Oct 22, 2020

mflaxman10 commented Feb 26, 2021

mflaxman10 commented Sep 21, 2020 •

edited