Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load_table support for geospatial data #20

Open
niviksha opened this issue Sep 21, 2020 · 11 comments
Open

load_table support for geospatial data #20

niviksha opened this issue Sep 21, 2020 · 11 comments

Comments

@niviksha
Copy link

We need to add support in pymapd for geo data formats via load_table - minimally, to be able to pass geopandas dataframes to the load function.

The main thing is to make this performant since we'd want to do this using the columnar load path, ideally.

@xmnlab
Copy link
Contributor

xmnlab commented Sep 21, 2020

should we keep 'geopandas' as an optional dependency?

we added ´geopandas´ before and we need to move back and keep it just as an optional dependency because some problems with the packaging.

@mflaxman10
Copy link
Collaborator

In terms of code path/efficiency, we'd like to use the new BE "wkb" binary import if possible, and in any case get the geo import into one of the columnar loaders (row-wise is just too slow for practical use in most cases). I presume from geopandas this implies converting shapely binary to wkb with shapely's .wkb method.

@mflaxman10
Copy link
Collaborator

Is geopandas dependency still a packaging problem, even with the split? If so, then we might keep it optional. However if it works as its supposed to, it would be great to be able to rely on it.

@mflaxman10
Copy link
Collaborator

mflaxman10 commented Sep 21, 2020

Confirmed (9/21) that this code path does exist. So the issues are interface and performance.
The first interface issue is why I didn't discover this feature even when looking for it...the default load path converts geo features into OmniSci TEXT ENCODED DICT. In order to get the expected behavior of geo feature loading, manual DDL construction is required.

Second major issue is performance. Columnar geopandas shapely load is even slower than row-wise load of WKT:
Took 2.5 mins for 3m rows, versus 1.45 min for row-wise WKT, versus 12s for TEXT encoded to DICTs

The third issue is the strangest. The imported data schema looks as expected in Immerse (type POINT in my testing). But Immerse does not recognize the points in its point renderer. The column name does not appear in the selection list for latitude or longitude. However if a custom measure is made and pointed to the column name, Immerse renders the points as expected. Since Immerse doesn't report SRID, my only guess is that ST_POINT is created with the default SRID of 0 rather than being wrapped with ST_SETSRID.

@mflaxman10
Copy link
Collaborator

So, not sure if to make this into an epic, but three issues above may require separate resolution:

  1. investigate why/where default upload does not auto-detect as geo
  2. investigate performance, and in particular if WKB support in core can be leveraged to improve speed
  3. determine why loaded geometries don't detect as geo columns or render by default

@mflaxman10
Copy link
Collaborator

One related comment on performance side. Geopandas group is currently working to incorporate pygeos support. This is partially available in their version 0.8 - which is vectorized and optionally using pygeos/libgeos.

@mflaxman10
Copy link
Collaborator

Just some quick notes. @mikehinchey is testing a branch of pymapd which uses .wkb method instead of .wkt and reports that while it doesn't work with current release, does work on Master (because BE WKB not picked for last release). Correctness on basic geo types confirmed, performance testing now underway. May be able to get additional boost from load_table_arrow relative to load_table_binary, and both should in theory work.

Presuming we get at least some performance gain, we'll issue a PR for pymapd or pyomnisci (moving target!). Remaining concerns include render group creation for polygons, and single-threadedness of load_table methods.

Also, beyond this upgrade, we may want to confir with our Quansite colleagues to see if this same method could be used to speed up geo imports in Ibis.

@xmnlab
Copy link
Contributor

xmnlab commented Oct 22, 2020

@mflaxman10 thanks for the update! related to ibis, not sure, but I think that at least we will need this PR merged https://github.com/ibis-project/ibis/pull/2165/files ... feel free to add any extra review/comments there.

and let me know if we can help with anything else related to this issue! thanks!

@mflaxman10
Copy link
Collaborator

OK, initial performance testing numbers are back. Looks like a good speedup although not earth-shattering. Approximately 1.7x improvement. This may improve further when geopandas revamps its internal infrastructure to support libgeos.

@mikehinchey
Copy link
Contributor

A few notes:

  1. I used the geopandas property wkb_hex because the thrift api only accepts a string. I wonder if improved performance could be gained if the thrift api accepted a byte array.
  2. The 1.7x improvement was with method=columnar. I haven't tried method=arrow.
  3. We would need some sort of switch to use WKT if pymapd or pyomnisci should continue to work for 5.4.x DB.

@mflaxman10
Copy link
Collaborator

Ran into confusing error message when trying to load geopandas dataframe with columnar loader. Got ‘could not generate chunk key' Ultimately, this is trying to tell you "This method variant requires a pre-existing table, please generate one prior to loading"

Not sure if this should be a doc-fix, or if we can't just adapt the 'infer' methods in the case to work similarly to in base pandas dataframe case.

@mikehinchey mikehinchey transferred this issue from heavyai/pymapd Oct 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants