load_table support for geospatial data #20
Should we keep geopandas as an optional dependency? We added geopandas as a required dependency before, and we had to move it back to optional because of packaging problems.
In terms of code path/efficiency, we'd like to use the new backend (BE) WKB binary import if possible, and in any case get the geo import into one of the columnar loaders (row-wise is just too slow for practical use in most cases). From geopandas, I presume this means converting shapely geometries to WKB with shapely's .wkb property.
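For context on what that conversion produces: WKB for a 2-D point is just a byte-order flag, a geometry-type code, and the two coordinate doubles. A minimal stdlib sketch (the function name is illustrative; on a little-endian machine, shapely's `.wkb` yields the same bytes for a simple point):

```python
import struct

def point_to_wkb(x, y):
    # Little-endian WKB for a 2-D point: byte-order flag (1 = little-endian),
    # geometry type code (1 = Point), then the x and y doubles.
    return struct.pack("<BIdd", 1, 1, x, y)

wkb = point_to_wkb(-122.4, 37.8)
print(len(wkb))  # a 2-D point WKB is 21 bytes
```

With a GeoDataFrame, the per-geometry equivalent would be something like `gdf.geometry.apply(lambda g: g.wkb)`.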
Is the geopandas dependency still a packaging problem, even with the split? If so, then we might keep it optional. However, if it works as it's supposed to, it would be great to be able to rely on it.
Confirmed (9/21) that this code path does exist, so the remaining issues are interface and performance.

The second major issue is performance: a columnar geopandas/shapely load is even slower than a row-wise load of WKT.

The third issue is the strangest. The imported data schema looks as expected in Immerse (type POINT in my testing), but Immerse does not recognize the points in its point renderer: the column name does not appear in the selection list for latitude or longitude. However, if a custom measure is made and pointed at the column name, Immerse renders the points as expected. Since Immerse doesn't report the SRID, my only guess is that ST_POINT is created with the default SRID of 0 rather than being wrapped with ST_SETSRID.
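If that SRID guess is right, one workaround is to declare the SRID explicitly in the table DDL before loading, so the column doesn't fall back to SRID 0. A sketch with hypothetical table and column names, assuming OmniSciDB's `GEOMETRY(type, srid)` DDL form (4326 is WGS 84 lon/lat):

```python
# Hypothetical names; the GEOMETRY(POINT, 4326) column type pins the SRID
# at table-creation time instead of relying on the loader's default.
table, column = "points_test", "geom"
ddl = f"CREATE TABLE {table} ({column} GEOMETRY(POINT, 4326));"
print(ddl)
# A pymapd connection would then run: con.execute(ddl)
```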
So, not sure whether to make this into an epic, but the three issues above may require separate resolution.
One related comment on the performance side: the geopandas group is currently working to incorporate pygeos support. This is partially available in their version 0.8, which is vectorized and optionally uses pygeos/libgeos.
Just some quick notes. @mikehinchey is testing a branch of pymapd which uses the .wkb method instead of .wkt, and reports that while it doesn't work with the current release, it does work on master (because the backend WKB path was not picked up for the last release). Correctness on basic geo types is confirmed; performance testing is now underway. We may be able to get an additional boost from load_table_arrow relative to load_table_binary, and both should in theory work. Presuming we get at least some performance gain, we'll issue a PR for pymapd or pyomnisci (moving target!). Remaining concerns include render-group creation for polygons, and the single-threadedness of the load_table methods. Also, beyond this upgrade, we may want to confer with our Quansight colleagues to see if this same method could be used to speed up geo imports in Ibis.
@mflaxman10 thanks for the update! Related to Ibis, I'm not sure, but I think that at a minimum we will need this PR merged: https://github.com/ibis-project/ibis/pull/2165/files ... feel free to add any extra review/comments there. And let me know if we can help with anything else related to this issue! Thanks!
OK, initial performance testing numbers are back. It looks like a good speedup, although not earth-shattering: approximately a 1.7x improvement. This may improve further when geopandas revamps its internal infrastructure to support libgeos.
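The 1.7x figure is end-to-end load time. For anyone reproducing this, a stdlib-only harness (synthetic point values, illustrative names) can isolate just the client-side serialization cost of WKT string formatting versus WKB packing, which is one of the components that differs between the two paths:

```python
import struct
import timeit

points = [(float(i), float(i) + 0.5) for i in range(10_000)]

def as_wkt():
    # WKT: human-readable text, one string per geometry
    return [f"POINT ({x} {y})" for x, y in points]

def as_wkb():
    # WKB: fixed-size binary, one 21-byte blob per 2-D point
    return [struct.pack("<BIdd", 1, 1, x, y) for x, y in points]

wkt_s = timeit.timeit(as_wkt, number=20)
wkb_s = timeit.timeit(as_wkb, number=20)
print(f"WKT: {wkt_s:.3f}s  WKB: {wkb_s:.3f}s")
```

The overall speedup depends mostly on the backend import path, so numbers from this client-side step alone will not match the end-to-end 1.7x.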
A few notes:
Ran into a confusing error message when trying to load a geopandas dataframe with the columnar loader: 'could not generate chunk key'. Ultimately, this is trying to tell you "this method variant requires a pre-existing table, please create one prior to loading." Not sure if this should be a doc fix, or whether we can just adapt the 'infer' methods in this case to work similarly to the base pandas dataframe case.
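A sketch of the workaround that error is hinting at, assuming a pymapd connection `con` and a dataframe `df` whose geometry is already serialized; table and column names are hypothetical, and the `method`/`create` parameters follow pymapd's `load_table` signature:

```python
def load_points(con, df):
    # The columnar code path expects the target table to already exist,
    # so create it explicitly instead of letting load_table infer a schema.
    con.execute("CREATE TABLE trips (pickup GEOMETRY(POINT, 4326));")
    con.load_table("trips", df, method="columnar", create=False)
```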
We need to add support in pymapd for geo data formats via load_table — minimally, to be able to pass geopandas dataframes to the load function. The main thing is to make this performant, since we'd ideally want to do this using the columnar load path.