-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Equivalent GeoTable abstraction for JS target #620
Comments
Thanks for the issue! I definitely welcome some thoughts on the API design.
I'm not sure if you've noticed, but I've renamed the underlying Rust struct to
It's not impossible if you do dynamic downcasting. E.g. in Python, the existing method on the table called See geoarrow-rs/python/core/src/table/mod.rs Line 24 in 42aa07b
geoarrow-rs/python/core/src/ffi/to_python/array.rs Lines 114 to 152 in 42aa07b
If we do dynamic downcasting, this shouldn't be an issue.
My expected sjoin API is
If |
Oops, you're indeed right (I was stuck between a combination of tunnel vision on returning enums and juggling the exact sequence of castings to get Arc down to JsValue) - a match statement filled with That sjoin API, that'd also include the predicate right? (within, intersects, etc). |
Yes, that too |
Alright, so agreed re naming ('Table'), there's a few scenarios for how I'd see this working, with varying degrees of re-use/integration and usability: Option 1 - IO methods always return geoarrow_wasm::Table, with fallible geometry methodsThe (very limited) spatial-related methods (presumably including one that details the geometry containing columns (index, name, detected geometry type) would need good handling for the spatial-free case of course. Option 2 - explicit geoarrow::Table construction from arrow_wasm::Table, mostly infallible geometry methodsIO methods return a JsValue, via an intermediate match-case conversion to geoarrow_wasm::table::Table or arrow_wasm::table::Table - this would require a way to convert between them in JS and deal with manual conversions, with something like: const plainTable = parquetFile.read() // just an arrow_wasm table at this point, no spatially aware methods available
const geometryColumn = wasm.parseWkt(plainTable.column('wkt_geometry')) // one of the geoarrow-wasm geometry vectors
const outputTable = wasm.Table.from(plainTable).append_column(geometryColumn, 'geometry') // geoarrow-wasm's Table, geometry methods will only produce meaningful results after the append_column.
// alternatively
const outputTable = wasm.Table.from_arrow_and_geometry(plainTable, {'geometry': geometryColumn}) // or something to that effect Option 3 - arrow_wasm knowledge of geoarrow_wasm, bidirectional auto-conversionA feature-flag in arrow-wasm, adding a dependency on geoarrow-wasm (! cyclic dependency, cargo might be okay with these 🤷 ), enabling as little as possible. This would be required for geoparquet-wasm (and sibling) IO methods to return arrow_wasm::Table results that are capable of inferring and auto-casting this: const geometryColumn = wasm.parseWkt(plainTable.column('wkt_geometry');
plainTable.append_column(geometryColumn, 'geometry'); to a geoarrow_wasm::Table, not an arrow_wasm::Table. The equivalent in reverse ( Option 1 is the least trouble, though it provides the least confidence that a geoarrow_wasm::Table's spatial methods will actually do anything. Option 2 is very close to geopandas' and pyogrio's approach - IO methods usually produce GeoDataframes but can produce plain old pandas dataframes (e.g. read_geometry=False, no geometry column found, etc), pandas dataframes can be consumed into GeoDataframes explicitly (but are not auto-upgraded by adding a GeoSeries column to a pandas DataFrame), GeoDataframes are implicitly downgraded once they lose their geometry column (by column slicing, drop, overwriting with wkb, etc). |
Agreed that it's harder to do for objects crossing memory sandboxes, but you are pretty likely to have parse_wkt and Table::append_column functions as part of the same wasm module - still very useful to be able to glue them together with JsCapsules, but no memory sandboxing problems. That specific form should be pretty doable. |
I'm not inclined to have separate structs for a Table in multiple crates. My API surface is large enough, that I don't want to have multiple objects, where we might have to re-implement non-spatial table operations on both I think my plan is none of the above. I don't want a We can loop through column metadata to look for the geometry column, or the user can pass in the column name/index. This is how the current To be clear, the So I think all geoarrow-wasm IO methods should return an Essentially, this pushes us more towards a functional API that takes in a table as the first argument rather than a method API. This is even more useful in Python where we can interpret any other Arrow object with zero copy. So we want to work with any arrow object, whether it's one we define or one defined by pyarrow or gdal or polars or duckdb or whatever. Thinking about the Pandas/GeoPandas split, I'd argue that the big distinction here is that Pandas didn't originally have first-class support for extension types. You weren't able to add associated metadata to columns or tables. I'm really not a fan of geopandas and pyogrio behavior here. I feel strongly that there should be a statically inferred output type for any input function. Perhaps there should be two overloads that return different output types for a different combination of input types, but at least it's known at compile time. I'm not fond of returning two different Python object types. p.s. I'm exhausted as I'm typing this so it's possible I didn't explain myself as well as I could've. Feel free to ask more questions |
At some level, this question is: is it worth the development overhead of special-casing tons of geospatial-specific table/array objects when it could alternately be represented as a generic array + metadata? I'm also pondering this around our current approach of having different exports for every geometry array type. When we add Z/M dimension support, should we have E.g. on the Python side, I've been thinking a rust-python Arrow building block library like arrow-wasm might make sense. Then maybe we have generic But at this point, I'm coming to believe that reducing maintenance complexity will be massively needed. I believe these GeoArrow-based projects are a multi-year endeavor (thinking ~5 years to maturity) and I need to ensure that I don't burn myself out. So reducing geometry-specific class exports will hopefully both improve interoperability and reduce API surface area. |
Excellent, that makes things much, much clearer. Cool, so toy usage more or less look like: // named exports semi-fantasised
import { geometry, centroid, convexHull, sjoin } from 'geoparquet-wasm';
// mandatory functional
const foo = geometry(table);
// mixed functional/method-based, what we have today on things like PointVector, etc
const intermediate = foo.convexHull().centroid() // semi-pointless, demonstrates method-chaining.
// fully functional approach
const intermediate = centroid(convexHull(foo)); // isomorphic with the above, easy to use with functional composition.
// mandatory functional
const intersectionTable = sjoin(table, otherTable, {how: 'inner', predicate: 'intersects'});
☝️ hard agree on this 😮💨
I think cutting geospatial-specific tables is the right call, but the individual geometry type arrays are useful (and have the advantage of having already been implemented), and highly macro-able. There's also just enough variation between them (that is, a very small subset of relations/predicates that are undefined, nonsensical, or only defined on a given geometry type) to justify treating them separately.
Yeah, independent Z, M, Z+M could be one that we throw up our hands and say "static typing on standard-ish geometry arrays is valuable enough to support. these? even with individually defined types, static typing is extremely likely to fail altogether when mixing them. Here's GeometryArrayZ, GeometryArrayM, GeometryArrayZM, best of luck". |
Yes, absolutely.
🙏
I think that they will always be individual structs on the Rust side. Definitely big pros and cons to whether the bindings should have individual array exports. It's also possible that there should be a single
yeah... just feels weird to have |
I may have played around with a wasm-bindgen'd equivalent of the python GeoTable (notes on that in a bit) - that's one of the goals of the JS target, yeah?
Alright, notes so far:
as_<geometry_type>
methods is necessary. Maybe one day we'll get enums that are allowed to contain data :-/.A sample of some of the usage I've gotten out of it:
One thing that could avoid the constant
fooTable.geometry().as_polygon()
unpleasantness is an ES6 Proxy along the lines of:Spawning one of those on class/struct construction, that's one I'm pretty sure is doable.
The text was updated successfully, but these errors were encountered: