
Equivalent GeoTable abstraction for JS target #620

Open
H-Plus-Time opened this issue Apr 18, 2024 · 9 comments

Comments

@H-Plus-Time
Contributor

I may have played around with a wasm-bindgen'd equivalent of the python GeoTable (notes on that in a bit) - that's one of the goals of the JS target, yeah?

Alright, notes so far:

  1. It is definitely impossible to return a geometry column of a defined type to JS solely through wasm-bindgen, originating from a dyn ChunkedGeometryArrayTrait - a wrapper with per-geometry, fallible as_<geometry_type> methods is necessary. Maybe one day we'll get enums that are allowed to contain data :-/.
  2. Geospatial operations that always return the same geometry type are perfectly fine, both directly on the GeoTable and on what I'm calling UnknownGeometryVector for the moment. Type-hinting would be quite good for those.
  3. sjoins would probably be just as inscrutable as they usually are.

A sample of some of the usage I've gotten out of it:

const dataset = await new ParquetDataset(urls);
const table = await dataset.read(readOptions);
const basicMetrics = [table.centroid(), table.convexHull()];
const geometryColumn = table.geometry().as_polygon();
const directlyAccessedMetrics = [
  geometryColumn.centroid(), geometryColumn.convexHull()
];

One thing that could avoid the constant fooTable.geometry().as_polygon() unpleasantness is an ES6 Proxy along the lines of:

const proxiedTableInstance = new Proxy(tableInstance, {
    get(target, property) {
        if (property === 'geometry_column') {
            // Intercept calls to geometry_column() and automatically call as_<geometryType>()
            return (...args) => {
                const geometryObject = target.geometry_column(...args);
                const asMethod = `as_${target.geometryType}`;
                if (typeof geometryObject[asMethod] === 'function') {
                    return geometryObject[asMethod]();
                } else {
                    throw new Error(`Method ${asMethod} not found in geometry object.`);
                }
            };
        }
        return target[property];
    },
});

Spawning one of those on class/struct construction is, I'm pretty sure, doable.
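A minimal sketch of that "spawn on construction" idea, wrapping the proxy creation in a factory. The factory name and the mock table are mine, not real geoarrow-wasm API; it assumes the wasm wrapper exposes `geometryType` and `geometry_column`:

```javascript
// Hypothetical factory: wraps a freshly constructed table in the proxy,
// so callers only ever see the proxied instance. `geometryType` and
// `geometry_column` are assumed fields on the wasm wrapper.
function createProxiedTable(tableInstance) {
  return new Proxy(tableInstance, {
    get(target, property) {
      if (property === 'geometry_column') {
        return (...args) => {
          const geometryObject = target.geometry_column(...args);
          const asMethod = `as_${target.geometryType}`;
          if (typeof geometryObject[asMethod] !== 'function') {
            throw new Error(`Method ${asMethod} not found in geometry object.`);
          }
          return geometryObject[asMethod]();
        };
      }
      const value = target[property];
      // Bind ordinary methods so `this` still points at the real instance.
      return typeof value === 'function' ? value.bind(target) : value;
    },
  });
}

// Mock stand-in for a wasm table, purely to demonstrate the shape.
const mockWasmTable = {
  geometryType: 'polygon',
  geometry_column() {
    return { as_polygon: () => 'ChunkedPolygonArray' };
  },
};
console.log(createProxiedTable(mockWasmTable).geometry_column()); // 'ChunkedPolygonArray'
```

The binding in the default branch matters: wasm-bindgen methods read the instance pointer off `this`, so returning them unbound from a proxy would break them.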

@kylebarron
Member

kylebarron commented Apr 18, 2024

Thanks for the issue! I definitely welcome some thoughts on the API design.

I may have played around with a wasm-bindgen'd equivalent of the python GeoTable (notes on that in a bit) - that's one of the goals of the JS target, yeah?

I'm not sure if you've noticed, but I've renamed the underlying Rust struct to Table and was planning to also rename the Python class to Table as well. This enables a few things:

  • A Table is just an Arrow table. The only thing that marks it as spatial is the special metadata on the field.

  • The JS and Python Arrow definitions can be more modular and not tied to geospatial. This is why I have the https://github.com/kylebarron/arrow-wasm repo, as a building block library for any Arrow and Rust-based Wasm libraries, not just spatial. And I've been thinking about doing the same for Python. Certainly not with the scope of pyarrow, but as a lightweight library for operating with Arrow data and reading to and writing from Parquet and IPC. Then we'd bind to and reuse that library from our own spatial Python modules

  • A table works just the same with spatial and non-spatial data, and with multiple geometry columns. Say you're reading a non-standard Parquet file where geometry is encoded as WKT. That would fail with the current Table abstraction because it requires a known geometry column in GeoArrow representation. With a fully general Table representation, you could do

     table = read_parquet('path/to/file.parquet')
     new_table = table.append_column(parse_wkt(table.column('wkt_geometry')), 'geometry')

    new_table would then be inferred as having a geometry column because of the geoarrow.* tag on the new column.

    Actually the API here would be a bit more annoying because there's no way for a column (a Vec<Arc<dyn Array>>) to maintain its own field metadata. So more likely we'd need parse_wkt to return a tuple of ChunkedArray, Field.

    Actually thinking about this a bit more, this wouldn't be necessary (in Python) as long as append_column supports the PyCapsule interface, where the result of parse_wkt would be something like a ChunkedPointArray, which would then have its own __arrow_c_stream__ method, which includes the field metadata.

    It would be nice to have an equivalent in JS to the Python PyCapsule interface, but this is harder to do in JS because of the memory sandboxing between JS and Wasm. But it would be nice to solve this so that functions can transparently take in either existing wasm data or Arrow JS instances. I started the mocking/duck typing of Arrow JS here https://github.com/kylebarron/arrow-wasm/tree/main/src/arrow_js. That allows for directly copying Arrow JS into arrow-rs without going through IPC, and is kinda a reverse of https://github.com/kylebarron/arrow-js-ffi without having to implement writing to the C data interface from scratch.
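A very rough sketch of the "transparently take either" dispatch. Everything here is an assumption: `__wbg_ptr` is wasm-bindgen's internal pointer field (a heuristic, not a stable API), and the two converter callbacks are hypothetical stand-ins for the zero-copy and FFI paths:

```javascript
// Hypothetical dispatcher: route wasm-bindgen instances down the no-copy
// path, and Arrow JS Table-shaped objects down a copy-across-the-boundary
// path. `__wbg_ptr` is wasm-bindgen's internal pointer field (heuristic).
function coerceToWasmTable(input, { fromWasm, fromArrowJs }) {
  if (input && typeof input.__wbg_ptr === 'number') {
    return fromWasm(input); // already wasm-backed
  }
  if (input && input.schema && Array.isArray(input.batches)) {
    return fromArrowJs(input); // duck-typed Arrow JS Table
  }
  throw new TypeError('Expected a wasm Table or an Arrow JS Table');
}
```

The duck-typing leans on Arrow JS's `Table` exposing `schema` and `batches`, which is the same surface the arrow_js module in arrow-wasm reads from.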

  1. It is definitely impossible to return a geometry column of a defined type to JS solely through wasm-bindgen, originating from a dyn ChunkedGeometryArrayTrait - a wrapper with per-geometry, fallible as_<geometry_type> methods is necessary. Maybe one day we'll get enums that are allowed to contain data :-/.

It's not impossible if you do dynamic downcasting. E.g. in Python, the existing method on the table called def geometry(self) returns one of the concrete classes, either ChunkedPointArray or ChunkedLineStringArray etc. In JS we'd do similar to return PointVector etc.

See

Python::with_gil(|py| chunked_geometry_array_to_pyobject(py, chunked_geom_arr))
and
pub fn chunked_geometry_array_to_pyobject(
    py: Python,
    arr: Arc<dyn ChunkedGeometryArrayTrait>,
) -> PyGeoArrowResult<PyObject> {
    let py_obj = match arr.data_type() {
        GeoDataType::Point(_) => ChunkedPointArray(arr.as_ref().as_point().clone()).into_py(py),
        GeoDataType::LineString(_) => {
            ChunkedLineStringArray(arr.as_ref().as_line_string().clone()).into_py(py)
        }
        GeoDataType::Polygon(_) => {
            ChunkedPolygonArray(arr.as_ref().as_polygon().clone()).into_py(py)
        }
        GeoDataType::MultiPoint(_) => {
            ChunkedMultiPointArray(arr.as_ref().as_multi_point().clone()).into_py(py)
        }
        GeoDataType::MultiLineString(_) => {
            ChunkedMultiLineStringArray(arr.as_ref().as_multi_line_string().clone()).into_py(py)
        }
        GeoDataType::MultiPolygon(_) => {
            ChunkedMultiPolygonArray(arr.as_ref().as_multi_polygon().clone()).into_py(py)
        }
        GeoDataType::Mixed(_) => {
            ChunkedMixedGeometryArray(arr.as_ref().as_mixed().clone()).into_py(py)
        }
        GeoDataType::GeometryCollection(_) => {
            ChunkedGeometryCollectionArray(arr.as_ref().as_geometry_collection().clone())
                .into_py(py)
        }
        GeoDataType::WKB => ChunkedWKBArray(arr.as_ref().as_wkb().clone()).into_py(py),
        other => {
            return Err(GeoArrowError::IncorrectType(
                format!("Unexpected array type {:?}", other).into(),
            )
            .into())
        }
    };
    Ok(py_obj)
}

  2. Geospatial operations that always return the same geometry type are perfectly fine, both directly on the GeoTable and on what I'm calling UnknownGeometryVector for the moment. Type-hinting would be quite good for those.

If we do dynamic downcasting, this shouldn't be an issue.

  3. sjoins would probably be just as inscrutable as they usually are.

My expected sjoin API is

sjoin(left_table: Table, right_table: Table, left_geom: str | None, right_geom: str | None, how: "inner" | "outer") -> Table

If left_table and right_table only have one geometry column, then neither left_geom nor right_geom need to be provided.

@H-Plus-Time
Contributor Author

H-Plus-Time commented Apr 19, 2024

Oops, you're indeed right (I was stuck between a combination of tunnel vision on returning enums and juggling the exact sequence of casts to get Arc down to JsValue) - a match statement filled with $WasmVectorName(foo.as_ref().as_$geom_name().clone()).into() does the trick 🤦. Cool, I'll also override that method's type definition to be the union of the resulting Wasm vectors; that should sort it.

That sjoin API, that'd also include the predicate right? (within, intersects, etc).

@kylebarron
Member

That sjoin API, that'd also include the predicate right? (within, intersects, etc).

Yes, that too

@H-Plus-Time
Contributor Author

H-Plus-Time commented Apr 27, 2024

Alright, so agreed re naming ('Table'). There are a few scenarios for how I'd see this working, with varying degrees of re-use/integration and usability:

Option 1 - IO methods always return geoarrow_wasm::Table, with fallible geometry methods

The (very limited) spatial-related methods (presumably including one that details the geometry-containing columns: index, name, detected geometry type) would of course need good handling for the spatial-free case.

Option 2 - explicit geoarrow::Table construction from arrow_wasm::Table, mostly infallible geometry methods

IO methods return a JsValue, via an intermediate match-case conversion to geoarrow_wasm::table::Table or arrow_wasm::table::Table. This would require a way to convert between them in JS, and to deal with manual conversions, something like:

const plainTable = parquetFile.read() // just an arrow_wasm table at this point, no spatially aware methods available
const geometryColumn = wasm.parseWkt(plainTable.column('wkt_geometry')) // one of the geoarrow-wasm geometry vectors
const outputTable = wasm.Table.from(plainTable).append_column(geometryColumn, 'geometry') // geoarrow-wasm's Table, geometry methods will only produce meaningful results after the append_column.
// alternatively
const outputTable = wasm.Table.from_arrow_and_geometry(plainTable, {'geometry': geometryColumn}) // or something to that effect

Option 3 - arrow_wasm knowledge of geoarrow_wasm, bidirectional auto-conversion

A feature flag in arrow-wasm, adding a dependency on geoarrow-wasm (! cyclic dependency, though cargo might be okay with these 🤷), enabling as little as possible. This would be required for geoparquet-wasm (and sibling) IO methods to return arrow_wasm::Table results capable of inferring and auto-casting this:

const geometryColumn = wasm.parseWkt(plainTable.column('wkt_geometry'));
plainTable.append_column(geometryColumn, 'geometry');

to a geoarrow_wasm::Table, not an arrow_wasm::Table. The equivalent in reverse (outputTable.remove_column('geometry') which should downgrade a geoarrow_wasm::Table to arrow_wasm::Table) is much less of a problem since arrow_wasm is already a dependency.


Option 1 is the least trouble, though it provides the least confidence that a geoarrow_wasm::Table's spatial methods will actually do anything.

Option 2 is very close to geopandas' and pyogrio's approach: IO methods usually produce GeoDataFrames but can produce plain old pandas DataFrames (e.g. read_geometry=False, no geometry column found, etc.); pandas DataFrames can be consumed into GeoDataFrames explicitly (but are not auto-upgraded by adding a GeoSeries column to a pandas DataFrame); GeoDataFrames are implicitly downgraded once they lose their geometry column (by column slicing, drop, overwriting with WKB, etc.).

@H-Plus-Time
Contributor Author

It would be nice to have an equivalent in JS to the Python PyCapsule interface, but this is harder to do in JS because of the memory sandboxing between JS and Wasm

Agreed that it's harder to do for objects crossing memory sandboxes, but you are pretty likely to have parse_wkt and Table::append_column functions as part of the same wasm module - still very useful to be able to glue them together with JsCapsules, but no memory sandboxing problems. That specific form should be pretty doable.

@kylebarron
Member

I'm not inclined to have separate structs for a Table in multiple crates. My API surface is large enough that I don't want multiple objects where we might have to re-implement non-spatial table operations on both Table and a spatially-enabled Table, especially for the Python/JS bindings, where we can't use traits for the binding definitions.

I think my plan is none of the above. I don't want a geoarrow_wasm::Table to need to exist. A GeoArrow table is "just" an Arrow table with geoarrow metadata on one or more geometry columns.

We can loop through column metadata to look for the geometry column, or the user can pass in the column name/index. This is how the current geoarrow::table::geometry method works. It returns an Option<Arc<dyn ChunkedGeometryArrayTrait>>.

To be clear, the geoarrow::table::Table struct only exists in geoarrow because the arrow crate doesn't define a Table object, but for parity with Python and JS, I feel it's important for that concept to exist.

So I think all geoarrow-wasm IO methods should return an arrow-wasm::Table. And in this case I think having a top-level function geometry(table: Table, index?: null | string | number) -> Vector makes sense, where a user can easily access the geometry column of a table.

Essentially, this pushes us more towards a functional API that takes in a table as the first argument rather than a method API. This is even more useful in Python where we can interpret any other Arrow object with zero copy. So we want to work with any arrow object, whether it's one we define or one defined by pyarrow or gdal or polars or duckdb or whatever.
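A sketch of what that top-level geometry() lookup could do. The table/field shapes below are mocks of what arrow-wasm might expose, not its actual API; 'ARROW:extension:name' is the standard Arrow metadata key that GeoArrow's geoarrow.* extension names live under:

```javascript
// Sketch: resolve a table's geometry column either by explicit name/index,
// or by scanning field metadata for a geoarrow.* extension name. The table
// shape (schema.fields, column(i)) is a mock, not real arrow-wasm API.
function geometry(table, index = null) {
  const fields = table.schema.fields;
  if (index !== null) {
    const i = typeof index === 'number'
      ? index
      : fields.findIndex((f) => f.name === index);
    if (i < 0 || i >= fields.length) {
      throw new Error(`No column matching ${JSON.stringify(index)}`);
    }
    return table.column(i);
  }
  // Infer: require exactly one geoarrow.*-tagged column.
  const tagged = fields.flatMap((f, i) =>
    (f.metadata?.['ARROW:extension:name'] ?? '').startsWith('geoarrow.') ? [i] : []
  );
  if (tagged.length !== 1) {
    throw new Error(`Expected exactly one geometry column, found ${tagged.length}`);
  }
  return table.column(tagged[0]);
}

// Mock table with one plain column and one GeoArrow-tagged column.
const mockedTable = {
  schema: {
    fields: [
      { name: 'id', metadata: {} },
      { name: 'geometry', metadata: { 'ARROW:extension:name': 'geoarrow.point' } },
    ],
  },
  column(i) {
    return this.schema.fields[i].name; // stand-in for a real Vector
  },
};
console.log(geometry(mockedTable)); // 'geometry'
```

Erroring on zero or multiple tagged columns (rather than picking the first) keeps the inference predictable; the explicit name/index parameter is the escape hatch for multi-geometry tables.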


Thinking about the Pandas/GeoPandas split, I'd argue that the big distinction here is that Pandas didn't originally have first-class support for extension types. You weren't able to add associated metadata to columns or tables.

I'm really not a fan of geopandas and pyogrio behavior here. I feel strongly that there should be a statically inferred output type for any input function. Perhaps there should be two overloads that return different output types for a different combination of input types, but at least it's known at compile time. I'm not fond of returning two different Python object types.


p.s. I'm exhausted as I'm typing this so it's possible I didn't explain myself as well as I could've. Feel free to ask more questions

@kylebarron
Member

Option 1 is the least trouble, though it provides the least confidence that a geoarrow_wasm::Table's spatial methods will actually do anything.

At some level, this question is: is it worth the development overhead of special-casing tons of geospatial-specific table/array objects when it could alternately be represented as a generic array + metadata?

I'm also pondering this around our current approach of having different exports for every geometry array type. When we add Z/M dimension support, should we have PointArrayZ, PointArrayM, PointArrayZM exports as well? That's a huge amount of complexity.

E.g. on the Python side, I've been thinking a rust-python Arrow building block library like arrow-wasm might make sense. Then maybe we have generic Table, Array, ChunkedArray, Field and Schema exports from that module and only GeometryArray and ChunkedGeometryArray exports from geoarrow.rust.core. That would massively simplify exports (in theory maybe we don't even need a GeometryArray class if we have a purely-functional API) at the cost of not having as great static typing.

But at this point, I'm coming to believe that reducing maintenance complexity will be massively needed. I believe these GeoArrow-based projects are a multi-year endeavor (thinking ~5 years to maturity) and I need to ensure that I don't burn myself out. So reducing geometry-specific class exports will hopefully both improve interoperability and reduce API surface area.

@H-Plus-Time
Contributor Author

H-Plus-Time commented Apr 29, 2024

Excellent, that makes things much, much clearer.

Cool, so toy usage would more or less look like:

// named exports semi-fantasised
import { geometry, centroid, convexHull, sjoin } from 'geoparquet-wasm';
// mandatory functional
const foo = geometry(table);

// mixed functional/method-based, what we have today on things like PointVector, etc
const chained = foo.convexHull().centroid(); // semi-pointless, demonstrates method-chaining

// fully functional approach
const composed = centroid(convexHull(foo)); // isomorphic with the above, easy to use with functional composition

// mandatory functional
const intersectionTable = sjoin(table, otherTable, {how: 'inner', predicate: 'intersects'});

I need to ensure that I don't burn myself out

☝️ hard agree on this 😮‍💨

At some level, this question is: is it worth the development overhead of special-casing tons of geospatial-specific table/array objects when it could alternately be represented as a generic array + metadata?

I think cutting geospatial-specific tables is the right call, but the individual geometry type arrays are useful (and have the advantage of having already been implemented), and highly macro-able. There's also just enough variation between them (that is, a very small subset of relations/predicates that are undefined, nonsensical, or only defined on a given geometry type) to justify treating them separately.

Z/M dimension support

Yeah, independent Z, M, Z+M could be one that we throw up our hands and say "static typing on standard-ish geometry arrays is valuable enough to support. these? even with individually defined types, static typing is extremely likely to fail altogether when mixing them. Here's GeometryArrayZ, GeometryArrayM, GeometryArrayZM, best of luck".

@kylebarron
Member

Cool, so toy usage more or less look like:

Yes, absolutely.

I need to ensure that I don't burn myself out

☝️ hard agree on this 😮‍💨

🙏

I think cutting geospatial-specific tables is the right call, but the individual geometry type arrays are useful (and have the advantage of having already been implemented), and highly macro-able. There's also just enough variation between them (that is, a very small subset of relations/predicates that are undefined, nonsensical, or only defined on a given geometry type) to justify treating them separately.

I think that they will always be individual structs on the Rust side. There are definitely big pros and cons to whether the bindings should have individual array exports. It's also possible that there should be a single GeometryArray export that just stores an Arc<dyn GeometryArrayTrait> but then has additional types on the TypeScript side. I'm not sure.

Yeah, independent Z, M, Z+M could be one that we throw up our hands and say "static typing on standard-ish geometry arrays is valuable enough to support. these? even with individually defined types, static typing is extremely likely to fail altogether when mixing them. Here's GeometryArrayZ, GeometryArrayM, GeometryArrayZM, best of luck".

Yeah... it just feels weird to have PointArray but then GeometryArrayZ being an enum over types on the Rust side. But maybe that's a good option.
