Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experimentally support CRSIndex #588

Open
dcherian opened this issue Oct 14, 2022 · 21 comments
Open

Experimentally support CRSIndex #588

dcherian opened this issue Oct 14, 2022 · 21 comments
Labels
proposal Idea for a new feature.

Comments

@dcherian
Copy link

dcherian commented Oct 14, 2022

It'd be nice to experiment some with this prototype CRSIndex (though there are still bugs and work to do)

We could experiment with optionally adding such an index through rioxarray.open_dataset(..., use_crsindex=True) and then get some experience with it before refactoring it out to a ecosystem package like geoxarray

Reprojection seems like a high value application where assigning a new CRSIndex would be quite impactful.

Is there an existing example that might benefit a lot from propagating CRS information automatically?

cc @djhoese

Original Xarray Issue - "Add CRS/projection information to xarray objects"

@snowman2
Copy link
Member

Thanks for making the prototype. That looks pretty nice.

Do you think it would be better to wait before officially adding it to rioxarray so some wrinkles can be ironed out first? Or do you think that adding it to rioxarray will help the development process for indexes in xarray?

Thoughts:

  • Add the CRSIndex in rioxarray in the rio.write_crs method. That is used everywhere and should propagate to all of the rioxarray methods that use the CRS.
  • As an initial implementation, we could start with warnings instead of errors. This way, we aren't preventing operations from working that should work if the user really knows what they are doing. Later on, as we gain more experience with indexes and feel more confident in the implementation, we can update to exceptions.
  • We may want to ignore axis order when comparing the CRS with the ignore_axis_order kwarg.

@dcherian
Copy link
Author

adding it to rioxarray will help the development process for indexes in xarray?

I think this will be a very useful experiment.

dd the CRSIndex in rioxarray in the rio.write_crs method. That is used everywhere and should propagate to all of the rioxarray methods that use the CRS.

IMO it would be good to try this and see what fails in the test suite. That would be a more exhaustive test than my suggestion to pick a single "workflow" example./

@snowman2
Copy link
Member

Sounds like a good plan to me.

Anyone who would like to try implementing the CRSIndex and adding it to rio.write_crs is welcome to, If you do get started, please post a comment here so we can coordinate our efforts. I will make sure to post in the thread if I get some free time to work on it.

@scottyhq
Copy link
Contributor

scottyhq commented May 1, 2023

Anyone who would like to try implementing the CRSIndex and adding it to rio.write_crs is welcome to, If you do get started, please post a comment here so we can coordinate our efforts.

@snowman2 we've finally been making some headway trying this out. Currently testing things on my fork in this PR scottyhq#1, and borrowing from @benbovy's linked xvec implementation and @dcherian's prototype notebook. The example notebook in the PR highlights some use-cases and goals (https://github.com/scottyhq/rioxarray/blob/crsindex/docs/examples/crs_index.ipynb).

@snowman2
Copy link
Member

snowman2 commented May 1, 2023

Nice 👍

My first thought is to make use_crs_index=True/False a global option ref. Another idea was to add a crs_index_error=True/False global option to raise an exception if enabled and the CRS are different, otherwise raise a warning.

@dcherian
Copy link
Author

dcherian commented May 1, 2023

make use_crs_index=True/False a global option

Could this screw up a downstream package though? Or maybe we want that to shake out all the bugs :P

crs_index_error=True/False global option to raise an exception if enabled and the CRS are different,

It may be prudent to be strict now and loosen things later. If fixing the error is simply .rio.write_crs(...), then the error message could print out whats needed.

@snowman2
Copy link
Member

snowman2 commented May 2, 2023

Could this screw up a downstream package though?

It works as a context manager and so it can be applied locally as well. Though, the current strategy would work as well. And as you mentioned, it would be good to have a way to do a thorough bug search.

@snowman2
Copy link
Member

snowman2 commented May 3, 2023

It may be prudent to be strict now and loosen things later.

Since it is optionally enabled (and off by default), being strict is fine.

@dcherian
Copy link
Author

FYI @scottyhq and I are thinking of hacking on this during the SciPy sprints. let us know if anyone interested in this stuff will be around

@benbovy
Copy link

benbovy commented Sep 19, 2023

I'm thinking about different use cases that would all benefit from an Xarray CRS-aware index but where the kind of index would greatly differ from one case to another:

Instead of duplicating the CRS-related logic in all those indexes, I'm wondering if alternatively we could provide somewhere a CRS "meta" index so that we can combine it with other indexes like CRSIndex(GeometryIndex), CRSIndex(RasterIndex), CRSIndex(KDTreeIndex), etc (not class inheritance but rather composition). The basic idea is to handle all CRS stuff in the index wrapper and delegate the rest to the wrapped index. Together with @keewis we are currently experimenting on a unit-aware (pint) index using that approach: xarray-contrib/pint-xarray#163 (WIP). It seems to work well.

Some open questions:

  • Is is really possible to have a CRSIndex that is agnostic of the type of the index object it wraps? Dealing with the various ways to represent the CRS may be challenging
  • Where should we implement such CRSIndex? https://github.com/geoxarray/geoxarray? How to avoid cyclic dependencies?

cc @djhoese @martinfleis

@benbovy
Copy link

benbovy commented Sep 19, 2023

Is is really possible to have a CRSIndex that is agnostic of the type of the index object it wraps?

It is not that easy, actually. We need to get the actual CRS information somewhere in order to build the CRSIndex. It is more manageable if each package (xvec, rioxarray, etc.) takes care of that rather than having a common package trying to support all different ways / standards to get that information. Also it is possible that the wrapped index needs to access the CRS for some computation (for example, would it be useful to have something like GeoBox of https://github.com/opendatacube/odc-geo embed in a raster index?)

@djhoese
Copy link
Contributor

djhoese commented Sep 19, 2023

Forgive anything that is obvious to you and isn't to me about xarray Index objects. I've never created one and have never used a custom one.

I like your layout of special types of indexes and wrapping (decorating?) them with CRS information. I assume the main reason an index like these would need to know about the CRS is to allow or disallow or adapt to two xarray objects being merged? Like one of the examples in https://github.com/scottyhq/rioxarray/blob/crsindex/docs/examples/crs_index.ipynb where it shows doing + between two Datasets that have different CRSes and getting an error. Right? Besides that, is there anything that the Index (and only the Index) needs to do with the CRS?

You mention odc-geo's GeoBox (or other shapely-like geometry objects) being embedded in the Index. What does an Index do that would mean it needs access to that versus a odc-geo or geoxarray accessor having a property or method that produces a GeoBox?

Also you mention cyclic dependencies. Are you saying that if CRSIndex when into geoxarray that would lead to a cyclic dependency with rioxarray?

@martinfleis
Copy link

I find the CRSIndex(GeometryIndex), CRSIndex(RasterIndex), CRSIndex(KDTreeIndex) logic a bit tricky. Because a user only sees CRSIndex, not what it is subclassing. So if we have some methods on the GeometryIndex but they are not on the KDTreeIndex, the two classes called CRSIndex would behave differently.

If we should do subclassing, then maybe the other way? That would allow us to have a common subset of functionality dealing with CRS and custom index handling on top of that. The core class may not even be an index but some form of a CRSHandler that is a bit like pyproj.crs.CRS class but potentially detached from PROJ.

@benbovy
Copy link

benbovy commented Sep 19, 2023

Forgive anything that is obvious to you and isn't to me about xarray Index objects. I've never created one and have never used a custom one.

No worries at all! This is still very experimental and there's not much documentation or reading material about how to create and use them. TBH I also have no clear proposal for a "reusable" CRSIndex yet and I might be missing obvious CRS or other domain-specific things.

I assume the main reason an index like these would need to know about the CRS is to allow or disallow or adapt to two xarray objects being merged?

Yes I guess that's one of the main reasons. It could be used also for validating inputs given to Dataset (DataArray) .sel or even re-project on-the-fly, although I don't think that the latter is a good idea. Maybe other use cases? For example, it is possible to have fine control on how the coordinates are propagated when performing operations on Datasets (DataArrays) via Index.create_variables.

What does an Index do that would mean it needs access to that versus a odc-geo or geoxarray accessor having a property or method that produces a GeoBox?

That is the question, actually. I don't know if/how a GeoBox could be used by an index. Support data selection by passing world coordinates and convert it as pixel coordinates (inverse affine transformation)? Support more advanced alignment behavior than allow / disallow?

About index vs. accessor, I think it relates to the more general question of whether a GeoBox is unique to a Dataset (DataArray) or is more closely related to a subset of coordinates (and to a less extent how hard / expensive it is to rebuild from scratch).

I find the CRSIndex(GeometryIndex), CRSIndex(RasterIndex), CRSIndex(KDTreeIndex) logic a bit tricky. Because a user only sees CRSIndex, not what it is subclassing.

Sorry the CRSIndex(IndexType) notation was confusing. I meant a CRSIndex encapsulating another Xarray index, not subclassing.

@djhoese
Copy link
Contributor

djhoese commented Sep 21, 2023

It could be used also for validating inputs given to Dataset (DataArray) .sel or even re-project on-the-fly, although I don't think that the latter is a good idea. Maybe other use cases? For example, it is possible to have fine control on how the coordinates are propagated when performing operations on Datasets (DataArrays) via Index.create_variables.

What type of validation of inputs did you have in mind for .sel? I could see units being checked, but I suppose that's more in the realm of your pint-enabled indexes and as long as the pint index uses the same units as the CRS then compatibility might come "for free". Otherwise in my experience if coordinates are outside the projection space they would be NaN or wouldn't be within the range of the index values anyway. Anyway, the merging of two incompatible CRS-based DataArrays is a good enough reason for me I think...assuming implementing this isn't very hard.

About index vs. accessor, I think it relates to the more general question of whether a GeoBox is unique to a Dataset (DataArray) or is more closely related to a subset of coordinates (and to a less extent how hard / expensive it is to rebuild from scratch).

If an Index is used by multiple xarray objects, is it the same instance of that index or does it get copied? I see a GeoBox or any other type of polygon as being a representation of the data coordinates, but separate from the Index. You could have an Index that uses a GeoBox internally I suppose. You could also have shapely polygons produced in various CRSes which is why I felt like you need access to the full DataArray coordinates and possibly attrs. Seems more accessor-y than index-y, but yeah maybe not familiar enough to comment beyond how this "feels" to me.

@benbovy
Copy link

benbovy commented Sep 26, 2023

What type of validation of inputs did you have in mind for .sel?

CRS validation, assuming that the input labels passed to .sel can be in the form of DataArrays each with (or without) a CRS attached to it.

I see a GeoBox or any other type of polygon as being a representation of the data coordinates, but separate from the Index.

Yes, although it actually often serves as a rough index itself.

You could also have shapely polygons produced in various CRSes which is why I felt like you need access to the full DataArray coordinates and possibly attrs. Seems more accessor-y than index-y, but yeah maybe not familiar enough to comment beyond how this "feels" to me.

I'll try to explain why this feels index-y to me (sorry in-advance if I'm repeating things for which you are already familiar). Beyond what we can expect from an "index", a custom Xarray Index is an arbitrary object that is closely tied to one or more coordinates and that offers some additional flexibility and benefit:

  • We can directly attach to the Index any additional representation of the data coordinates without relying on some workarounds like duplicating metadata across x/y, lat/lon, etc. coordinates or storing metadata in another dummy coordinate like "spatial_ref". This representation can be any arbitrary object, which is generally more flexible than the coordinate attribute values. Of course, at some point we'll likely need to convert this representation to/from a more serializable form via the coordinates and their attributes, but it is often more convenient to handle an arbitrary object designed for a specific purpose than trying to fight with a more generic but limited data model.
  • Via its API, an Index provides full control on how it is propagated (passed, copied, dropped, etc.) from one DataArray (Dataset) to another. Special cases may be handled more easily than relying on Xarray built-in rules for propagating attributes. Unlike a DataArray accessor, we don't need to rebuild the arbitrary object(s) for each new DataArray. In addition, an Index provides full control on how its coordinates are propagated, via Index.create_variables(), which is useful if we still want to have extra information always kept in-sync in coordinate attributes.

If a representation (object) is specific to a whole DataArray (Dataset) and is cheap to re-build, then an accessor may be a good solution. If the representation is more specific to a subset of coordinates, then a custom Xarray index for those coordinates seems the right approach IMO. This doesn't prevent accessing that object from an accessor, though. I'd rather see a GeoBox encapsulated in a geo Index and accessed via an accessor than the other way around... Accessors are good for extending the DataArray (Dataset) API but not so good for extending its data model.

@djhoese
Copy link
Contributor

djhoese commented Sep 26, 2023

I'd rather see a GeoBox encapsulated in a geo Index and accessed via an accessor than the other way around... Accessors are good for extending the DataArray (Dataset) API but not so good for extending its data model.

I like this point.

I see from the xarray docs that we could have a meta index (or maybe there are other ways too) so we can have access to both x and y coordinates in a single CRS-based index. I could see this being important for having internal polygons or an index that represents the coordinates in a different CRS (or allows for inputs from other CRSes).

In your original recent comment from last week you said:

Dealing with the various ways to represent the CRS may be challenging

I'm not sure we ever exactly addressed this. Your later comment said:

It is not that easy, actually. We need to get the actual CRS information somewhere in order to build the CRSIndex. It is more manageable if each package (xvec, rioxarray, etc.) takes care of that rather than having a common package trying to support all different ways / standards to get that information

What various ways of representing the CRS were you thinking? I think rioxarray and geoxarray and other libraries should all (hopefully) be depending on pyproj's CRS object. Any conversion from various ways to represent a CRS should be handled by that. The individual packages and their potential engines or accessors may create the CRS index that is then provided to an index, but I don't see a reason (yet) why that CRS creation can't be part of a common package (geoxarray). And at that point if this common library is being depended on to create the CRS then it should probably just have a utility (accessor or function or other) for adding the CRS-based index for you with the CRS that it parses/loads.

@benbovy
Copy link

benbovy commented Sep 26, 2023

Sorry that was confusing!

I think rioxarray and geoxarray and other libraries should all (hopefully) be depending on pyproj's CRS object.

Agreed! Xvec uses pyproj's CRS too.

Let me try to illustrate my suggestion with an example.

CRSIndex encapsulates another arbitrary index and takes care of checking CRSes during alignment:

from typing import Any, Generic, Hashable, TypeVar

import xarray as xr
from pyproj import CRS
from xarray.indexes import Index


T = TypeVar("T", bound=Index)


class CRSIndex(Index, Generic[T]):
    
    def __init__(self, index: T, crs: Any):
        self._index = index
        self._crs = CRS.from_user_input(crs)
            
    def equal(self, other: Index):
        if not isinstance(other, CRSIndex):
            return False
        if self._crs != other._crs:
            return False
        return self._index.equals(other._index)
    
    def join(self, other: "CRSIndex[T]", how: str = "inner") -> "CRSIndex[T]":
        if self._crs != other._crs:
            raise ValueError

        index = self._index.join(other._index, how=how)
        return type(self)(index)

    def reindex_like(
        self, other: "CRSIndex[T]", method=None, tolerance=None
    ) -> dict[Hashable, Any]:
        if self._crs != other._crs:
            raise ValueError

        return self._index.reindex_like(
            other._index, method=method, tolerance=tolerance
        )

We can couple that with an accessor to provide a convenient way to set a CRS index (this is what xvec already does for GeometryIndex):

@xr.register_dataarray_accessor("crs")
class CRSAccessor:
    
    def __init__(self, xarray_obj: xr.Dataset | xr.DataArray):
        self._obj = xarray_obj
    
    def set_crs(self, coord_name: Hashable, crs: CRS):
        crs_index = CRSIndex(self._obj.xindexes[coord_name], crs)
        new_coords = xr.Coordinates(
            {coord_name: self._obj[coord_name].variable}, {coord_name: crs_index}
        )
        return self._obj.assign_coords(new_coords)

Let's create two dataarrays with an xvec.GeometryIndex (note: the latter already has support for CRS but here we intentionally not use it, i.e., crs=None).

import xvec

da1 = xr.DataArray(
    [1, 2],
    coords={"geom": [shapely.Point(1, 2), shapely.Point(3, 4)]},
    dims="geom",
)
da1 = da1.xvec.set_geom_indexes("geom")

da2 = xr.DataArray(
    [3],
    coords={"geom": [shapely.Point(1, 2)]},
    dims="geom",
)
da2 = da2.xvec.set_geom_indexes("geom")

xr.align(da1, da2)
# (<xarray.DataArray (geom: 1)>
#  array([1])
#  Coordinates:
#    * geom     (geom) object POINT (1 2)
#  Indexes:
#      geom     GeometryIndex (crs=None),
#  <xarray.DataArray (geom: 1)>
#  array([3])
#  Coordinates:
#    * geom     (geom) object POINT (1 2)
#  Indexes:
#      geom     GeometryIndex (crs=None))

Now lets wrap those indexes so they become CRS-aware:

da1_crs = da1.crs.set_crs("geom", 26915)
da2_crs = da2.crs.set_crs("geom", 3857)

xr.align(da1_crs, da2_crs)
# ValueError (CRS mismatch)

This is just an example, we could have used a RasterIndex, a KDTreeIndex, etc. instead of a GeometryIndex.

Now, the CRSIndex above doesn't do many things apart from checking if the CRSes match, so maybe it is not a big deal implementing the same logic for RasterIndex, GeometryIndex, etc.? CRSIndex stills provides a nice way to enable CRS-aware alignment via a uniform API (and maybe other features could be added).

Alternatively to explicitly providing the CRS via DataArray.crs.set_crs(), we could extract the CRS from the wrapped index, but this would need to have a common interface for all underlying index types.

@djhoese
Copy link
Contributor

djhoese commented Sep 26, 2023

I could see some functionality being dependent on if you're doing something in 2D or 3D in your projection space. I'm not sure that prevents both sets of functionality from living in one CRSIndex object, but it is one of the first things that came to mind.

Alternatively to explicitly providing the CRS via DataArray.crs.set_crs(), we could extract the CRS from the wrapped index, but this would need to have a common interface for all underlying index types.

Wouldn't that only require a .crs property on the Index? I'd hope that we could all agree on that.

@martinfleis
Copy link

I am still not convinced about this idea. It gets tricky as soon as you want to do something more complicated. What about re-projecting if setting the CRS happens on the CRSIndex level? Shall I do DataArray.crs.set_crs() to set it but DataArray.xvec.to_crs() to re-project to another CRS? I guess that should be (from a user point of view) DataArray.crs.to_crs() but at that point I don't know how that would interact with the wrapped index without additional restrictions on what a common API should look like.

It also feels a bit opaque to see that everything is a CRSIndex rather than specific custom solutions.

@benbovy
Copy link

benbovy commented Sep 27, 2023

I am still not convinced about this idea.

Yeah this idea has still to be more "battle tested" against possible uses cases. I think it is useful to have some way of adding CRS-aware behavior on top of a generic, non-geospatial index. For example, the xoak library will provide (once refactored) indexes that are useful for geospatial and other applications but I think it is outside of the scope of that library to provide any CRS support (and to depend on pyproj).

We could create somewhere else ad-hoc wrappers for each case, e.g., CRSKDTreeIndex(xoak.KDTreeIndex), at the risk that this might lead to a rapid increase of various xarray geospatial indexes in the wild with subtle differences in both API and behavior regarding how they handle the CRS. Are there other possible approaches that would mitigate that risk?

I don't know how that would interact with the wrapped index without additional restrictions on what a common API should look like.

Fair point. We would need to agree on a slightly larger protocol than just a .crs property for a wrapped index to interact with the CRS. Does it seem reasonable? Are there many operations other than set_crs and to_crs that should go through CRSIndex? For more specific CRS-aware operations, we could still leave that to the wrapped index.

What about re-projecting if setting the CRS happens on the CRSIndex level?

In such case, it is probably OK to either raise an error or re-project the coordinate data and then rebuild from scratch the wrapped index?

It also feels a bit opaque to see that everything is a CRSIndex rather than specific custom solutions.

I don't think that it is a big deal here. Users do (should) not interact directly with xarray Index objects. We can also make it more transparent through the CRSIndex (inline) repr, similarly to how we represent xarray coordinate or data variables wrapping (possibly nested) duck arrays. I'd be more concerned by how to discover and pass options to the wrapped index (e.g., pydata/xarray#8002).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal Idea for a new feature.
Projects
None yet
Development

No branches or pull requests

6 participants