Specifying a schema in terms of a Protocol #31

Open
rsokl opened this issue Apr 6, 2022 · 2 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers)


rsokl commented Apr 6, 2022

Hello! Thanks for making xarray-schema!

It would be great to be able to write an xarray schema in terms of a typing.Protocol. This would enable the schema to be used for both runtime and static validation. Let me describe my motivation here (it might already be obvious).

One challenge in designing a code base that passes around xarray arrays and datasets satisfying particular schemas is documenting which flavors of dataset a given function accepts. Furthermore, for complicated schemas especially, it is useful for static tools (type checkers and other IDE tools) to be able to tell a user which attributes do and do not exist on that xarray object.

I have leveraged protocols to tackle these issues. Consider the following protocol, which describes a dataset with the coordinates time and feature_component and the data variables features and temperatures:

from typing import Protocol

import xarray as xr

class DataSetA(Protocol):
    @property
    def time(self) -> xr.DataArray:
        """
        Coordinate, shape-(N,), dtype-int
        """
        ...

    @property
    def feature_component(self) -> xr.DataArray:
        """
        Coordinate, shape-(D,), dtype-int
        The index for each component of a feature vector.
        """
        ...

    @property
    def features(self) -> xr.DataArray:
        """
        Data-Variable, shape-(N, D), dtype-float
        The D-dimensional vector for each feature.
        Coordinates:
          * time [N]
          * feature_component [D]
        """
        ...

    @property
    def temperatures(self) -> xr.DataArray:
        """
        Data-Variable, shape-(N,), dtype-float
        Temperature measurements.
        Coordinates:
          * time [N]
        """
        ...

With this, I can write functions like:

def process_dataset(data: DataSetA):
    ...

Not only does this annotation succinctly document to users what flavor of dataset process_dataset expects, it also lets static tooling auto-complete and statically check the uses of data within the function against this protocol. This is really nice to have.

It would be great to be able to write DataSetA so that it serves as a schema as well. In this way, DataSetA would serve as:

  1. Documentation for users
  2. A type that can be understood by static analysis tooling
  3. A schema for runtime validation.

Obviously, this would involve substantially more sophisticated return types for the coordinates and data variables, beyond xr.DataArray. Shape and dtype info would need to be specified as well. Perhaps particular forms of Annotated[xr.DataArray, ...] would suffice.
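To make the Annotated idea concrete, here is a minimal sketch of how a protocol's property annotations could carry schema metadata that a runtime check reads back. Everything in it is an assumption for illustration: Spec, validate and fake_array are hypothetical names, not part of xarray-schema or xarray, and ArrayLike is a stand-in used only so the sketch runs without xarray installed (in real code xr.DataArray would take its place).

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Annotated, Any, Protocol, get_type_hints


class ArrayLike(Protocol):
    """Stand-in for xr.DataArray so the sketch runs without xarray."""

    dims: tuple
    dtype: Any


@dataclass(frozen=True)
class Spec:
    """Hypothetical metadata: expected dimension names and numpy dtype kind."""

    dims: tuple
    kind: str  # numpy dtype kind: "i" = signed integer, "f" = float


class DataSetA(Protocol):
    @property
    def time(self) -> Annotated[ArrayLike, Spec(("time",), "i")]: ...

    @property
    def features(
        self,
    ) -> Annotated[ArrayLike, Spec(("time", "feature_component"), "f")]: ...


def validate(obj: Any, schema: type) -> list[str]:
    """Check obj against every Spec-annotated property declared on schema."""
    errors = []
    for name, member in vars(schema).items():
        if not isinstance(member, property):
            continue
        hint = get_type_hints(member.fget, include_extras=True).get("return")
        # Annotated[...] hints expose their extra arguments as __metadata__.
        for spec in getattr(hint, "__metadata__", ()):
            if not isinstance(spec, Spec):
                continue
            arr = getattr(obj, name, None)
            if arr is None:
                errors.append(f"{name}: missing")
            elif tuple(arr.dims) != spec.dims:
                errors.append(f"{name}: dims {tuple(arr.dims)} != {spec.dims}")
            elif arr.dtype.kind != spec.kind:
                errors.append(
                    f"{name}: dtype kind {arr.dtype.kind!r} != {spec.kind!r}"
                )
    return errors


def fake_array(dims, kind):
    """Build a duck-typed array exposing only .dims and .dtype.kind."""
    return SimpleNamespace(dims=dims, dtype=SimpleNamespace(kind=kind))


good = SimpleNamespace(
    time=fake_array(("time",), "i"),
    features=fake_array(("time", "feature_component"), "f"),
)
# An empty list means the object satisfies the schema.
print(validate(good, DataSetA))
```

Because the checker only looks at .dims and .dtype.kind, the same function should also accept a real xr.Dataset, though that is untested here. Static tools ignore the Spec metadata and see only the first argument of Annotated, so the protocol keeps its documentation and type-checking roles.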

Finally, I have to flag a substantial shortcoming of DataSetA: it doesn't "look" like a proper xarray.Dataset to static analysis tools. For example, .loc and .sel don't exist on it. So really, there need to be proper protocols that describe xarray.DataArray and xarray.Dataset, which can be subclassed by the likes of DataSetA to remedy this. It isn't clear to me whether xarray itself would ship such protocols, or whether xarray-schema would do so.
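As a rough sketch of that layering, a generic base protocol could name the Dataset members that functions rely on, and each schema protocol could extend it. DatasetLike below is hypothetical and covers only .loc and .sel with simplified signatures; neither xarray nor xarray-schema is known to ship such a class.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DatasetLike(Protocol):
    """Hypothetical base protocol naming a few Dataset members."""

    @property
    def loc(self) -> Any: ...

    def sel(self, **indexers: Any) -> "DatasetLike": ...


@runtime_checkable
class DataSetA(DatasetLike, Protocol):
    """Schema-specific members layered on top of the generic base."""

    @property
    def time(self) -> Any: ...

    @property
    def features(self) -> Any: ...


def first_step(data: DataSetA) -> DatasetLike:
    # A type checker accepts both the schema attribute and the generic method.
    _ = data.features
    return data.sel(time=0)
```

Listing Protocol again in the subclass bases keeps DataSetA a protocol rather than a concrete class. The runtime_checkable decorator only enables isinstance checks for the presence of the named members; it does not inspect shapes or dtypes.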

Thanks for reading this post. I'll be interested to hear your thoughts on this!

Author

rsokl commented Apr 9, 2022

I decided to open an issue on xarray to propose that they implement protocols for Dataset and DataArray.

pydata/xarray#6462

Contributor

jhamman commented Sep 14, 2022

Sorry @rsokl for missing your post for so long. I think this is an interesting idea and one worth exploring. @andersy005 has also thought of something similar in the context of pydantic.
