Extension for attribute datatypes? #229

Open
ajelenak opened this issue Apr 11, 2023 · 6 comments
Labels
protocol-extension Protocol extension related issue

Comments

@ajelenak

Hello!

Hope this is a good place for my question: Has there been any interest before in more explicit Zarr attribute datatypes? My understanding of the Zarr v3 draft specification is that attribute values can be of any valid JSON datatype. Is the expectation that casting an attribute value, for example a scalar 1, into a specific software datatype such as int32 or uint64 will be done by Zarr readers depending on the context?

Thanks!
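
To make the question concrete, here is a minimal sketch using only Python's standard library. The attribute name `scale_factor` and the helper `cast_attr` are hypothetical and not part of any Zarr specification or library; the point is only that the JSON text carries no width or signedness, so the reader has to choose one.

```python
import json
import struct

# A zarr.json-style attributes fragment; "scale_factor" is a made-up name.
attrs = json.loads('{"scale_factor": 1}')
value = attrs["scale_factor"]  # a plain Python int: no width, no signedness

# Hypothetical reader-side cast: the reader, not the file, picks the type.
_FORMATS = {"int32": "<i", "uint64": "<Q"}

def cast_attr(value, dtype):
    """Pack value as little-endian bytes of the requested dtype."""
    return struct.pack(_FORMATS[dtype], value)

as_int32 = cast_attr(value, "int32")    # 4 bytes
as_uint64 = cast_attr(value, "uint64")  # 8 bytes
```

The same JSON value ends up as two different in-memory representations, depending entirely on what the reader assumes.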

@rabernat
Contributor

Hi Alex! There has been some discussion on this, yes. I'm going to transfer your issue to zarr-specs just because that's where most of the other discussion is. (Yes, I agree we have too many repos.)

Related

@rabernat rabernat transferred this issue from zarr-developers/zeps Apr 11, 2023
@rabernat
Contributor

rabernat commented Apr 11, 2023

Here's my personal idea for the best way to implement this.

We define a zarr v3 extension for an external attributes file (the V3 spec puts attributes in "zarr.json") and allow this to be a binary storage format like CBOR, msgpack, etc. That would solve both the "large amount of metadata" and the "explicit attribute datatypes" issue in one go.

@jbms
Contributor

jbms commented Apr 11, 2023

As Ryan noted, we've had a lot of discussions about attributes and a number of solutions have been proposed but no consensus has yet been reached.

I think it may be helpful to distinguish between a number of different issues:

  1. Ability to represent a richer set of data values as attributes than is supported by JSON, e.g. byte strings, non-finite floating point numbers, timestamps, potentially any valid zarr v3 data type, potentially a multi-dimensional array
  2. Explicitly associating a data type, such as int32, uint64 (in some form) with attributes, making it possible to distinguish e.g. 42 of type int32 and 42 of type uint64. There is also the question of whether this typing should apply only to scalars, or if it should be supported for containers as well, e.g. is there just a generic array type or is there array<uint32_t> and array<map<unicode_string, uint32_t>>.
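
The two issues can be illustrated with a short sketch based on Python's standard `json` module. The helper `strict_json` is a name introduced here for illustration only; `allow_nan=False` is used so that the encoder follows strict JSON rather than Python's permissive default.

```python
import json
import math

def strict_json(value):
    """Return the JSON text for value, or the name of the error raised."""
    try:
        return json.dumps(value, allow_nan=False)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__

# Issue (1): values that strict JSON cannot represent at all.
nan_result = strict_json(math.nan)         # NaN is rejected with a ValueError
bytes_result = strict_json(b"\x00\x01")    # bytes are rejected with a TypeError

# Issue (2): values that JSON can represent but cannot type.
# "42 of type int32" and "42 of type uint64" produce identical text.
int32_text = strict_json(42)
uint64_text = strict_json(42)
```

Issue (1) shows up as an encoding failure, while issue (2) is silent: the text round-trips fine, but the type information was never there to begin with.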

CBOR provides a way to encode a richer set of values (1) but does not by itself provide a way to distinguish between e.g. an int32 and a uint64. However, CBOR does provide a way to associate an arbitrary integer tag with a value, and in principle zarr could define tag values to indicate data types. I don't know how well that would work in practice, e.g. how well it would allow the zarr metadata to be read and written by other tools, but it might work reasonably well.
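
As a rough sketch of the tag idea, the following hand-rolled encoder covers only CBOR integers and tags, following the head encoding described in RFC 8949. The tag numbers `TAG_INT32` and `TAG_UINT64` are hypothetical values picked for illustration; zarr has not defined or registered any such tags.

```python
import struct

def cbor_head(major, n):
    """Encode a CBOR head: 3-bit major type plus unsigned argument n."""
    if n < 24:
        return bytes([(major << 5) | n])
    for info, fmt, limit in ((24, ">B", 1 << 8), (25, ">H", 1 << 16),
                             (26, ">I", 1 << 32), (27, ">Q", 1 << 64)):
        if n < limit:
            return bytes([(major << 5) | info]) + struct.pack(fmt, n)
    raise ValueError("argument does not fit in 64 bits")

def cbor_int(n):
    """Major type 0 for non-negative integers, major type 1 for negative."""
    return cbor_head(0, n) if n >= 0 else cbor_head(1, -1 - n)

def cbor_tagged(tag, item):
    """Major type 6: a tag number followed by the tagged data item."""
    return cbor_head(6, tag) + item

# HYPOTHETICAL tag numbers, for illustration only.
TAG_INT32 = 60000
TAG_UINT64 = 60001

plain = cbor_int(42)                               # same bytes for any intended type
as_int32 = cbor_tagged(TAG_INT32, cbor_int(42))    # tag prefix marks "int32"
as_uint64 = cbor_tagged(TAG_UINT64, cbor_int(42))  # tag prefix marks "uint64"
```

Untagged, the value 42 encodes to the same two bytes whatever type was intended; with a tag, the two encodings differ only in the tag prefix. A generic CBOR reader that does not know the tags could still decode the integer, but whether it keeps or drops the tag depends on the tool.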

msgpack similarly provides a way to encode a richer set of values but does not particularly help with (2). It provides an extension mechanism but I don't think the extension mechanism would work well for the purpose of indicating a data type.

@jstriebel jstriebel added the protocol-extension Protocol extension related issue label Apr 11, 2023
@ajelenak
Author

Thank you @rabernat for routing my question to a more appropriate place.

@jbms I agree with your breakdown of use cases. For the container cases like array<uint32_t> the approach could be the same as for Zarr arrays: a shape and a datatype.
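
A minimal sketch of what "a shape and a datatype" could look like for an attribute, assuming a made-up JSON layout with the keys `data_type`, `shape` and `values`. None of these names, nor the helper `typed_attr`, come from the Zarr specification, and only a few integer types are handled.

```python
import json
import math

# Hypothetical integer ranges keyed by Zarr-style data type names.
_RANGES = {
    "int32": (-2**31, 2**31 - 1),
    "uint32": (0, 2**32 - 1),
    "uint64": (0, 2**64 - 1),
}

def typed_attr(data_type, shape, values):
    """Build a typed attribute record from a flat list of integer values."""
    lo, hi = _RANGES[data_type]
    if len(values) != math.prod(shape):
        raise ValueError("number of values does not match shape")
    if not all(isinstance(v, int) and lo <= v <= hi for v in values):
        raise ValueError(f"value out of range for {data_type}")
    return {"data_type": data_type, "shape": list(shape), "values": list(values)}

record = typed_attr("uint32", [3], [1, 2, 3])  # a 1-D array<uint32_t>
scalar = typed_attr("uint64", [], [42])        # empty shape means a scalar
encoded = json.dumps(record)
```

Nested containers such as array<map<unicode_string, uint32_t>> do not fit this flat layout, which is one place where the extra complexity would start to show.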

Can this discussion be combined with one of the mentioned issues, or is it better to keep it separate?

@jbms
Contributor

jbms commented Apr 12, 2023

I think you will have to decide yourself whether your comments are closely related to one of the existing issues.

I don't think there is an existing issue specifically about storing an explicit data type for each attribute. But I believe that is the approach taken by nczarr (netcdf zarr).

In general, while I can see that there may be some value in being explicit about data types, and that it provides better compatibility with the HDF5 data model, it also seems to me that it would introduce a lot of additional complexity and it is not clear exactly which use cases, other than HDF5 compatibility, benefit from it.

In contrast, merely extending the set of values that can be represented seems more promising.

But if you have a compelling proposal for how to add support for explicit data types, I'd certainly be interested.

@joshmoore
Member

I couldn't find a better issue, so I'll cross-reference here an impending need for datetime support from bluesky/tiled#514.


5 participants