
Document special attributes and mapping #13

Open · rly opened this issue Mar 15, 2024 · 10 comments

@rly (Contributor) commented Mar 15, 2024

Document the following special attributes and their mappings during translation:

_REFERENCE
_EXTERNAL_ARRAY_LINK
_SCALAR

@bendichter commented

In an effort to promote interoperability, would it be possible to use Kerchunk's method for indicating scalar datasets, which it apparently inherited from netCDF4?

https://github.com/fsspec/kerchunk/blob/6fe1f0aa6d33d856ca416bc13a290e2276d3bdb1/kerchunk/hdf.py#L543-L549

@magland (Collaborator) commented Mar 16, 2024

> In an effort to promote interoperability, would it be possible to use Kerchunk's method for indicating scalar datasets, which it apparently inherited from netCDF4?
>
> https://github.com/fsspec/kerchunk/blob/6fe1f0aa6d33d856ca416bc13a290e2276d3bdb1/kerchunk/hdf.py#L543-L549

This is something we'll need to consider. I'm hesitant to put _ARRAY_DIMENSIONS with phony_dim_x on every dataset... the _SCALAR=True method seems more straightforward. But I do understand there are interoperability benefits. It would be helpful to identify a scenario where we'd actually need that attribute in order to use some tool. I'm also hesitant to go down the path where we end up with many attributes supporting all the various projects (kerchunk, lindi, hdmf-zarr) instead of putting logic in the various tools to handle the different cases.
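For concreteness, the two options being weighed here could look roughly like this as plain attribute dictionaries (illustrative only; the empty-list-for-scalars behavior is spelled out in the summary below):

```python
# Option 1: an explicit flag in the dataset's attributes (the _SCALAR approach).
scalar_attrs_flag = {"_SCALAR": True}

# Option 2: the Xarray/kerchunk convention: dimension names live in
# _ARRAY_DIMENSIONS; a scalar gets an empty list, and a dataset without
# real dimension names gets phony ones.
scalar_attrs_dims = {"_ARRAY_DIMENSIONS": []}
nonscalar_attrs_dims = {"_ARRAY_DIMENSIONS": ["phony_dim_0", "phony_dim_1"]}
```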

@rly (Contributor, Author) commented Mar 16, 2024

> In an effort to promote interoperability, would it be possible to use Kerchunk's method for indicating scalar datasets, which it apparently inherited from netCDF4?
>
> fsspec/kerchunk@6fe1f0a/kerchunk/hdf.py#L543-L549

Because information about this is scattered throughout issues and docs, I wanted to summarize:

Xarray is a popular Python library for working with labelled multi-dimensional arrays. It reads and writes netCDF4 files by default (these are specially organized HDF5 files). Xarray requires dimension names to work, and it can read/write them from/to netCDF4 and HDF5 files (it uses HDF5 dimension scales). Scalar datasets in Xarray and netCDF4 are indicated by the lack of dimension names. All netCDF4 datasets have dimension names for non-scalar data and lack dimension names for scalar data, so Xarray and netCDF4 are compatible. But not all HDF5 datasets have dimension names. When Xarray loads an HDF5 dataset without dimension names, it generates phony dimension names for them in memory and on write.

Xarray can also read and write Zarr files, but Zarr does not support storing dimension names, so to write Xarray-compatible Zarr files, Xarray defined a special Zarr array attribute, _ARRAY_DIMENSIONS, to store the dimension names. When reading Zarr files, it then looks for that attribute. See https://docs.xarray.dev/en/latest/internals/zarr-encoding-spec.html .
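A minimal sketch of that convention in action (names like example.zarr and signal are only for illustration; this assumes the Zarr v2 format covered by the encoding spec linked above):

```python
import numpy as np
import xarray as xr
import zarr

# Write a small dataset with named dimensions to Zarr via Xarray...
ds = xr.Dataset({"signal": (("time", "channel"), np.zeros((100, 4)))})
ds.to_zarr("example.zarr", mode="w")

# ...and the dimension names come back as the _ARRAY_DIMENSIONS attribute.
arr = zarr.open("example.zarr")["signal"]
print(arr.attrs["_ARRAY_DIMENSIONS"])  # expected: ['time', 'channel']
```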

Kerchunk, in order to generate Xarray-compatible Zarr files, uses the same convention - it creates the attribute _ARRAY_DIMENSIONS. If the dataset is scalar, then the list is empty; else, it sets the list to be phony dimension names.

So adding the _ARRAY_DIMENSIONS attribute (using phony dim names when no dimension scales are present) allows the Zarr data to be read by Xarray and any other tools that adopt the same convention when reading Zarr data. I see the value of following the same convention, but I am also hesitant to adopt it until we have a need for it.

@bendichter commented

Thanks for the summary, @rly! I was not aware of a lot of that.

I do think that Xarray support would be quite valuable, but this may not be the best way to do it. Many of these dataset dimensions really should have names as indicated by the NWB schema.

@rly (Contributor, Author) commented Mar 16, 2024

Just to add: the NetCDF group has its own Zarr implementation called NCZarr, which has its own conventions. I think Xarray supports reading both the .zattrs["_ARRAY_DIMENSIONS"] convention and the NCZarr .zarray["_NCZARR_ARRAY"]["dimrefs"] convention. NCZarr can also write Zarr files following the .zattrs["_ARRAY_DIMENSIONS"] convention if mode=xarray. See pydata/xarray#6374 .

IMO, this demonstrates the complexity of having too many different conventions and the danger of adding another. https://xkcd.com/927/

For simplicity, I'm still inclined to follow neither convention until Xarray (or netCDF) is within scope, but perhaps that is naive.
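If the preference is to keep such logic in the reading tools, a hypothetical helper could check both conventions; the key paths below are the ones mentioned above, and the NCZarr handling is simplified:

```python
def get_dimension_names(zattrs: dict, zarray: dict):
    """Hypothetical reader-side lookup of dimension names under either convention."""
    # Xarray convention: names stored directly in .zattrs["_ARRAY_DIMENSIONS"].
    if "_ARRAY_DIMENSIONS" in zattrs:
        return list(zattrs["_ARRAY_DIMENSIONS"])
    # NCZarr convention: dimension references in .zarray["_NCZARR_ARRAY"]["dimrefs"].
    dimrefs = zarray.get("_NCZARR_ARRAY", {}).get("dimrefs")
    if dimrefs is not None:
        return list(dimrefs)  # references to dimension objects, not bare names
    return None  # no dimension names recorded under either convention
```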

> When Xarray loads an HDF5 dataset without dimension names, it generates phony dimension names for them in memory and on write.

Just to add: technically, Xarray itself doesn't do this; both the default I/O engine, netcdf4, and the alternate I/O engine for HDF5 files, h5netcdf, do.

I don't know why Xarray doesn't generate phony dimension names when reading Zarr arrays without dimension names. That would make things easier...

@magland (Collaborator) commented Mar 17, 2024

Just to add onto this... custom zarr stores are easy to make, so one can create adapters that attach the various needed attributes for different contexts. For example, you could have a simple adapter that adds _ARRAY_DIMENSIONS wherever it is needed. So you'd have

store = ... some store we are working with
store2 = adapter(store)

with no loss of efficiency.
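A minimal sketch of what such an adapter could look like for a Zarr v2 store (the class name and details are illustrative, not an existing API):

```python
import json
from collections.abc import MutableMapping

class ArrayDimensionsAdapter(MutableMapping):
    """Illustrative wrapper around a Zarr v2 store (a mapping with keys like
    'group/array/.zattrs') that injects _ARRAY_DIMENSIONS on read, so tools
    like Xarray see dimension names without them being persisted."""

    def __init__(self, base_store):
        self._base = base_store

    def __getitem__(self, key):
        value = self._base[key]
        if key.endswith(".zattrs"):
            attrs = json.loads(value)
            zarray_key = key[: -len(".zattrs")] + ".zarray"
            if "_ARRAY_DIMENSIONS" not in attrs and zarray_key in self._base:
                shape = json.loads(self._base[zarray_key])["shape"]
                # Phony names for now; a schema-aware adapter could supply real ones.
                attrs["_ARRAY_DIMENSIONS"] = [f"phony_dim_{i}" for i in range(len(shape))]
                value = json.dumps(attrs).encode("utf-8")
        return value

    def __setitem__(self, key, value):
        self._base[key] = value

    def __delitem__(self, key):
        del self._base[key]

    def __iter__(self):
        return iter(self._base)

    def __len__(self):
        return len(self._base)

# store = ... some store we are working with
# store2 = ArrayDimensionsAdapter(store)  # hand this to zarr/xarray instead
```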

@magland (Collaborator) commented Mar 21, 2024

@rly (Contributor, Author) commented Mar 21, 2024

It looks like the Allen Institute for Neural Dynamics would like to use xarray with NWB Zarr files: hdmf-dev/hdmf-zarr#176

@magland (Collaborator) commented Mar 21, 2024

> It looks like the Allen Institute for Neural Dynamics would like to use xarray with NWB Zarr files: hdmf-dev/hdmf-zarr#176

Good to know. As I suggested above, I would propose an adapter that adds the phony_dim _ARRAY_DIMENSIONS attributes, rather than having them in the .zarr.json.

@oruebel commented Mar 21, 2024

> that adds the phony_dim _ARRAY_DIMENSIONS attributes

For the case where the array dimensions are unknown, I agree that having a way to emulate them rather than storing invalid information is probably preferable. However, in the case of NWB, we often know the dimension names from the schema, so it would be nice to have those reflected.
