Generating references without kerchunk #78

Open
TomNicholas opened this issue Apr 10, 2024 · 3 comments
Labels
enhancement New feature or request references generation Reading byte ranges from archival files

Comments

@TomNicholas
Owner

TomNicholas commented Apr 10, 2024

VirtualiZarr + zarr chunk manifests re-implement so much of kerchunk that the only part left is kerchunk's backends - the part that actually generates the byte ranges from a given legacy file. It's interesting to imagine whether we could make virtualizarr work without using kerchunk or fsspec at all.

#61 (comment) discusses how the rust object-store crate might allow us to read actual bytes from zarr v3 stores with chunk manifests over S3, without using fsspec.

The other place we use fsspec (+ kerchunk) is to generate the references in the first place. But can we imagine alternative implementations for generating that byte range information?

Arguments for doing this without using kerchunk + fsspec are essentially:

  • increased reliability
  • clearer interfaces
  • not using an overly complex tool (i.e. fsspec, which can read from all sorts of systems) to read from just two places (local or S3)
  • possible performance increases during reference generation (though this is unlikely to be a major bottleneck)
  • separation of concerns - if we can find other libraries that already generate the byte range information for their own purposes (e.g. h5py using the ros3 driver, hidefix, or cog3pio), we might be able to avoid bearing that maintenance burden, which would be great.
@TomNicholas
Owner Author

From @sharkinsspatial:

I just had a quick look at using hidefix to support ChunkManifest generation without kerchunk. Digging in a bit more detail, IIUC hidefix is still using the core HDF5 library to generate an Index object which can then be used to bypass the HDF5 concurrency limitations.
As we are only iterating over the chunk offsets and not actually reading any data, this doesn’t provide us any advantage over h5py. Given that, I’ll try to focus on just creating an h5py based ChunkManifest lib for the near term.
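A rough sketch of what "iterating over the chunk offsets" with h5py could look like, for illustration only. It relies on h5py's low-level `DatasetID.get_num_chunks()` and `DatasetID.get_chunk_info()` calls (available with recent h5py and HDF5 versions). The function names, the `ChunkEntry` type, and the `{"path", "offset", "length"}` entry layout are assumptions made for this example, not the actual VirtualiZarr implementation.

```python
from typing import Dict, TypedDict


class ChunkEntry(TypedDict):
    # Hypothetical entry layout, assumed for this sketch.
    path: str
    offset: int
    length: int


def chunk_entries(dataset, path: str) -> Dict[str, ChunkEntry]:
    """Map zarr-style chunk keys to byte ranges for one chunked h5py.Dataset.

    Only chunk metadata is queried; no array data is read.
    """
    if dataset.chunks is None:
        raise ValueError("dataset is not chunked")
    dsid = dataset.id  # low-level h5py DatasetID
    entries: Dict[str, ChunkEntry] = {}
    for i in range(dsid.get_num_chunks()):
        # StoreInfo fields: chunk_offset, filter_mask, byte_offset, size
        info = dsid.get_chunk_info(i)
        # chunk_offset is given in array elements; dividing by the chunk
        # shape gives the chunk's position in the chunk grid.
        index = [start // step for start, step in zip(info.chunk_offset, dataset.chunks)]
        key = ".".join(str(n) for n in index)
        entries[key] = {"path": path, "offset": info.byte_offset, "length": info.size}
    return entries


def entries_for_variable(path: str, name: str) -> Dict[str, ChunkEntry]:
    """Open a local HDF5 file and collect byte ranges for one variable."""
    import h5py  # imported lazily so that only this helper needs h5py installed

    with h5py.File(path, "r") as f:
        return chunk_entries(f[name], path)
```

Contiguous (unchunked) datasets would need separate handling, which this sketch leaves out.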

@sharkinsspatial
Collaborator

I took a quick first look at VirtualiZarr last night (amazing 🎊, thank you for pushing this forward). I can re-purpose/simplify the existing kerchunk HDF backend to directly support generating ChunkManifests.

A few questions for creating a PR for this.

  1. kerchunk's SingleHdf5ToZarr will generate corresponding Zarr groups for HDF5 groups in a file. I may be misunderstanding the open_virtual_dataset logic, but it is not currently attempting to build Dataset containers representing nested HDF5 groups in a file, correct? If I am understanding this correctly, is this something we do want to support, or should variables in nested HDF5 groups be flattened?
  2. As this PR would only replace kerchunk use for ChunkManifest generation for HDF5 files, we would still require the existing logic for using KerchunkStoreRefs for other formats. What would be the least intrusive way to incorporate this, some format specific branching logic directly in open_virtual_dataset?

@TomNicholas
Owner Author

TomNicholas commented Apr 16, 2024

I can re-purpose/simplify the existing kerchunk HDF backend to directly support generating ChunkManifests.

That would be awesome, thanks @sharkinsspatial ❤️ . I think whilst we're doing this we should try hard here to improve test coverage and understanding of behaviour in nasty cases. I'm thinking about #38 in particular.

is this something we do want to support, or should variables in nested HDF5 groups be flattened?

This is totally equivalent to the xarray.Dataset vs xarray.DataTree correspondence. We should actually add an open_virtual_datatree function which opens all the groups, and add an (optional) group kwarg to open_virtual_dataset. I'll make a new issue for groups now. (See also #11)
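To make the Dataset vs DataTree distinction concrete, here is a toy sketch that works on plain HDF5-style variable paths rather than real files. Both helper names are hypothetical: `variables_in_group` mimics what a `group` kwarg would select, and `group_tree` mimics what opening every group at once would collect.

```python
from typing import Dict, List


def variables_in_group(variable_paths: List[str], group: str = "/") -> List[str]:
    """Return the names of variables that sit directly inside `group`."""
    prefix = "/" + group.strip("/")
    if not prefix.endswith("/"):
        prefix += "/"
    names = []
    for raw in variable_paths:
        full = "/" + raw.strip("/")
        rest = full[len(prefix):]
        # Keep only direct children: no further "/" after the group prefix.
        if full.startswith(prefix) and rest and "/" not in rest:
            names.append(rest)
    return names


def group_tree(variable_paths: List[str]) -> Dict[str, List[str]]:
    """Collect variables for every group, keyed by group path."""
    tree: Dict[str, List[str]] = {}
    for raw in variable_paths:
        parent, _, name = ("/" + raw.strip("/")).rpartition("/")
        tree.setdefault(parent or "/", []).append(name)
    return tree
```

With paths such as `/lat` and `/model/temp`, the default `group="/"` selects only `lat`, while `group_tree` keeps both, each under its own group.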

What would be the least intrusive way to incorporate this, some format specific branching logic directly in open_virtual_dataset?

I think what you're suggesting is the least intrusive way. The main thing is to keep code that actually depends on the kerchunk library isolated from the rest of the code base. I would perhaps make a new readers directory or something to distinguish it from the kerchunk.py wrapper of the kerchunk backends.
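One possible shape for that format-specific branching, as a sketch only. Every name here is hypothetical and the reader bodies are stand-ins. The idea is that formats with a native reader are dispatched to it, everything else falls back to the kerchunk-backed path, and only that fallback would ever import kerchunk.

```python
from typing import Callable, Dict

Reader = Callable[[str], dict]


def read_hdf5_native(path: str) -> dict:
    """Stand-in for an h5py-based reader; a real one would return chunk manifests."""
    return {"reader": "native-hdf5", "path": path}


def read_via_kerchunk(path: str) -> dict:
    """Stand-in for the existing kerchunk-backed path.

    In a real code base this would be the only function that imports kerchunk.
    """
    return {"reader": "kerchunk", "path": path}


# Formats that have a kerchunk-free reader; others use the fallback.
NATIVE_READERS: Dict[str, Reader] = {"hdf5": read_hdf5_native}


def read_references(path: str, filetype: str) -> dict:
    """Pick a reader by file type, falling back to kerchunk when none is registered."""
    reader = NATIVE_READERS.get(filetype, read_via_kerchunk)
    return reader(path)
```

Adding a native reader for another format would then only mean adding an entry to `NATIVE_READERS`, leaving the caller unchanged.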


2 participants