Generating references without kerchunk #78

TomNicholas · 2024-04-10T19:34:01Z

VirtualiZarr + zarr chunk manifests re-implement so much of kerchunk that the only part left is kerchunk's backends - the part that actually generates the byte ranges from a given legacy file. It's interesting to imagine whether we could make virtualizarr work without using kerchunk or fsspec at all.

#61 (comment) discusses how the rust object-store crate might allow us to read actual bytes from zarr v3 stores with chunk manifests over S3, without using fsspec.

The other place we use fsspec (+ kerchunk) is to generate the references in the first place. But can we imagine alternative implementations for generating that byte range information?

Arguments for doing this without using kerchunk + fsspec are essentially:

increased reliability
clearer interfaces
not using an overly complex tool (i.e. fsspec, which can read from all sorts of systems) to read from just two places (local or S3)
possible performance increases during reference generation (though this is unlikely to be a major bottleneck)
separation of concerns - if we can find other libraries that generate the byte range information already for their own purposes (e.g. h5py using the ros3 driver, hidefix, or cog3pio), we might be able to avoid bearing that maintenance burden, which would be great.


TomNicholas · 2024-04-16T17:22:36Z

From @sharkinsspatial:

I just had a quick look at using hidefix to support ChunkManifest generation without kerchunk. Digging in a bit more detail, IIUC hidefix is still using the core HDF5 library to generate an Index object which can then be used to bypass the HDF5 concurrency limitations.
As we are only iterating over the chunk offsets and not actually reading any data, this doesn’t give us any advantage over h5py. Given that, I’ll try to focus on just creating an h5py-based ChunkManifest lib for the near term.
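
For illustration, here is a minimal sketch of what iterating over chunk offsets with h5py alone might look like. It uses h5py's low-level `DatasetID` methods (`get_num_chunks`, `get_chunk_info`, `get_offset`, `get_storage_size`); the function name `chunk_byte_ranges` and the `path`/`offset`/`length` dict layout are assumptions made for this example, not VirtualiZarr's or kerchunk's actual API, and filters/compression are ignored entirely.

```python
import h5py


def chunk_byte_ranges(filepath, dataset_name):
    """Map zarr-style chunk keys (e.g. "1.0") to byte ranges for one HDF5 dataset.

    Hypothetical helper: only chunk metadata is queried, no array data is read.
    """
    manifest = {}
    with h5py.File(filepath, "r") as f:
        dset = f[dataset_name]
        dsid = dset.id  # low-level DatasetID handle

        if dset.chunks is None:
            # Contiguous storage: treat the whole array as a single chunk.
            offset = dsid.get_offset()  # None if no space has been allocated yet
            if offset is not None:
                key = ".".join("0" for _ in dset.shape) or "0"
                manifest[key] = {
                    "path": filepath,
                    "offset": offset,
                    "length": dsid.get_storage_size(),
                }
            return manifest

        for i in range(dsid.get_num_chunks()):
            info = dsid.get_chunk_info(i)
            # info.chunk_offset is in array elements; convert to a chunk-grid index.
            index = [start // size for start, size in zip(info.chunk_offset, dset.chunks)]
            key = ".".join(str(n) for n in index)
            manifest[key] = {
                "path": filepath,
                "offset": info.byte_offset,  # position of the chunk within the file
                "length": info.size,         # stored (possibly compressed) size in bytes
            }
    return manifest
```

Calling `chunk_byte_ranges("file.h5", "temperature")` (both names are placeholders) would return one entry per allocated chunk. Chunks that were never written simply do not appear, so a real implementation would need to decide how to represent them.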

sharkinsspatial · 2024-04-16T18:55:12Z

I took a quick first look at VirtualiZarr last night (amazing 🎊, thank you for pushing this forward). I can re-purpose/simplify the existing kerchunk HDF backend to directly support generating ChunkManifests.

A few questions for creating a PR for this.

kerchunk's SingleHdf5ToZarr will generate corresponding Zarr groups for HDF5 groups in a file. I may be misunderstanding the open_virtual_dataset logic, but it is not currently attempting to build Dataset containers representing nested HDF5 groups in a file, correct? If I am understanding this correctly, is this something we want to support, or should variables in nested HDF5 groups be flattened?
As this PR would only replace kerchunk use for ChunkManifest generation for HDF5 files, we would still require the existing logic for using KerchunkStoreRefs for other formats. What would be the least intrusive way to incorporate this, some format specific branching logic directly in open_virtual_dataset?

TomNicholas · 2024-04-16T19:12:36Z

> I can re-purpose/simplify the existing kerchunk HDF backend to directly support generating ChunkManifests.

That would be awesome, thanks @sharkinsspatial ❤️. I think whilst we're doing this we should try hard to improve test coverage and our understanding of behaviour in nasty cases. I'm thinking about #38 in particular.

> is this something we do want to support or should variables in nested HDF5 groups be flattened.

This is totally equivalent to the xarray.Dataset vs xarray.DataTree correspondence. We should actually add an open_virtual_datatree function which opens all the groups, and add an (optional) group kwarg to open_virtual_dataset. I'll make a new issue for groups now. (See also #11)
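
To make the Dataset-vs-DataTree distinction concrete, here is a rough sketch of how the variables in an HDF5 file could be bucketed by group using h5py's `visititems`. The function `datasets_by_group` and its `group` argument are hypothetical names chosen for this example; this is not how open_virtual_dataset or open_virtual_datatree are implemented.

```python
import h5py


def datasets_by_group(filepath, group=None):
    """Return {group_path: [dataset names]} for an HDF5 file.

    With group=None every group is listed (the "datatree" view); with a group
    path only that group's variables are returned (the "dataset" view).
    """
    found = {}

    def visit(name, obj):
        # `name` is the path relative to the root, without a leading slash.
        if isinstance(obj, h5py.Dataset):
            parent, _, leaf = name.rpartition("/")
            found.setdefault("/" + parent, []).append(leaf)

    with h5py.File(filepath, "r") as f:
        f.visititems(visit)

    if group is not None:
        key = "/" + group.strip("/")
        return {key: found.get(key, [])}
    return found
```

Under this sketch nothing is flattened: a variable stored at `g/sub/u` is only visible when `/g/sub` is requested or when all groups are opened.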

> What would be the least intrusive way to incorporate this, some format specific branching logic directly in open_virtual_dataset?

I think what you're suggesting is the least intrusive way. The main thing is to keep code that actually depends on the kerchunk library isolated from the rest of the code base. I would perhaps make a new readers directory or something to distinguish it from the kerchunk.py wrapper of the kerchunk backends.
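
As a rough sketch of that kind of format-specific branching, something like the following could work. Every name here (`sniff_format`, `read_hdf5_native`, `read_via_kerchunk`, `READERS`, `select_reader`) is hypothetical, the reader bodies are placeholders, and the format check assumes the HDF5 signature sits at byte 0 of the file (i.e. no user block).

```python
from typing import Callable, Dict

# The 8-byte signature that opens an HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

Manifests = Dict[str, dict]


def sniff_format(filepath: str) -> str:
    """Guess the file format from its first bytes (assumes no HDF5 user block)."""
    with open(filepath, "rb") as f:
        header = f.read(len(HDF5_SIGNATURE))
    return "hdf5" if header == HDF5_SIGNATURE else "other"


def read_hdf5_native(filepath: str) -> Manifests:
    """Hypothetical kerchunk-free reader; a real one would walk the file with h5py."""
    raise NotImplementedError("placeholder for an h5py-based reader")


def read_via_kerchunk(filepath: str) -> Manifests:
    """Hypothetical fallback: the only function that would import kerchunk."""
    import kerchunk  # noqa: F401  (lazy import keeps the dependency isolated here)

    raise NotImplementedError("placeholder for the existing kerchunk-based path")


# Formats with a native reader; anything else falls back to kerchunk.
READERS: Dict[str, Callable[[str], Manifests]] = {"hdf5": read_hdf5_native}


def select_reader(filepath: str) -> Callable[[str], Manifests]:
    """Pick the reader function for a file without calling it."""
    return READERS.get(sniff_format(filepath), read_via_kerchunk)
```

The design choice is that kerchunk is imported inside the fallback function only, so the rest of the code base (and any native reader) can be imported and tested without kerchunk installed.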

This was referenced Apr 10, 2024

Generating references from files in S3 (using kerchunk + fsspec) #61

Open

Using hidefix to determine byte ranges in HDF files? gauteh/hidefix#38

Open

TomNicholas added the enhancement New feature or request label Apr 10, 2024

TomNicholas mentioned this issue Apr 10, 2024

Using cog3pio to determine byte ranges in COG files? weiji14/cog3pio#16

Open

TomNicholas mentioned this issue Apr 16, 2024

Support for groups #84

Open

TomNicholas added the references generation Reading byte ranges from archival files label Apr 17, 2024

sharkinsspatial mentioned this issue Apr 22, 2024

[Draft] Non-kerchunk backend for HDF5/netcdf4 files. #87

Draft
