Virtual datasets from Zarr stores #63

Open
maxrjones opened this issue Mar 28, 2024 · 8 comments
Labels
enhancement New feature or request references generation Reading byte ranges from archival files

Comments

@maxrjones (Collaborator)

@norlandrhagen and I were discussing creating virtual datasets from Zarr stores earlier today (a placeholder already exists in _automatically_determine_filetype). @TomNicholas, what are your thoughts on trying out Kerchunk's single_zarr for this purpose? I think this could be a helpful step towards virtual concatenation of Zarr stores, and towards allowing manifests to replace consolidated metadata for V3.

@TomNicholas added the enhancement New feature or request label Mar 28, 2024
@TomNicholas (Owner)

I think this is a good idea, and one I already had a mental issue for anyway 😅 One neat thing this would allow is testing writing to Zarr stores as manifests by round-tripping.

Kerchunk's single_zarr

We could use this, and it would certainly be the quickest way, but perhaps we would be better off just writing the function ourselves? Otherwise we would be starting with a Zarr store, opening it using a dependency (kerchunk and fsspec), getting the opened results back as a kerchunk reference dict, and then immediately converting that reference dict to ManifestArrays. Maybe we should just skip kerchunk, read the zarr json manually, and create the ManifestArrays immediately.

I suspect also that if we do it that way (the "zarr-native" way), we might later find that either we can import and use code from zarr-python, or zarr-python can take inspiration from code we write here.
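A minimal sketch of this "zarr-native" direction, assuming a local Zarr v2 array store laid out as a `.zarray` document plus dot-delimited chunk files (the function name and manifest-dict shape here are hypothetical, not an existing API; a real implementation would also need to handle nested chunk layouts, remote stores, and V3 `zarr.json`):

```python
import json
import tempfile
from pathlib import Path

def read_zarr_v2_array_manifest(store_path):
    """Parse a local Zarr v2 array's .zarray metadata and list its chunk
    files, building a manifest dict of chunk key -> byte range (a rough
    stand-in for constructing a ManifestArray)."""
    store = Path(store_path)
    zarray = json.loads((store / ".zarray").read_text())
    manifest = {}
    for entry in store.iterdir():
        # Skip metadata documents (.zarray, .zattrs); everything else is a
        # chunk file named by its dot-delimited index, e.g. "0" or "0.0".
        if entry.name.startswith("."):
            continue
        manifest[entry.name] = {
            "path": str(entry),
            "offset": 0,  # each v2 chunk file is one whole compressed chunk
            "length": entry.stat().st_size,
        }
    return zarray, manifest

# Demo against a tiny hand-written v2 array store:
store = Path(tempfile.mkdtemp())
(store / ".zarray").write_text(json.dumps(
    {"zarr_format": 2, "shape": [4], "chunks": [2], "dtype": "<i4"}))
(store / "0").write_bytes(b"\x00" * 8)
(store / "1").write_bytes(b"\x00" * 8)
metadata, manifest = read_zarr_v2_array_manifest(store)
```

Note that the manifest comes from listing the store's keys, not from the `.zarray` document itself, since the metadata alone does not record which chunks exist.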

@TomNicholas (Owner) commented Mar 28, 2024

Actually, wait: I think I misunderstood what you were suggesting, @maxrjones. There are two types of Zarr stores we could read byte ranges from:

  1. Zarr v2/v3 stores which have chunks saved as compressed files (i.e. normal Zarr stores). These we could and should read using kerchunk's single_zarr as you suggested (although maybe we could change the implementation in the future).

  2. Zarr stores containing manifest.json files. This is what my comment above was referring to. This would be the inverse operation to vds.virtualize.to_zarr().
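For case (2), the read direction would be parsing a manifest.json back into per-chunk path/offset/length mappings. A small sketch of that inverse operation, with the entry format assumed (one {"path", "offset", "length"} record per chunk key; the function name is hypothetical):

```python
import json

def manifest_to_arrays(manifest_json: str):
    """Parse a manifest.json string back into parallel chunk-key ->
    path/offset/length mappings, the inverse direction to writing the
    manifest out in the first place."""
    entries = json.loads(manifest_json)
    paths = {key: entry["path"] for key, entry in entries.items()}
    offsets = {key: entry["offset"] for key, entry in entries.items()}
    lengths = {key: entry["length"] for key, entry in entries.items()}
    return paths, offsets, lengths

# Example manifest with two chunks pointing into one archival file:
example = json.dumps({
    "0.0": {"path": "s3://bucket/data.nc", "offset": 100, "length": 200},
    "0.1": {"path": "s3://bucket/data.nc", "offset": 300, "length": 200},
})
paths, offsets, lengths = manifest_to_arrays(example)
```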

@maxrjones (Collaborator, Author) commented Mar 28, 2024

Maybe we should just skip kerchunk and read the zarr json manually and create the ManifestArrays immediately.

When you say "read the zarr json manually" do you mean using zarr-python? I'm not sure how just loading the .zarray (V2) or zarr.json (V3) would work because it doesn't tell you which chunks are initialized.

@maxrjones (Collaborator, Author) commented Mar 28, 2024

Oops, didn't notice your comment before posting my last response.

Zarr v2/v3 stores which have chunks saved as compressed files (i.e. normal zarr stores). These we could and should read using kerchunk's single_zarr as you suggested (although maybe we could change the implementation in the future.)

I was referring to this case, which seems simpler to implement right now. Although (2) is also important.

@TomNicholas (Owner)

I was referring to this case, which seems simpler to implement right now.

Yep! If you want to make a PR for case (1) then go for it!

Although (2) is also important.

Yeah, but maybe I'll just add that in as part of #45.

@jhamman (Collaborator) commented Mar 28, 2024

allowing manifests to replace consolidated metadata for V3.

@maxrjones - can you expand on this? I see these as distinct features. Consolidated metadata rolls the group/array docs up to a single json, whereas the manifests concept covers the key mappings for individual arrays.

@maxrjones (Collaborator, Author)

allowing manifests to replace consolidated metadata for V3.

@maxrjones - can you expand on this? I see these as distinct features. Consolidated metadata rolls the group/array docs up to a single json, whereas the manifests concept covers the key mappings for individual arrays.

In my mind, a dataset containing multiple manifests serves the same purpose as consolidated metadata, with the added bonus of including the key mappings for individual arrays; dataset.to_dict(data=False) would then be a version of the manifests that accomplishes the same thing as consolidated metadata and could be used for V3. But I see now that you're talking about manifests only as the virtual representation of a single array.
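To make the distinction concrete, here is a rough sketch of the two concepts being contrasted (all layouts and values here are hypothetical illustrations, not either library's actual serialization format):

```python
# Consolidated metadata rolls every group/array metadata document in a
# store up into one JSON mapping (v2-style keys shown for brevity):
consolidated_metadata = {
    ".zgroup": {"zarr_format": 2},
    "air/.zarray": {"shape": [10, 10], "chunks": [5, 5], "dtype": "<f8"},
    "air/.zattrs": {"units": "K"},
}

# A manifest, by contrast, covers the chunk-key -> byte-range mapping
# for a single array:
air_manifest = {
    "0.0": {"path": "s3://bucket/file.nc", "offset": 0, "length": 100},
    "0.1": {"path": "s3://bucket/file.nc", "offset": 100, "length": 100},
}

# A dataset of ManifestArrays serialized without data would carry both
# kinds of information at once: the per-array metadata plus each
# array's chunk manifest.
```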

@raybellwaves

Thanks for this package! Just want to add that I'm interested in this.

I started some work on kerchunk to give ZarrToZarr parity with SingleHdf5ToZarr (fsspec/kerchunk#442), and I thought this package could help in the interim.

@TomNicholas added the references generation Reading byte ranges from archival files label Apr 17, 2024