Virtual datasets from Zarr stores #63

Open
maxrjones opened this issue Mar 28, 2024 · 8 comments
Labels
enhancement New feature or request references generation Reading byte ranges from archival files

Comments

@maxrjones (Collaborator)

@norlandrhagen and I were discussing creating virtual datasets from Zarr stores earlier today (a placeholder already exists in _automatically_determine_filetype). @TomNicholas, what are your thoughts on trying out Kerchunk's single_zarr for this purpose? I think this could be a helpful step towards virtual concatenation of Zarr stores, and towards allowing manifests to replace consolidated metadata for V3.

@TomNicholas added the enhancement New feature or request label Mar 28, 2024
@TomNicholas (Owner)

I think this is a good idea, and one I already had a mental issue for anyway 😅 One neat thing this would allow is testing writing to Zarr stores as manifests by round-tripping.

Kerchunk's single_zarr

We could use this, and it would certainly be the quickest way, but perhaps we would be better off just writing the function ourselves? Otherwise we would be starting with a Zarr store, opening it using a dependency (kerchunk and fsspec), getting the opened results back as a kerchunk reference dict, and then immediately converting that reference dict to ManifestArrays. Maybe we should just skip kerchunk, read the zarr json manually, and create the ManifestArrays immediately.

I suspect also that if we do it that way (the "zarr-native" way), we might later find that either we can import and use code from zarr-python, or zarr-python can take inspiration from code we write here.
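A minimal sketch of this "zarr-native" direction, assuming a local Zarr v2 array store laid out as a `.zarray` document plus dot-delimited chunk files (the function name and manifest-dict shape here are hypothetical, not an existing API; a real implementation would also need to handle nested chunk layouts, remote stores, and V3 `zarr.json`):

```python
import json
import tempfile
from pathlib import Path

def read_zarr_v2_array_manifest(store_path):
    """Parse a local Zarr v2 array's .zarray metadata and list its chunk
    files, building a manifest dict of chunk key -> byte range (a rough
    stand-in for constructing a ManifestArray)."""
    store = Path(store_path)
    zarray = json.loads((store / ".zarray").read_text())
    manifest = {}
    for entry in store.iterdir():
        # Skip metadata documents (.zarray, .zattrs); everything else is a
        # chunk file named by its dot-delimited index, e.g. "0" or "0.0".
        if entry.name.startswith("."):
            continue
        manifest[entry.name] = {
            "path": str(entry),
            "offset": 0,  # each v2 chunk file is one whole compressed chunk
            "length": entry.stat().st_size,
        }
    return zarray, manifest

# Demo against a tiny hand-written v2 array store:
store = Path(tempfile.mkdtemp())
(store / ".zarray").write_text(json.dumps(
    {"zarr_format": 2, "shape": [4], "chunks": [2], "dtype": "<i4"}))
(store / "0").write_bytes(b"\x00" * 8)
(store / "1").write_bytes(b"\x00" * 8)
metadata, manifest = read_zarr_v2_array_manifest(store)
```

Note that the manifest comes from listing the store's keys, not from the `.zarray` document itself, since the metadata alone does not record which chunks exist.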

@TomNicholas (Owner) commented Mar 28, 2024

Actually, wait: I think I misunderstood what you were suggesting, @maxrjones. There are two types of Zarr stores we could read byte ranges from:

  1. Zarr v2/v3 stores which have chunks saved as compressed files (i.e. normal Zarr stores). These we could and should read using kerchunk's single_zarr as you suggested (although maybe we could change the implementation in the future).

  2. Zarr stores containing manifest.json files. This is what my comment above was referring to. This would be the inverse operation to vds.virtualize.to_zarr().
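For case (2), the read direction would be parsing a manifest.json back into per-chunk path/offset/length mappings. A small sketch of that inverse operation, with the entry format assumed (one {"path", "offset", "length"} record per chunk key; the function name is hypothetical):

```python
import json

def manifest_to_arrays(manifest_json: str):
    """Parse a manifest.json string back into parallel chunk-key ->
    path/offset/length mappings, the inverse direction to writing the
    manifest out in the first place."""
    entries = json.loads(manifest_json)
    paths = {key: entry["path"] for key, entry in entries.items()}
    offsets = {key: entry["offset"] for key, entry in entries.items()}
    lengths = {key: entry["length"] for key, entry in entries.items()}
    return paths, offsets, lengths

# Example manifest with two chunks pointing into one archival file:
example = json.dumps({
    "0.0": {"path": "s3://bucket/data.nc", "offset": 100, "length": 200},
    "0.1": {"path": "s3://bucket/data.nc", "offset": 300, "length": 200},
})
paths, offsets, lengths = manifest_to_arrays(example)
```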

@maxrjones (Collaborator, Author) commented Mar 28, 2024

Maybe we should just skip kerchunk and read the zarr json manually and create the ManifestArrays immediately.

When you say "read the zarr json manually" do you mean using zarr-python? I'm not sure how just loading the .zarray (V2) or zarr.json (V3) would work because it doesn't tell you which chunks are initialized.

@maxrjones (Collaborator, Author) commented Mar 28, 2024

Oops, didn't notice your comment before posting my last response.

Zarr v2/v3 stores which have chunks saved as compressed files (i.e. normal zarr stores). These we could and should read using kerchunk's single_zarr as you suggested (although maybe we could change the implementation in the future.)

I was referring to this case, which seems simpler to implement right now. Although (2) is also important.

@TomNicholas (Owner)

I was referring to this case, which seems simpler to implement right now.

Yep! If you want to make a PR for case (1) then go for it!

Although (2) is also important.

Yeah, but maybe I'll just add that in as part of #45.

@jhamman (Collaborator) commented Mar 28, 2024

allowing manifests to replace consolidated metadata for V3.

@maxrjones - can you expand on this? I see these as distinct features. Consolidated metadata rolls the group/array docs up to a single json, whereas the manifests concept covers the key mappings for individual arrays.

@maxrjones (Collaborator, Author)

allowing manifests to replace consolidated metadata for V3.

@maxrjones - can you expand on this? I see these as distinct features. Consolidated metadata rolls the group/array docs up to a single json, whereas the manifests concept covers the key mappings for individual arrays.

In my mind, a dataset containing multiple manifests serves the same purpose as consolidated metadata, with the added bonus of including the key mappings for individual arrays; dataset.to_dict(data=False) would then be a version of the manifests that accomplishes the same thing as consolidated metadata and could be used for V3. But I see now that you're talking about manifests only as the virtual representation of a single array.
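To make the distinction concrete, here is a rough sketch of the two concepts being contrasted (all layouts and values here are hypothetical illustrations, not either library's actual serialization format):

```python
# Consolidated metadata rolls every group/array metadata document in a
# store up into one JSON mapping (v2-style keys shown for brevity):
consolidated_metadata = {
    ".zgroup": {"zarr_format": 2},
    "air/.zarray": {"shape": [10, 10], "chunks": [5, 5], "dtype": "<f8"},
    "air/.zattrs": {"units": "K"},
}

# A manifest, by contrast, covers the chunk-key -> byte-range mapping
# for a single array:
air_manifest = {
    "0.0": {"path": "s3://bucket/file.nc", "offset": 0, "length": 100},
    "0.1": {"path": "s3://bucket/file.nc", "offset": 100, "length": 100},
}

# A dataset of ManifestArrays serialized without data would carry both
# kinds of information at once: the per-array metadata plus each
# array's chunk manifest.
```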

@raybellwaves

Thanks for this package! Just want to add that I'm interested in this.

I started some work on kerchunk to give ZarrToZarr parity with SingleHdf5ToZarr (fsspec/kerchunk#442), and I thought this package could help in the interim.

@TomNicholas added the references generation Reading byte ranges from archival files label Apr 17, 2024