Manifest storage transformer #287

Open
jhamman opened this issue Feb 7, 2024 · 39 comments

Comments

@jhamman
Member

jhamman commented Feb 7, 2024

This issue describes a concept for a Zarr v3 Storage Transformer to enable generic indirection between the Zarr keys and the names of the underlying objects in a store. It is not a new idea (see below), but this design is meant to cover a broader set of use cases.

Goals

Design

There has been a lot written on this subject already (see issues linked above) so I'm going to attempt to jump straight into the design. The key difference between this design and prior proposals is that the manifest will be local to the Array. The reason for this is to increase the scalability, portability, and composability of the manifest concept.

Store layout

The manifest store layout will resemble that of a regular Zarr V3 store. Consider the following directory store representation:

a/zarr.json  <- group metadata
a/foo/zarr.json  <- array metadata
a/foo/manifest.json <- array manifest 
...
b/baz/zarr.json <- array metadata
b/baz/c/1/1 <- "regular" chunk
...

Note: array a/foo is a manifest array but array b/baz is a regular zarr array.

Array metadata

Manifest style arrays will need to declare a storage transformer configuration:

{
  "node_type": "array",
  ...
  "storage_transformers": [
    {
      "name": "chunk-manifest-json",
      "configuration": {
        "manifest": "./manifest.json"
      }
    }
  ]
}

Note: small manifests could also be inlined directly into the array metadata object.
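As a rough illustration of how a reader might detect a manifest array and locate its manifest, here is a minimal sketch. It assumes the array metadata has already been parsed into a dict and that the "manifest" value is a key relative to the array's own prefix. `find_manifest_key` is a hypothetical helper, not part of any Zarr implementation.

```python
import posixpath
from typing import Optional

# Transformer name taken from the example metadata above.
TRANSFORMER_NAME = "chunk-manifest-json"


def find_manifest_key(array_prefix: str, array_metadata: dict) -> Optional[str]:
    """Return the store key of the manifest, or None for a regular array."""
    for transformer in array_metadata.get("storage_transformers", []):
        if transformer.get("name") == TRANSFORMER_NAME:
            relative = transformer["configuration"]["manifest"]
            # Resolve "./manifest.json" against the array's own prefix.
            return posixpath.normpath(posixpath.join(array_prefix, relative))
    return None


metadata = {
    "node_type": "array",
    "storage_transformers": [
        {
            "name": "chunk-manifest-json",
            "configuration": {"manifest": "./manifest.json"},
        }
    ],
}

print(find_manifest_key("a/foo", metadata))
print(find_manifest_key("b/baz", {"node_type": "array"}))
```

With the layout shown earlier, `a/foo` resolves to `a/foo/manifest.json`, while `b/baz` has no transformer and is treated as a regular array.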

Manifest object

In my example above, the array a/foo includes a manifest object (a/foo/manifest.json) which will store the mapping of chunk keys to keys in the store:

{
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100}
}

path would be the only required key. offset, length, checksum, etc. could all be added as optional keys to a) inform the store how to fetch bytes from the chunk or b) provide the store with additional metadata about the chunk.

Note 1: Kerchunk also supports inline data in place of the path. That could also be supported here.
Note 2: I'm using JSON as a manifest type here, but many other options exist, including Parquet or even Zarr arrays.
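To make the lookup concrete, here is a minimal sketch of how a store could resolve a chunk key through the manifest and slice the referenced bytes. The `FAKE_OBJECTS` dict is a stand-in for remote storage and `read_chunk` is a hypothetical helper; a real store would issue a ranged read against the object instead.

```python
import json
from typing import Optional

MANIFEST_JSON = """
{
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100}
}
"""

# Stand-in for remote storage: object URL -> bytes (768 bytes of test data).
FAKE_OBJECTS = {"s3://bucket/foo.nc": bytes(range(256)) * 3}


def read_chunk(manifest: dict, chunk_key: str) -> Optional[bytes]:
    """Return the bytes for chunk_key, or None if the manifest has no entry."""
    entry = manifest.get(chunk_key)
    if entry is None:
        return None
    data = FAKE_OBJECTS[entry["path"]]
    offset = entry.get("offset", 0)   # optional key: default to the object start
    length = entry.get("length")      # optional key: default to the object end
    end = len(data) if length is None else offset + length
    return data[offset:end]


manifest = json.loads(MANIFEST_JSON)
chunk = read_chunk(manifest, "0.0.1")
print(len(chunk), chunk[0])
```

Because only `path` is required, the sketch treats a missing `offset` or `length` as "read the whole object".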

Concatenating arrays:

Edit: Feb 6 7:20p PT - After thinking about this more, I'm beginning to think serialization of concatenated arrays is a trickier problem than should be addressed in the initial iteration here. The main tricky bit is how to combine arrays with compatible dtypes/shapes/chunks but with differing codecs. Details from my original ideas below but consider this redacted from the proposal for now.

Details

One of the goals above is to enable concatenating multiple Zarr arrays. The manifest approach supports a zero-copy way to achieve this. The concept here closely resembles the approach from [Kerchunk's MultiZarrToZarr](https://fsspec.github.io/kerchunk/tutorial.html#combine-multiple-kerchunked-datasets-into-a-single-logical-aggregate-dataset), except that it targets individual arrays and could be made to work with any zarr array (not just Kerchunk references). The idea is that concatenating arrays can be done in Zarr, provided a set of constraints are met, by simply rewriting the keys. Implementations could provide an API for doing this concatenation like:

arr_a: zarr.Array = zarr.open(store_a, path='foo')  # shape=(10, 4, 5), chunks=(2, 4, 5)
arr_b: zarr.Array = zarr.open(store_b, path='bar')  # shape=(6, 4, 5), chunks=(2, 4, 5)
arr_ab: zarr.Array = zarr.concatenate([arr_a, arr_b], axis=0, store=store_c)  # shape=(16, 4, 5), chunks=(2, 4, 5)

In this example, zarr.concatenate would act similarly to numpy.concatenate, returning a new zarr.Array object after creating the new manifest in store_c. This could also be done in two steps by adding a save_manifest method to the Zarr arrays.
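The key rewriting behind such a concatenation can be sketched on plain manifest dicts. This is only an illustration under stated assumptions: chunk keys use "." as the separator (as in the manifest example), all inputs share dtype, chunk shape, and codecs, and every array except the last holds a whole number of chunks along the concatenation axis. `concatenate_manifests` and the `a.nc` / `b.nc` paths are hypothetical.

```python
from typing import Dict, List


def concatenate_manifests(
    manifests: List[Dict[str, dict]],
    chunks_along_axis: List[int],
    axis: int,
) -> Dict[str, dict]:
    """Merge manifests by shifting chunk indices along `axis`."""
    combined: Dict[str, dict] = {}
    shift = 0
    for manifest, n_chunks in zip(manifests, chunks_along_axis):
        for key, entry in manifest.items():
            index = [int(part) for part in key.split(".")]
            index[axis] += shift  # move this array past the previous ones
            combined[".".join(str(i) for i in index)] = entry
        shift += n_chunks
    return combined


# shape=(10, 4, 5), chunks=(2, 4, 5) -> 5 chunks along axis 0
manifest_a = {
    f"{i}.0.0": {"path": "s3://bucket/a.nc", "offset": 100 * i, "length": 100}
    for i in range(5)
}
# shape=(6, 4, 5), chunks=(2, 4, 5) -> 3 chunks along axis 0
manifest_b = {
    f"{i}.0.0": {"path": "s3://bucket/b.nc", "offset": 100 * i, "length": 100}
    for i in range(3)
}

combined = concatenate_manifests([manifest_a, manifest_b], [5, 3], axis=0)
print(sorted(combined))
```

No chunk bytes are copied; only the keys change, which is why the approach requires identical codecs across the inputs.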

Possible extensions

I've tried very hard to keep the scope of this as small as possible. There are currently few v3 storage transformers to emulate so I think the best next step is to try out this simple approach before spending too much time on a spec or elaborating on future options. That said, there are some obvious ways to extend this:

  1. Supporting writes to manifest arrays (possible, but there are many edge cases to consider)
  2. Enable content addressable storage by hashing keys during writes
  3. Support non-JSON manifests (many options)

Props

🙌 to those that have done a great job pushing this subject forward already: @martindurant, @alimanfoo, @rabernat among others.

@jhamman jhamman changed the title from "Manifest storage transformer (v3 storage transformer)" to "Manifest storage transformer" Feb 7, 2024
@rabernat
Contributor

rabernat commented Feb 7, 2024

In this proposal, what type of thing is arr_ab?

@jhamman
Member Author

jhamman commented Feb 7, 2024

In this proposal, what type of thing is arr_ab?

All three are zarr.Arrays. I'll add some clarification.

@rabernat
Contributor

rabernat commented Feb 7, 2024

Can we write to zarr_ab? zarr_ab[0, 0] = 1?

@rabernat
Contributor

rabernat commented Feb 7, 2024

after creating the new manifest in store_c

I think we should seriously consider a much lighter-weight concatenation method. What about just storing references to store_a and store_b, rather than duplicating the whole manifest? Basically how ncml works.

The advantages of this are that

  • It doesn't require a chunk manifest. It works with vanilla Zarr arrays.
  • It allows concatenation of arrays with different codecs and chunk sizes
  • For arrays with manifests, it doesn't require duplicating all of the references

The metadata doc would somehow contain pointers to the other metadata docs. Something like

"concatenation": {
    "axis": 0,
    "arrays": ["../foo", "../bar"]
}

The one part I can't quite see is how to do the references to the arrays. Some sort of URL syntax? Absolute vs. relative paths?
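Setting aside how the array references are spelled, the read path for such a reference-based concatenation is simple. A minimal sketch, assuming the reader already knows the length of each referenced array along the concatenation axis (10 and 6 in the earlier example); `locate` is a hypothetical helper:

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


def locate(index: int, lengths: List[int]) -> Tuple[int, int]:
    """Map an index along the concatenation axis to (array number, local index)."""
    total = sum(lengths)
    if not 0 <= index < total:
        raise IndexError(f"index {index} out of range for length {total}")
    ends = list(accumulate(lengths))    # cumulative end of each array
    which = bisect_right(ends, index)   # first array whose end is > index
    start = ends[which] - lengths[which]
    return which, index - start


concatenation = {"axis": 0, "arrays": ["../foo", "../bar"]}
lengths = [10, 6]  # sizes of ../foo and ../bar along axis 0

for i in (0, 9, 10, 15):
    which, local = locate(i, lengths)
    print(i, "->", concatenation["arrays"][which], local)
```

Each source array keeps its own codecs and chunk sizes, since the lookup happens at the element-index level rather than the chunk-key level.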

@rabernat
Contributor

rabernat commented Feb 7, 2024

Another way of putting it is that I think perhaps "chunk manifest" and "virtual concatenation of Zarr arrays" should be completely separable and orthogonal features.

@martindurant
Member

Note that the kerchunk method and its child here already allow for content-addressable storage, e.g., IPFS. Not sure if you meant something beyond that. There has been chatter elsewhere of chunk checksums and such (stored in metadata, not the bytes of the chunk).

For the concatenation, I would want special attention paid to the multi-dimension case. Also, some consideration of groups-of-arrays which are concatenated together would be nice, but you might say that this is an xarray concern. Are you at all considering that the array chunk grid not aligning with the chunks?

Do I understand that you imagine an output metadata structure of the main "these are the arrays" and then a JSON for each of the target arrays? Or do you end up concatenating the reference lists somewhere along the way?

One important possible extension to consider along with those given - after a prototype is established - is that we now have a way to pass per-chunk information (analogous to the "context" I fought for), and so can have different behaviours for each chunk, like a different zero point in offset-scale filtering.

@jhamman
Member Author

jhamman commented Feb 7, 2024

Another way of putting it is that I think perhaps "chunk manifest" and "virtual concatenation of Zarr arrays" should be completely separable and orthogonal features.

I've come around on this, but not for exactly the same reason. I've now redacted my original proposal, which was not 100% thought through.


Note that the kerchunk method and its child here already allow for content-addressable storage, e.g., IPFS. Not sure if you meant something beyond that.

Certainly some parallels here but this could be done without IPFS. @alimanfoo's proposal in #82 is still a good read, despite using some now-outdated vernacular.

For the concatenation, I would want special attention paid to the multi-dimension case. Also, some consideration of groups-of-arrays which are concatenated together would be nice, but you might say that this is an xarray concern. Are you at all considering that the array chunk grid not aligning with the chunks?

Again, I'm going to remove this from the proposal. But I'll just say that there are some parallels with @d-v-b's proposal to "fix zarr-python's slicing" (zarr-developers/zarr-python#1603, zarr-developers/zarr-python#980) - namely the creation of a lazy Zarr Array or ArrayView that wraps one or more Zarr arrays. If we take serialization off the table for now, we can think of this outside the spec conversation and explore how to address this at the implementation level.

Do I understand that you imagine an output metadata structure of the main "these are the arrays" and then a JSON for each of the target arrays? Or do you end up concatenating the reference lists somewhere along the way?

I was thinking of concatenating the references but have walked this back because you have to enforce that all array metadata is equivalent (e.g. codecs) for all concatenated arrays. @rabernat is suggesting another approach which could work to resolve those concerns.

@jbms
Contributor

jbms commented Feb 7, 2024

This is very similar to the kerchunk Reference File System format but is not exactly the same JSON format:
https://fsspec.github.io/kerchunk/spec.html

There are also at least a few implementations of the kerchunk json format outside of kerchunk itself:

Would it be advantageous to use exactly the same format?

@martindurant
Member

a few implementations of the kerchunk json format outside of kerchunk

Can you please put references? They might be useful for inspiration.

@jbms
Contributor

jbms commented Feb 7, 2024

a few implementations of the kerchunk json format outside of kerchunk

Can you please put references? They might be useful for inspiration.

I updated my comment to include one other known implementation.

@jbms
Contributor

jbms commented Feb 7, 2024

@martindurant Is there a document that describes the kerchunk parquet format?

@martindurant
Member

No, but I could make one.

@jbms
Contributor

jbms commented Feb 7, 2024

While we can all assume what s3:// means, in order for this to be fully specified, we also need to specify the meaning of the URLs. See zarr-developers/zeps#48 for one proposal regarding URLs I created, but something more limited could also suffice.

@jbms
Contributor

jbms commented Feb 7, 2024

Another issue to consider is the Confused deputy problem: user A might think they are writing to "s3://someone-elses-bucket/path" but actually end up writing with user A's credentials to "s3://user-a-private-bucket/other/path". Similarly, user A may think they are exposing "s3://someone-elses-bucket/path" over an HTTP server but actually end up sharing data from "s3://user-a-private-bucket/other/path" or "file:///etc/passwd".
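One possible mitigation is for the reader to check every manifest target against an allow-list before using its own credentials. The following is a minimal sketch of that idea, not a complete defence; the allow-list contents and `rejected_keys` are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical policy: the only (scheme, bucket) pairs this reader may touch.
ALLOWED_LOCATIONS = {("s3", "someone-elses-bucket")}


def rejected_keys(manifest: dict, allowed=ALLOWED_LOCATIONS) -> list:
    """Return the chunk keys whose target lies outside the allow-list."""
    rejected = []
    for key, entry in manifest.items():
        parts = urlsplit(entry["path"])
        if (parts.scheme, parts.netloc) not in allowed:
            rejected.append(key)
    return rejected


manifest = {
    "0.0.0": {"path": "s3://someone-elses-bucket/path"},
    "0.0.1": {"path": "s3://user-a-private-bucket/other/path"},
    "0.1.0": {"path": "file:///etc/passwd"},
}

print(rejected_keys(manifest))
```

A reader that refuses to open the array when this list is non-empty avoids silently redirecting reads or writes to locations the user never intended.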

@jbms
Contributor

jbms commented Feb 7, 2024

No, but I could make one.

I think that would be very helpful.

@jhamman
Member Author

jhamman commented Feb 7, 2024

@jbms - I have a few answers to your question of "why not use the kerchunk format":

  • Kerchunk represents the entire store as a single manifest, my position is that splitting manifests into separate arrays will have significant benefits
  • Kerchunk's JSON schema has some idiosyncrasies that make it difficult to use as a generic manifest - entries in the manifest are either a str or List[str, int, int]. The JSON schema described above would be more extensible to future metadata (e.g. optional checksums).

@rabernat - missed your first comment:

Can we write to zarr_ab? zarr_ab[0, 0] = 1?

Perhaps! I have not covered this use case yet above but it could be possible. It would be tricky to update the manifest in a consistent way across multiple updates. I suggest we treat arrays with manifest storage transformers as read-only for this initial conversation.

@martindurant
Member

Kerchunk represents the entire store as ...

kerchunk is amenable to change :). Especially if it can also maintain compatibility.

@jbms
Contributor

jbms commented Feb 7, 2024

@jbms - I have a few answers to your question of "why not use the kerchunk format":

  • Kerchunk represents the entire store as a single manifest, my position is that splitting manifests into separate arrays will have significant benefits

I can see that there are advantages to splitting but I think that is mostly orthogonal to the issue of the metadata format.

  • Kerchunk's JSON schema has some idiosyncrasies that make it difficult to use as a generic manifest - entries in the manifest are either a str or List[str, int, int]. The JSON schema described above would be more extensible to future metadata (e.g. optional checksums).

Yes, there are some idiosyncrasies, and I suppose kerchunk also assumes URLs are fsspec-compatible. Still, given that this is designed to address essentially the same thing as kerchunk, I think it would be desirable to avoid fragmentation if possible, particularly since there is mention of not just a JSON format but also a parquet format, which kerchunk also has. Maybe Martin is open to evolving the format used by kerchunk? On the other hand, given the nature of these manifest formats, it is relatively easy to support multiple formats, since you can just convert one to the other when you load it.
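To illustrate how cheap that conversion is, here is a minimal sketch that maps kerchunk-style entries (a str for inline data, or a list holding a URL, offset, and length) to the dict-per-chunk layout proposed in this issue and back. The "data" field for inline content is a hypothetical name, and the one-element `[url]` form for whole-object references is included as an assumption about kerchunk's format.

```python
from typing import Dict, Union

KerchunkRef = Union[str, list]


def from_kerchunk(refs: Dict[str, KerchunkRef]) -> Dict[str, dict]:
    """Convert kerchunk-style references to the dict-per-chunk layout."""
    manifest = {}
    for key, ref in refs.items():
        if isinstance(ref, str):
            manifest[key] = {"data": ref}  # inline data; hypothetical field name
        elif len(ref) == 1:
            manifest[key] = {"path": ref[0]}  # whole-object reference
        elif len(ref) == 3:
            url, offset, length = ref
            manifest[key] = {"path": url, "offset": offset, "length": length}
        else:
            raise ValueError(f"unrecognised reference for {key!r}: {ref!r}")
    return manifest


def to_kerchunk(manifest: Dict[str, dict]) -> Dict[str, KerchunkRef]:
    """Convert back; extra fields such as checksums are dropped."""
    refs: Dict[str, KerchunkRef] = {}
    for key, entry in manifest.items():
        if "data" in entry:
            refs[key] = entry["data"]
        elif "offset" in entry and "length" in entry:
            refs[key] = [entry["path"], entry["offset"], entry["length"]]
        else:
            refs[key] = [entry["path"]]
    return refs


refs = {
    "0.0.0": ["s3://bucket/foo.nc", 100, 100],
    "0.0.1": ["s3://bucket/whole-object"],
    "0.1.0": "inline-bytes",
}
manifest = from_kerchunk(refs)
print(manifest["0.0.0"])
```

The round trip is lossless only for the fields kerchunk can express, which is the extensibility trade-off raised above.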

@martindurant
Member

Martin is open to evolving the format used by kerchunk?

Yes, of course: we want everything to work well together. In the current design, I suppose it's already possible to "concatenate" a kerchunk-zarr with a normal zarr. (actually, kerchunk can also reference a zarr, so something like this was already possible on v2)

@martindurant
Member

Also worth pointing out that kerchunk's current implementation has some specific v2 stuff in it, so something will have to change for v3 no matter what.

@jbms
Contributor

jbms commented Feb 7, 2024

As I see it, this "manifest" format could be used as a key-value store adapter independent of zarr entirely, as a transparent layer below zarr that is not explicitly indicated in the zarr metadata (i.e. as kerchunk is currently used), or as a storage transformer explicitly indicated in the zarr metadata.

Re concatenation: as has been discussed, I think that is not an especially practical use case, even with variable-size chunks. Instead, we could discuss a solution for that independently, e.g. an explicit "concatenation" / "stack" extension for zarr. See the support in tensorstore for constructing virtual stacked/concatenated views (https://google.github.io/tensorstore/driver/stack/index.html).

@jbms
Contributor

jbms commented Feb 7, 2024

One thing that would likely be important for concatenation is the ability to specify "cropping" and other coordinate transforms -- for that the "index transform" concept in tensorstore may be relevant to consider: https://google.github.io/tensorstore/index_space.html#index-transform

@jhamman
Member Author

jhamman commented Feb 13, 2024

I realized my last answer may have unintentionally come off as critical of the Kerchunk project. Apologies if it came across that way. Kerchunk (@martindurant) has done us all a great service by showing us what is possible here. My point above was really trying to look forward and mesh the ideas Kerchunk has introduced with the Zarr storage transformer framework while, at the same time, opening some doors for additional extensions beyond those of the Kerchunk project.

Based on @martindurant's comments, it sounds like there is plenty of room to work together on what could be a new spec-compliant storage layout for Kerchunk.

@martindurant
Member

I realized my last answer may have unintentionally come off as critical of the Kerchunk project.

Not at all, that's why we have these conversations. We already have redundant code for "view set of datasets" from xarray and dask, which have particular views on what arrays are and how they work.

I will say, though, that kerchunk aims to work beyond the netCDF model alone (xr trees to start, but more complex zarr group trees too) and even beyond zarr (e.g., from the simplest, supermassive compressed CSV with embedded quoted fields, to making parquet directory hierarchies and assembling feather 2 files from buffers). Whether those ideas are worth pursuing remains to be seen, but I expect there will always be some bespoke combine logic in the kerchunk repo.

it sounds like there is plenty of room to work together on what could be a new spec-compliant storage layout for Kerchunk.

Yes, from the combine user API to reference storage formats and more.

@maxrjones

@jhamman what is the motivation for requiring the path key? We've run into a lot of issues related to determining whether a chunk is missing because it is entirely composed of the fill_value or because something went wrong during data production. Allowing all keys for a given chunk reference to be absent could provide a nice intermediate solution, in that chunks could be explicitly defined as empty in the manifest but implicitly missing in the zarr store for space savings on sparse arrays. The space savings in the manifest itself seem minimal relative to the convenience of identifying and verifying missing chunks, but I'm curious what factors I might be missing for this decision.

{
    "0.0.0":  {"path": "s3://bucket/foo.zarr/precipitation/0.0.0"},
    "0.0.1":  {"path": "s3://bucket/foo.zarr/precipitation/0.0.1"},
    "0.1.0":  {}, 
    "0.1.1":  {"path": "s3://bucket/foo.zarr/precipitation/0.1.1"}
}
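With empty entries allowed, a validation pass can distinguish three cases per chunk. A minimal sketch, assuming "." separated chunk keys and a known chunk grid shape; `audit` and `ChunkStatus` are hypothetical names, and the manifest below deliberately omits "0.1.1" to show the third case.

```python
from enum import Enum
from itertools import product
from typing import Dict, Tuple


class ChunkStatus(Enum):
    REFERENCED = "referenced"              # entry points at stored bytes
    EXPLICITLY_EMPTY = "explicitly empty"  # entry is {}: chunk is all fill_value
    NOT_LISTED = "not listed"              # no entry: possibly a production error


def audit(manifest: Dict[str, dict], grid_shape: Tuple[int, ...]) -> Dict[str, ChunkStatus]:
    """Classify every chunk key in the grid against the manifest."""
    report = {}
    for index in product(*(range(n) for n in grid_shape)):
        key = ".".join(str(i) for i in index)
        if key not in manifest:
            report[key] = ChunkStatus.NOT_LISTED
        elif not manifest[key]:
            report[key] = ChunkStatus.EXPLICITLY_EMPTY
        else:
            report[key] = ChunkStatus.REFERENCED
    return report


manifest = {
    "0.0.0": {"path": "s3://bucket/foo.zarr/precipitation/0.0.0"},
    "0.0.1": {"path": "s3://bucket/foo.zarr/precipitation/0.0.1"},
    "0.1.0": {},
}

for key, status in audit(manifest, grid_shape=(1, 2, 2)).items():
    print(key, status.value)
```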

@TomNicholas

TomNicholas commented Mar 27, 2024

Allowing all keys for a given chunk reference to be absent

@maxrjones FYI see TomNicholas/VirtualiZarr#33 (comment) for a related discussion about the same issue but for the in-memory ChunkManifest.

@thewtex

thewtex commented Mar 28, 2024

Hey, throwing out another "manifestation" 😆 of this Manifest Storage Transformer idea. It is essentially what @jhamman has proposed. Please ignore the naming because it was created before what is now known as Zarr Sharding. "Manifest Storage Transformer" is a great name. Another one could be "Composite Store".

There is a JSON manifest of other stores and their associated path for a group or array dimension. The configuration / schema of that manifest is ad-hoc based on the zarr-python store construction, but it would be better to standardize on something like what @jbms proposed in ZEP 8.

What is neat is that it demonstrates how simple an implementation can be and that it can also be reasonably performant. It uses python dictionaries / hash maps for fast look-up, but it should be easily adapted to other languages.

@jbms
Contributor

jbms commented Mar 29, 2024

@thewtex If I understand correctly, you are proposing that the "manifest", in addition to mapping individual keys to URLs, could also map key prefixes (or more generally, arbitrary key ranges) to URL prefixes.

I would definitely support that addition.

By defining it in terms of arbitrary key prefixes / key ranges, it doesn't need to be specific to zarr at all.

@thewtex

thewtex commented Mar 29, 2024

@jbms yes, you are right. Simplicity could be helpful and powerful here.

From my perspective, there are three big wins:

  1. The ability to scale to extremely large aggregate stores (many environments have 32 GB, etc. limits).
  2. The ability to transform components. In the test implementation there is a map_shards feature. From the perspective of creating zarrs, this supports the workflow: 1) write part of the dataset in parallel to a local directory store. 2) do some conditioning on that store, like re-chunk, re-encoding, transforming into a zip store, etc. 3) migrate from local to remote storage.
  3. Support content-addressed storage like IPFS, where the Merkle tree can be broken out into multiple smaller Merkle trees and this higher-level manifest.

@TomNicholas

If I understand correctly, you are proposing that the "manifest", in addition to mapping individual keys to URLs, could also map key prefixes (or more generally, arbitrary key ranges) to URL prefixes.

I'm not sure I understand what this means. Can someone give a concrete example?


@jhamman How hard would it be to support appending to one dimension of a chunk manifest? People are asking for that feature in VirtualiZarr (TomNicholas/VirtualiZarr#21), and I could imagine a neat interface like xarray's ds.to_zarr(store, append_dim=...), where ds contains ManifestArray objects. But I'm not sure if trying to overwrite the manifest.json after it's been written might create consistency issues...? I guess maybe it's not that different to the overwriting of zarr array metadata that must already happen in to_zarr when appending?

@jbms
Contributor

jbms commented May 3, 2024

If I understand correctly, you are proposing that the "manifest", in addition to mapping individual keys to URLs, could also map key prefixes (or more generally, arbitrary key ranges) to URL prefixes.

I'm not sure I understand what this means. Can someone give a concrete example?

I'm not sure what syntax would be preferred, but let's say instead of using a JSON object we use a JSON array, e.g. the following representation for your initial example:

[
   {"key": "0.0.0", "path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
   {"key": "0.0.1", "path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
   {"key": "0.1.0", "path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
   {"key": "0.1.1", "path": "s3://bucket/foo.nc", "offset": 400, "length": 100}
]

Then we could support "prefix" in place of "key" to map an entire prefix:

[
   {"key": "0.0.0", "path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
   {"key": "0.0.1", "path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
   {"prefix": "0.1.", "path": "s3://bucket/bar."}
]

This would map "0.0.0" and "0.0.1" as before, but "0.1.0" would map to "s3://bucket/bar.0" and "0.1.1" would map to "s3://bucket/bar.1". It would not be permitted to specify an offset or length when specifying prefix. You could also point to a secondary manifest JSON file (assuming we have a way to specify that in URL syntax):

[
   {"key": "0.0.0", "path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
   {"key": "0.0.1", "path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
   {"prefix": "0.1.", "path": "s3://bucket/bar/manifest.json|zarr_chunk_manifest:"}
]

This would map "0.1.0" to "s3://bucket/bar/manifest.json|zarr_chunk_manifest:0", which would then get resolved by querying "0" within the manifest at "s3://bucket/bar/manifest.json".

Since the array representation creates the possibility of conflicts between keys and prefixes, we can say that later entries always take precedence:

[
   {"prefix": "", "path": "s3://bucket/baz/"},
   {"key": "0.0.0", "path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
   {"key": "0.0.1", "path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
   {"prefix": "0.1.", "path": "s3://bucket/bar/manifest.json|zarr_chunk_manifest:"}
]

The initial "prefix": "" entry effectively defines a default mapping that is used for any key not covered by another entry.
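The "later entries take precedence" rule can be sketched with a reverse linear scan over the entries. This is an illustration only; `resolve` is a hypothetical helper, and it returns the secondary-manifest URL as-is rather than dereferencing it.

```python
from typing import List, Optional

ENTRIES = [
    {"prefix": "", "path": "s3://bucket/baz/"},
    {"key": "0.0.0", "path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    {"key": "0.0.1", "path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    {"prefix": "0.1.", "path": "s3://bucket/bar/manifest.json|zarr_chunk_manifest:"},
]


def resolve(entries: List[dict], key: str) -> Optional[dict]:
    """Return the reference for `key`, letting later entries win."""
    for entry in reversed(entries):
        if "key" in entry:
            if entry["key"] == key:
                # Exact match: keep path plus any offset/length fields.
                return {k: v for k, v in entry.items() if k != "key"}
        elif key.startswith(entry["prefix"]):
            # Prefix match: append the remainder of the key to the path.
            return {"path": entry["path"] + key[len(entry["prefix"]):]}
    return None


for key in ("0.0.0", "0.1.0", "1.0.0"):
    print(key, "->", resolve(ENTRIES, key))
```

Here "1.0.0" falls through to the empty-prefix default, while "0.1.0" is caught by the more specific prefix that appears later in the list.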

Slightly more general than prefixes is to allow arbitrary lexicographical ranges:

[
   {"key": "0.0.0", "path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
   {"key": "0.0.1", "path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
   {"min": "0.1", "max": "0.9", "strip_prefix": 2, "path": "s3://bucket/bar/"}
]

Any key k that satisfies (according to lexicographical order) "0.1" <= k < "0.9" would be covered by the final entry, e.g. this would map "0.2" to "s3://bucket/bar/2". The "strip_prefix" value is the number of characters to strip from the beginning of the original key before appending to the "path", and must be less than or equal to the number of characters that are common to all keys in the range.

Note that any prefix entry could be represented as a range entry:

A prefix of "0.1." is equivalent to {"min": "0.1.", "max": "0.1/", "strip_prefix": 4}. The reason that the max is "0.1/" is because "/" is the Unicode (and ascii) character that follows ".".
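The prefix-to-range equivalence can be checked with a short sketch. It relies on Python comparing strings by code point, which matches the lexicographical order described above; `prefix_to_range` and `resolve_range` are hypothetical helpers, and the sketch ignores the corner case of a prefix ending in the largest code point.

```python
from typing import Optional


def prefix_to_range(prefix: str, path: str) -> dict:
    """Rewrite a prefix entry as an equivalent half-open range entry."""
    if not prefix:
        raise ValueError("the empty prefix has no finite upper bound")
    # Bump the last character to the next code point: "." -> "/".
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return {"min": prefix, "max": upper, "strip_prefix": len(prefix), "path": path}


def resolve_range(entry: dict, key: str) -> Optional[str]:
    """Return the target for `key` if it lies in [min, max), else None."""
    if entry["min"] <= key < entry["max"]:
        return entry["path"] + key[entry["strip_prefix"]:]
    return None


as_range = prefix_to_range("0.1.", "s3://bucket/bar.")
print(as_range)
print(resolve_range(as_range, "0.1.0"))

explicit = {"min": "0.1", "max": "0.9", "strip_prefix": 2, "path": "s3://bucket/bar/"}
print(resolve_range(explicit, "0.2"))
```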

In practice I would expect implementations would handle this by converting the mappings to a sorted list of disjoint keys/ranges. Then key lookup can be done with a binary search.
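A minimal sketch of that lookup, assuming the entries have already been normalized into disjoint half-open ranges. A single key k is stored here as the range [k, k + "\x00"), since k + "\x00" is the smallest string greater than k; offsets and lengths are left out to keep the example short. `RangeTable` is a hypothetical class.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (min_key, max_key, strip_prefix, path)
Range = Tuple[str, str, int, str]


class RangeTable:
    """Sorted, disjoint key ranges with binary-search lookup."""

    def __init__(self, ranges: List[Range]):
        self._ranges = sorted(ranges)  # sorts by min_key first
        self._mins = [r[0] for r in self._ranges]

    def lookup(self, key: str) -> Optional[str]:
        # Find the last range whose min_key is <= key.
        i = bisect_right(self._mins, key) - 1
        if i < 0:
            return None
        _, max_key, strip, path = self._ranges[i]
        if key < max_key:
            return path + key[strip:]
        return None


table = RangeTable(
    [
        ("0.1", "0.9", 2, "s3://bucket/bar/"),
        ("0.0.0", "0.0.0\x00", 5, "s3://bucket/foo.nc"),  # exact key
    ]
)

for key in ("0.0.0", "0.2", "0.0.5", "0"):
    print(repr(key), "->", table.lookup(key))
```

Each lookup costs O(log n) comparisons, regardless of how many prefix or range entries the manifest contains.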

@TomNicholas

Thank you @jbms! I now am following the conversation again 😅

So whilst that prefix stuff is cool, I do wonder if the additional complexity needs to be considered. The simple path, offset, length triplet is nice because you can think of it as 3 arrays (or one array containing 3-tuple entries). This property is useful both for in-memory storage (e.g. as a structured numpy array like I suggested here, and started working on here) and on-disk storage (e.g. as 3 zarr arrays like @rabernat suggested here). Adding a prefix breaks this property because you no longer have the same set of fields for every entry in the manifest.
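The "three arrays" view can be sketched without any dependencies. The class below is a hypothetical stand-in for a structured numpy array or three zarr arrays: it keeps one column per field and addresses rows by chunk grid position in row-major order. An empty path marks a chunk with no reference, which is an assumption of this sketch.

```python
from array import array
from typing import Tuple


class ColumnarManifest:
    """path/offset/length stored as three parallel columns."""

    def __init__(self, grid_shape: Tuple[int, ...]):
        self.grid_shape = tuple(grid_shape)
        n = 1
        for size in self.grid_shape:
            n *= size
        self.paths = [""] * n
        self.offsets = array("Q", [0] * n)  # unsigned 64-bit integers
        self.lengths = array("Q", [0] * n)

    def _row(self, chunk_index: Tuple[int, ...]) -> int:
        # Row-major (C order) flattening of the chunk grid index.
        row = 0
        for i, size in zip(chunk_index, self.grid_shape):
            if not 0 <= i < size:
                raise IndexError(f"chunk index {chunk_index} outside grid")
            row = row * size + i
        return row

    def set(self, chunk_index, path: str, offset: int, length: int) -> None:
        row = self._row(chunk_index)
        self.paths[row] = path
        self.offsets[row] = offset
        self.lengths[row] = length

    def get(self, chunk_index) -> Tuple[str, int, int]:
        row = self._row(chunk_index)
        return self.paths[row], self.offsets[row], self.lengths[row]


m = ColumnarManifest(grid_shape=(1, 2, 2))
m.set((0, 1, 0), "s3://bucket/foo.nc", 300, 100)
print(m.get((0, 1, 0)))
print(m.get((0, 0, 0)))
```

Every row has the same three fields, which is exactly the property that prefix or range entries would break.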

Also it seems to me that the main use case of the prefix idea is to allow part of one chunk manifest to point to a different zarr array (which might have its own manifest). This seems like a feature that could alternatively be thought of as "virtual concatenation" of existing zarr arrays, and implemented in #288 instead?

@martindurant
Member

I completely agree with @TomNicholas .

  • kerchunk already has columnar storage (parquet) and has shown that limiting the schema is very helpful and efficient
  • why would you mix references and concatenation in a single layer like this? If you have virtual concatenation, then that can be one spec, and the references another:
# 0.0.manifest
[
   {"key": "0.0.0", "path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
   {"key": "0.0.1", "path": "s3://bucket/foo.nc", "offset": 200, "length": 100}
]
# 0.virt-manifest
[
   {"prefix": "0.0.", "path": "...0.0.manifest"},
   {"prefix": "0.1.", "path": "...0.1.manifest"}
]
  • what if you have ranges? 0.0.0-0.0.10 is one set but 0.0.10-0.0.20 is another?

@jbms
Contributor

jbms commented May 4, 2024

I have mostly been considering this JSON format in the context of general use, e.g. more as a generic key-value store adapter like the kerchunk reference filesystem, or like a zarr group storage transformer, rather than just a zarr array storage transformer in particular, because the proposed JSON representation really wasn't specific to zarr arrays at all.

The prefix and lexicographical range mapping I mentioned would indeed be more useful for non-array uses, e.g. it would allow you to compose a "virtual group" from arrays or groups located in different places. For arrays one potential use would be to define a default prefix mapping (empty prefix) and then override a small number of individual chunks, as a sort of "patch" for an existing array. Other uses for non-empty prefix mappings for arrays would potentially be to override various sub-regions of the array, but indeed that would probably be better represented via the virtual concatenation proposal because doing it at the key level would be rather awkward.

As far as representing the manifest as a single "structured" array or 3 arrays --- are we talking about the on-disk format (i.e. something entirely different from the proposed JSON format), or are we talking about an in-memory representation only?

For the on-disk format, if the mapping is not expected to be sparse, and the total number of chunks is very large, then using a chunked representation in the form of a zarr array to represent the mapping could make sense (where each element of this mapping zarr array corresponds to a chunk within the logical zarr array), and indeed prefix or range mapping doesn't fit into that representation at all. Potentially this could also be viewed as a type of "array -> bytes" codec rather than a storage transformer, but I'm not sure whether that is ultimately better.

I agree that a columnar representation, where within each chunk of the mapping array, the urls, offsets, and lengths are compressed independently, would be very helpful in that case. However, it would be unfortunate if the urls, offsets, and lengths for a given chunk are actually stored separately, because that would mean you need to do 3 reads instead of 1 in order to load the mappings for a given chunk, and there would be little reason to want to access the fields separately.

Even if we consider the more general mapping case (i.e. like kerchunk reference filesystem or a zarr group storage transformer), I agree that a columnar storage format (e.g. perhaps parquet) would be advantageous, though for small mappings there are advantages to JSON. I think that prefix and lexicographical range mappings can fit pretty easily into such a format, though. For example, alongside the path you could have 5 columns: min_key, max_key, strip_prefix_length, offset, length, where for individual key mappings we use (min_key, offset, length) and for key range mappings we use (min_key, max_key, strip_prefix_length). Parquet, for example, can represent missing fields pretty efficiently, and assuming you have normalized all of the key ranges to be disjoint and ordered the entries by min_key, the column indexes supported by parquet would enable efficient lookups.

@TomNicholas

The prefix and lexicographical range mapping I mentioned would indeed be more useful for non-array uses ...

I'm still not seeing what the use case for prefixes is that couldn't be supported through redirection via chunk manifests + virtual concatenation.

As far as representing the manifest as a single "structured" array or 3 arrays --- are we talking about the on-disk format (i.e. something entirely different from the proposed JSON format), or are we talking about an in-memory representation only?

I was talking about both, and linked to two separate issues in the VirtualiZarr repo, one where I discuss the in-memory case, and one where Ryan suggests the on-disk case. The on-disk case is more worthy of discussion here I think - the in-memory case just happens to be convenient in python.

However, it would be unfortunate if the urls, offsets, and lengths for a given chunk are actually stored separately, because that would mean you need to do 3 reads instead of 1 in order to load the mappings for a given chunk, and there would be little reason to want to access the fields separately.

What on-disk data types does zarr v3 support? Seems like that page of the spec has not been written yet. I ask because in numpy there is an in-memory datatype that contains the url, offset, and length all in one, and if we could save that data type to disk we would not need 3 reads.

@jbms
Contributor

jbms commented May 6, 2024

The prefix and lexicographical range mapping I mentioned would indeed be more useful for non-array uses ...

I'm still not seeing what the use case for prefixes is that couldn't be supported through redirection via chunk manifests + virtual concatenation.

For redirecting an entire array, for example, using a chunk manifest means that you have to fetch the list of all of the chunks, which while perhaps desirable in some cases in other cases would be unnecessary and expensive. Additionally, using a chunk manifest means that the linked array must be immutable --- support for writing is lost, and also any further changes to the linked array will, in general, break the manifest. For redirecting an entire group, this issue applies to an even greater extent.

As far as representing the manifest as a single "structured" array or 3 arrays --- are we talking about the on-disk format (i.e. something entirely different from the proposed JSON format), or are we talking about an in-memory representation only?

I was talking about both, and linked to two separate issues in the VirtualiZarr repo, one where I discuss the in-memory case, and one where Ryan suggests the on-disk case. The on-disk case is more worthy of discussion here I think - the in-memory case just happens to be convenient in python.

However, it would be unfortunate if the urls, offsets, and lengths for a given chunk are actually stored separately, because that would mean you need to do 3 reads instead of 1 in order to load the mappings for a given chunk, and there would be little reason to want to access the fields separately.

What on-disk data types does zarr v3 support? Seems like that page of the spec has not been written yet. I ask because in numpy there is an in-memory datatype that contains the url, offset, and length all in one, and if we could save that data type to disk we would not need 3 reads.

The data types are described here: https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#data-types

In zarr v2, "structured data types" equivalent to numpy structured data types are supported, but they are not part of the zarr v3 spec.

Note that numpy structured data types interleave the fields (which may or may not be good for in-memory representation, but usually is not good for on-disk representation), and numpy 2 does not currently allow StringDType in a structured dtype.

@martindurant
Member

For redirecting an entire array, for example, using a chunk manifest means that you have to fetch the list of all of the chunks

kerchunk parquet is chunked and supports fast random access. It is also efficient both on-disc and in memory.

Additionally, using a chunk manifest means that the linked array must be immutable --- support for writing is lost

kerchunk parquet supports writing reference sets back to the original or to a new location. You can update one chunk without altering the rest. (There are no locks on this process for multiple writers, but there could be)

"structured data types" equivalent to numpy structured data types are supported ... numpy structured data types interleave the fields

This is a terrible idea for storage - there is a reason that parquet has won so comprehensively for bulk tabular data. For numpy-style specifically, only fixed-length string fields can be stored anyway, for which you had better have very similar values and know the max length ahead of time.
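To make the fixed-length and interleaving points concrete, here is a rough stdlib-only sketch. It uses `struct` to imitate a fixed-width record (similar in spirit to a numpy structured dtype, but not numpy itself) and compares it with a column-per-field layout. The entries and URLs are made up for illustration.

```python
import struct

# Hypothetical manifest entries: (url, offset, length).
entries = [
    ("s3://bucket/a.nc", 0, 100),
    ("s3://bucket/a-much-longer-object-name.nc", 100, 100),
    ("s3://bucket/b.nc", 200, 100),
]
n = len(entries)

# Fixed-width record: the url field must be sized for the longest value up front.
max_len = max(len(url.encode()) for url, _, _ in entries)
record = struct.Struct(f"<{max_len}sQQ")

# Interleaved layout: url, offset, length repeat record by record.
interleaved = b"".join(
    record.pack(url.encode(), off, ln) for url, off, ln in entries
)

# Columnar layout: each field stored contiguously, urls at their natural length.
urls = "\n".join(url for url, _, _ in entries).encode()
offsets = struct.pack(f"<{n}Q", *(off for _, off, _ in entries))
lengths = struct.pack(f"<{n}Q", *(ln for _, _, ln in entries))

# Bytes spent purely on padding short urls up to max_len.
padding = sum(max_len - len(url.encode()) for url, _, _ in entries)
print(len(interleaved), len(urls) + len(offsets) + len(lengths), padding)

# Reading only the offsets from the interleaved form means striding over
# every record; in the columnar form it is one contiguous unpack.
strided = [record.unpack_from(interleaved, i * record.size)[1] for i in range(n)]
contiguous = list(struct.unpack(f"<{n}Q", offsets))
print(strided == contiguous)
```

The padding grows with the spread in URL lengths, which is why the values need to be similar for the fixed-width form to be reasonable.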

@jbms
Contributor

jbms commented May 6, 2024

For redirecting an entire array, for example, using a chunk manifest means that you have to fetch the list of all of the chunks

kerchunk parquet is chunked and supports fast random access. It is also efficient both on-disc and in memory.

I haven't looked at it in detail, but I would indeed be inclined to think that parquet is a good choice for this use case. I think you could also add support for prefix and/or key-range mappings to the kerchunk parquet format pretty easily.

For the specific case where a prefix mapping might be used, i.e. map "myarray1/" -> "s3://myarray1-bucket/" and "myarray2/" -> "s3://myarray2-bucket/", the prefix mapping will in general be much more efficient than even the most efficient non-prefix map, since it is constant space. The exception is that listing would almost surely be faster with an explicit manifest.

Additionally, using a chunk manifest means that the linked array must be immutable --- support for writing is lost

kerchunk parquet supports writing reference sets back to the original or to a new location. You can update one chunk without altering the rest. (There are no locks on this process for multiple writers, but there could be)

Basically a prefix map is analogous to a symlink to a directory, while an explicit chunk manifest (i.e. no prefix maps) would be analogous to a directory of symlinks to files. Both have uses, some use cases might be well served by either representation, and certain use cases will favor one representation over the other. With a prefix map, you don't need to perform an "indexing" step to generate the explicit manifest in the first place, and you will automatically pick up any new files that are added to the source location.
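A toy sketch of that symlink analogy, with a Python set standing in for the object store listing. The bucket contents and the `resolve_prefix` and `build_manifest` helpers are hypothetical and only meant to show the trade-off, not any existing API.

```python
# Hypothetical listing of the objects currently present at the source location.
source_objects = {
    "s3://myarray1-bucket/zarr.json",
    "s3://myarray1-bucket/c/0/0",
}

# Prefix map: one entry regardless of chunk count ("symlink to a directory").
prefix_map = {"myarray1/": "s3://myarray1-bucket/"}


def resolve_prefix(key):
    """Rewrite key through the first matching prefix, without listing anything."""
    for prefix, target in prefix_map.items():
        if key.startswith(prefix):
            return target + key[len(prefix):]
    return None


def build_manifest(objects, prefix, target):
    """Indexing step: one explicit entry per object ("directory of symlinks")."""
    return {
        prefix + url[len(target):]: url
        for url in objects
        if url.startswith(target)
    }


manifest = build_manifest(source_objects, "myarray1/", "s3://myarray1-bucket/")

# A new chunk is written at the source after the manifest was built.
source_objects.add("s3://myarray1-bucket/c/0/1")

print(resolve_prefix("myarray1/c/0/1"))  # resolves with no re-indexing
print(manifest.get("myarray1/c/0/1"))    # None until the manifest is rebuilt
print(sorted(manifest))                  # listing answered from the manifest alone
```

The last line is the flip side: the explicit manifest can answer a listing without touching the source, which the prefix map cannot.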

To me, prefix and key-range maps seem pretty powerful since you can combine them with any other adapters (zip files, another layer of chunk manifest, etc.) supported by the URL syntax.

However, I can understand that they may not be helpful for the use cases you may be thinking of, like representing an hdf5 array as a zarr array.

It might be reasonable to exclude support for prefix/range maps from an initial version of this chunk manifest format, but it might be helpful to design the format with the possibility of adding that later.

"structured data types" equivalent to numpy structured data types are supported ... numpy structured data types interleave the fields

This is a terrible idea for storage - there is a reason that parquet has won so comprehensively for bulk tabular data. For numpy-style specifically, only fixed-length string fields can be stored anyway, for which you had better have very similar values and know the max length ahead of time.

@martindurant
Member

I think I am arguing that prefix maps with links alone are fine, and that this is how concat/merge can work (as in virtualizarr); all-reference manifests like kerchunk already uses are fine too; but I would not mix the two.

It's worth pointing out that the kerchunk spec allows for templating URLs, but the feature wasn't much used. In fact, string compression is good enough that a column of strings sharing a small number of prefixes compresses to nearly the same size as the same paths without the prefixes. In the example below, the prefix adds 170,000 raw bytes but only about 9,000 compressed bytes.

import uuid
import cramjam

pref = "s3://bucket/path"
# 10,000 random 4-character names, then the same names under a shared prefix
paths1 = [uuid.uuid4().hex[:4] for _ in range(10000)]
paths2 = [f"{pref}/{p}" for p in paths1]

# raw size vs. zstd (level 9) size in bytes; compressed sizes vary a little per run
len("".join(paths1)) # 40000
len(cramjam.zstd.compress("".join(paths1).encode(), 9)) # 20257
len("".join(paths2)) # 210000
len(cramjam.zstd.compress("".join(paths2).encode(), 9)) # 29134
