Should we keep implicit groups in Zarr V3? #291

Open
d-v-b opened this issue Apr 12, 2024 · 5 comments

Comments

@d-v-b
Contributor

d-v-b commented Apr 12, 2024

The v3 spec permits the existence of Zarr groups without any distinguishing metadata.

In the section comparing v3 with v2, the spec states

v3 allows for greater flexibility in how groups and arrays are created. In particular, v3 supports implicit groups, which are groups that do not have a metadata document but whose existence is implied by descendant nodes. This change enables multiple arrays to be created in parallel without generating race conditions for the metadata when creating parent groups.

So the argument here is that we want to avoid race conditions when creating arrays in parallel. Is this a serious problem for anyone? Personally, I was not aware that parallel hierarchy mutation was a design goal of Zarr. I always thought that the only parallelism guarantees were for separate array chunks; since creating nodes in the hierarchy is so simple (just write a JSON document), there shouldn't be a motivation for parallelizing this process, at least that's how it seems to me.

Later, there is a section comparing explicit and implicit groups, which states

This specification defines both implicit and explicit groups, but implementations may create an explicit group for all implicit groups they encounter, in particular when using a hierarchical storage.

Erasure of an implicit group may automatically erase any empty parent. For example on a S3 store where the namespace is flat, erasure of the last key with a prefix will erase all implicit groups in the prefix.

Care must be taken when erasing an array or a group if the parent needs to be converted into an explicit group.

A race-condition arises if a client writes an array at path P, and another client concurrently assumes P is an implicit group and writes subgroups or arrays into it. Implementations can avoid this race condition by exclusively using explicit groups.

So here we learn that implicit groups actually introduce a new type of race condition, because they make the structure of a Zarr hierarchy ambiguous, and there's a suggestion that implementations modify the Zarr hierarchies they encounter by inserting explicit groups wherever implicit groups are detected. I don't think this is great. First, we have traded the race condition that motivated implicit groups for another one, so the net reduction in race conditions is zero. Second, we are encouraging implementations to mutate the hierarchies they encounter, perhaps as an admission that implicit groups might be a bit of a headache in practice.
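The race described in the quoted passage can be sketched with a plain dictionary standing in for a key-value store. This is only an illustration: the helper names `doc` and `nodes_nested_in_arrays` are hypothetical, the metadata documents are abbreviated to the `zarr_format` and `node_type` fields, and the interleaving of the two clients is simulated sequentially.

```python
import json


def doc(node_type):
    """Abbreviated v3-style metadata document (real array documents need more fields)."""
    return json.dumps({"zarr_format": 3, "node_type": node_type})


def nodes_nested_in_arrays(store):
    """Return (array_path, nested_path) pairs where a node lives underneath an array."""
    suffix = "/zarr.json"
    node_types = {
        key[: -len(suffix)]: json.loads(value)["node_type"]
        for key, value in store.items()
        if key.endswith(suffix)
    }
    arrays = [path for path, kind in node_types.items() if kind == "array"]
    return sorted(
        (array, path)
        for array in arrays
        for path in node_types
        if path.startswith(array + "/")
    )


store = {}

# Client B looks first, finds no metadata at "p", and assumes an implicit group.
b_assumes_implicit_group = "p/zarr.json" not in store

# Client A creates an array at "p" before client B has finished.
store["p/zarr.json"] = doc("array")

# Client B acts on its stale assumption and writes a child node under "p".
if b_assumes_implicit_group:
    store["p/child/zarr.json"] = doc("array")

conflicts = nodes_nested_in_arrays(store)
```

After this interleaving, `conflicts` reports a node nested inside an array, which is not a valid hierarchy. Neither client did anything wrong by its own view of the store.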

I'm honestly not sure what the advantage is of implicit groups. Here are some disadvantages, from my POV:

  • Implicit groups make the structure of the hierarchy more ambiguous. With implicit groups, two Zarr hierarchies can be "identical" yet have very different contents, because one may have explicit groups where the other has implicit groups.
  • Implicit groups make the identity of a single node ambiguous. In zarr-python, we have an API that consumes paths on a file system / object store and attempts to infer whether that path points to a Zarr array or group. With implicit groups, literally any valid path can be interpreted as a Zarr group. This means that the boundary of a zarr hierarchy is not well defined, and essentially includes the entire file system. It becomes impossible for a user to include an extra non-zarr directory inside a Zarr hierarchy. Do we want this outcome?
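A minimal sketch of the node-identity problem, assuming a flat key-value store modeled as a dictionary and the permissive reading described above, in which any path without a metadata document is treated as an implicit group. The function name `infer_node_type` and its `allow_implicit_groups` flag are hypothetical and are not the zarr-python API.

```python
import json


def infer_node_type(store, path, allow_implicit_groups):
    """Return "array", "group", or None for `path` in a flat key-value store."""
    key = f"{path}/zarr.json" if path else "zarr.json"
    if key in store:
        return json.loads(store[key])["node_type"]
    # No metadata document: under the permissive reading, the path still
    # counts as a group, so nothing can ever be classified as "not Zarr".
    return "group" if allow_implicit_groups else None


store = {
    "zarr.json": json.dumps({"zarr_format": 3, "node_type": "group"}),
    # Abbreviated array document; a real one has more required fields.
    "a/zarr.json": json.dumps({"zarr_format": 3, "node_type": "array"}),
    # A non-Zarr directory that a user placed inside the hierarchy.
    "notes/readme.txt": "not zarr data",
}

with_implicit = infer_node_type(store, "notes", allow_implicit_groups=True)
without_implicit = infer_node_type(store, "notes", allow_implicit_groups=False)
```

With implicit groups allowed, the `notes` directory is reported as a group. Without them, it is reported as not being a Zarr node at all.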

I think we should reconsider including implicit groups in the v3 spec. Removing implicit groups would simplify some matters over in the ongoing zarr-python v3 refactoring effort. The main question I have is whether there is anyone who really needs implicit groups for some reason, in which case I am curious to learn more about that use case.

@d-v-b d-v-b changed the title from "Should we keep Implicit groups in Zarr V3?" to "Should we keep implicit groups in Zarr V3?" Apr 12, 2024
@d-v-b
Contributor Author

d-v-b commented Apr 12, 2024

previous discussion: #184

@jbms
Contributor

jbms commented Apr 12, 2024

I don't actually have much experience with zarr groups --- I've always just used lone arrays which may be organized in a directory hierarchy, and neither tensorstore nor neuroglancer handles zarr groups. That said, here are my thoughts on the matter:

  • Creating arrays concurrently from multiple machines (possibly within a large hierarchy of sub-groups) is definitely an important use case --- separate arrays should be even better supported concurrently than separate chunks in a single array. With explicit groups you can run into an issue on stores like s3 (unlike gcs and azure) that don't support atomic read-modify-write: each machine needs to ensure all of the ancestor groups exist, which requires writing a zarr.json file containing just the minimal metadata, but that could result in overwriting metadata/attributes already stored there by another machine. In general though when using s3 there will be a lot of limitations on concurrent access and perhaps this is not too important of an issue.
  • It already seems to be fairly common to just stuff arbitrary extra files and directories into a zarr hierarchy, so we are already faced with the possibility that there are extra files/directories in the group that the zarr implementation does not understand. We might therefore think of a zarr group as just a directory that has some additional associated json metadata. It seems that the main thing you gain from requiring explicit groups is that you can list the contents of a group by first doing a regular directory listing, and then reading the zarr.json file nested within each subdirectory, and from this have a list of the 4 types of group "members": arrays, sub-groups, unknown other directories, unknown other files. With implicit groups, you would have to treat "unknown other directories" as "groups".
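The listing procedure in the second point can be sketched as follows, assuming a local file system and explicit groups only. The function name `list_members` and the category labels are hypothetical. The only spec details relied on are that each node's metadata lives in `zarr.json` and that the document carries a `node_type` of `"array"` or `"group"`.

```python
import json
from pathlib import Path


def list_members(group_dir):
    """Classify the direct children of a group directory into four categories."""
    members = {"arrays": [], "groups": [], "other_dirs": [], "other_files": []}
    for entry in sorted(Path(group_dir).iterdir()):
        if entry.is_file():
            # The group's own metadata document is not a member.
            if entry.name != "zarr.json":
                members["other_files"].append(entry.name)
            continue
        metadata = entry / "zarr.json"
        if metadata.is_file():
            node_type = json.loads(metadata.read_text())["node_type"]
            category = "arrays" if node_type == "array" else "groups"
            members[category].append(entry.name)
        else:
            # With implicit groups, these would have to be reported as groups.
            members["other_dirs"].append(entry.name)
    return members
```

One directory listing plus one metadata read per subdirectory is enough here. With implicit groups, the `other_dirs` category disappears, because every such directory must be treated as a group.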

@d-v-b
Contributor Author

d-v-b commented Apr 13, 2024

To your first point, it's helpful to know that concurrent array creation is a significant use pattern, and I see how implicit groups could be useful to avoid race conditions when creating arrays. But because concurrent hierarchy modification is so backend-dependent, I think our answer here should be something like "the design of the format is such that, within a group that exists on a conventional file system or object store, creating sub-arrays and sub-groups should be safe to perform independently", which is just another way of stating that the sub-arrays and sub-groups are specified by completely separate keys relative to the key of their parent group. This isn't so different from how we currently think about chunks: they are designed to be safe to write in parallel, but the details really depend on the storage backend you are on.
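A rough sketch of why the children are safe but the parent is not, using a dictionary as a stand-in for an object store that offers only blind writes. The helpers `group_doc` and `create_child_blind` are hypothetical, the array document is abbreviated, and the concurrent machines are simulated one after another.

```python
import json


def group_doc(attributes=None):
    """Minimal v3-style group metadata document (hypothetical helper)."""
    return json.dumps(
        {"zarr_format": 3, "node_type": "group", "attributes": attributes or {}}
    )


def create_child_blind(store, parent, name):
    """Create a child under `parent`, blindly rewriting the parent's metadata first."""
    # Shared key: every writer touches it, so the last writer wins.
    store[f"{parent}/zarr.json"] = group_doc()
    # Separate key per child: writers never collide here.
    # Abbreviated document; a real array document has more required fields.
    store[f"{parent}/{name}/zarr.json"] = json.dumps(
        {"zarr_format": 3, "node_type": "array"}
    )


store = {}
# Machine 1 creates the group and stores attributes on it.
store["exp/zarr.json"] = group_doc({"owner": "machine-1"})
# Machines 2 and 3 each make sure the parent exists, then add their own array.
create_child_blind(store, "exp", "a")
create_child_blind(store, "exp", "b")

parent_attributes = json.loads(store["exp/zarr.json"])["attributes"]
```

Both children survive because their keys are distinct, while the attributes written by machine 1 are lost. If the parent group already exists, writers can skip the parent write entirely, which is the situation the proposed wording covers.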

And I completely agree with your second point. An important detail I just thought of: the spec already defines a third type of directory: the directories containing chunk keys. As long as we include implicit groups, then chunk key directories are locally indistinguishable from implicit groups. So any Zarr client, when attempting to classify a directory that doesn't contain zarr.json, must climb the directory tree until it finds either a containing group or a containing array. This is super clunky, and it's a regression from v2. I really think we should fix this.
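The climb described above might look like the following sketch, again with a dictionary standing in for the store. The function name `classify_without_metadata` and its return labels are hypothetical. The example keys assume chunk keys of the form `c/0/0` under the array's path.

```python
import json


def classify_without_metadata(store, path):
    """Classify a directory lacking zarr.json by finding its nearest ancestor node."""
    parts = path.strip("/").split("/")
    # Walk from the immediate parent up to the root (the empty prefix).
    for depth in range(len(parts) - 1, -1, -1):
        prefix = "/".join(parts[:depth])
        key = f"{prefix}/zarr.json" if prefix else "zarr.json"
        if key in store:
            node_type = json.loads(store[key])["node_type"]
            # Inside an array: this is part of a chunk key. Inside a group:
            # this can only be an implicit group.
            return "chunk directory" if node_type == "array" else "implicit group"
    return "undetermined"


def doc(node_type):
    """Abbreviated v3-style metadata document (hypothetical helper)."""
    return json.dumps({"zarr_format": 3, "node_type": node_type})


store = {
    "zarr.json": doc("group"),
    "arr/zarr.json": doc("array"),
    "arr/c/0/0": b"",                      # a chunk of "arr"
    "grp/zarr.json": doc("group"),
    "grp/sub/leaf/zarr.json": doc("array"),  # "grp/sub" has no metadata of its own
}

kind_of_arr_c = classify_without_metadata(store, "arr/c")
kind_of_grp_sub = classify_without_metadata(store, "grp/sub")
```

Neither `arr/c` nor `grp/sub` contains a `zarr.json`, so the two can only be told apart by reading metadata further up the tree. On an object store, each step of the climb is another request.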

@jhamman
Member

jhamman commented Apr 16, 2024

I don't have a ton more to add to this discussion but just want to give a +1 to the idea of removing implicit groups from the spec. From an implementation perspective, they are a total pain.

To @jbms's first point, the fact that neither tensorstore nor neuroglancer cares about groups (implicit or otherwise) at all indicates to me that there is some value in a directory-type structure of Zarr arrays apart from the Group abstraction. This seems fine, and if you aren't reaching for groups today, then you can continue with the array-only pattern in the absence of implicit groups.

@jhamman
Member

jhamman commented Apr 19, 2024

Perhaps it would be good to get the input from the rest of the @zarr-developers/implementation-council here. I'm curious how other implementations are handling implicit groups at this time.


3 participants