Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

V3: node_type considered harmful #269

Open
DennisHeimbigner opened this issue Oct 26, 2023 · 11 comments
Open

V3: node_type considered harmful #269

DennisHeimbigner opened this issue Oct 26, 2023 · 11 comments

Comments

@DennisHeimbigner
Copy link

I believe that putting the node type (group or array) into the zarr.info object is not a good idea.
In Zarr v2, the node type was encoded in the object name: .zgroup or .zarray. Hence it is possible
to identify all the groups and all the arrays without having to read any object.
With Zarr V3, you can only identify the groups and arrays by reading each occurrence of the zarr.info object.
This is potentially a significant performance hit.
I would propose a change to something like "zgroup.json" and "zarray.json".

@rabernat
Copy link
Contributor

Dennis, it's great to finally have your feedback. This issue was discussed extensively nearly a year ago, see #177
There was a strong rationale for doing it this way (avoiding a race condition with two concurrent writers trying to create a group and array of the same name). But there were also many (including myself), who favored your position.

You're very active today with many great suggestions. Unfortunately, the window for major changes to the V3 spec has passed. On May 8, @WardF voted yes on ZEP-1 on behalf of Unidata: #227 (comment).

Now that you are actually trying to implement Zarr V3, you are finding many problems with the spec. I think this is very natural and to be expected. I feel we got the sequencing wrong by first writing this massive spec without really trying to implement it. But if we stick to the process we have defined for ourselves, the comment period is over.

I'd welcome thoughts on how to move forward in a productive way.

@jbms
Copy link
Contributor

jbms commented Oct 26, 2023

A key point that came up in the discussions was that typically when listing the contents of a zarr hierarchy, you will most likely want more information than just whether it is a group or an array, such as the shapes and data types of any arrays, and would need to read the metadata anyway.

@DennisHeimbigner
Copy link
Author

"When listing the contents of a zarr hierachy..." Agreed,
But one use case we have is lazy loading of arrays. In that case,
we need to locate where the arrays are located and then read
the zarr.info only when the array is accessed.

@DennisHeimbigner
Copy link
Author

I am in the process of implementing V3, and unfortunately, the only way I have of really understanding
the spec and its consequences is through an implementation. Just the way I work.
As for being too late, surely there is room for changes. I see a number of PRs (even referenced in the V3 spec)
proposing significant changes: e.g. moving attributes to a separate object.

@jbms
Copy link
Contributor

jbms commented Oct 26, 2023

"When listing the contents of a zarr hierachy..." Agreed, But one use case we have is lazy loading of arrays. In that case, we need to locate where the arrays are located and then read the zarr.info only when the array is accessed.

Can you say a bit more what "lazy loading" means? Is the issue that your API provides a list of all of the arrays within a group (but no other information)?

@DennisHeimbigner
Copy link
Author

Yes. Also with groups. We also do lazy attribute loading which is why I support storing attributes as a separate object.

@jbms
Copy link
Contributor

jbms commented Oct 26, 2023

Could you add a new API function that returns the list of all members, sub-groups and arrays, and recommend that users call that instead for greater efficiency when using zarr v3?

@DennisHeimbigner
Copy link
Author

Not sure I follow. There seems -- to the client -- not a lot of difference in calling
two functions, one for groups, one for variables versus calling a single function.
Where does your proposed efficiency come from with respect to V3?

@jbms
Copy link
Contributor

jbms commented Oct 26, 2023

If there is one API function that lists both arrays and groups, and just returns a flat list of names without indicating which are arrays and which are groups, then you don't need to read the zarr.json files to handle calls to that API function.

@DennisHeimbigner
Copy link
Author

I see. Our API does differentiate. And this is important because it is possible to walk the group tree
without having to read any objects (at least in NCZarr).

@rabernat
Copy link
Contributor

As for being too late, surely there is room for changes.

Yes, I am very optimistic that we can incorporate your suggestions and input. FWIW, I think it would be better to evolve the spec incrementally towards a "final" version as we work on implementing it, rather than deciding on the full spec up front and approving it through a big vote. The way STAC has developed their spec seems to be more effective.

And this is important because it is possible to walk the group tree without having to read any objects

I see this as a very important goal. The algorithm for walking a hierarchy is considerably more complex if you can't identify whether a node is a group or array based on the node name. As some point, the decision was made that the cost of reading the metadata files was not significant. However, I'd like to see some quantitative analysis to confirm this for realistic datasets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants