V3: node_type considered harmful #269

DennisHeimbigner · 2023-10-26T18:48:25Z

I believe that putting the node type (group or array) into the zarr.info object is not a good idea.
In Zarr v2, the node type was encoded in the object name: .zgroup or .zarray. Hence it is possible
to identify all the groups and all the arrays without having to read any object.
With Zarr V3, you can only identify the groups and arrays by reading each occurrence of the zarr.info object.
This is potentially a significant performance hit.
I would propose a change to something like "zgroup.json" and "zarray.json".

rabernat · 2023-10-26T19:25:28Z

Dennis, it's great to finally have your feedback. This issue was discussed extensively nearly a year ago, see #177
There was a strong rationale for doing it this way (avoiding a race condition with two concurrent writers trying to create a group and array of the same name). But there were also many (including myself), who favored your position.

You're very active today with many great suggestions. Unfortunately, the window for major changes to the V3 spec has passed. On May 8, @WardF voted yes on ZEP-1 on behalf of Unidata: #227 (comment).

Now that you are actually trying to implement Zarr V3, you are finding many problems with the spec. I think this is very natural and to be expected. I feel we got the sequencing wrong by first writing this massive spec without really trying to implement it. But if we stick to the process we have defined for ourselves, the comment period is over.

I'd welcome thoughts on how to move forward in a productive way.

jbms · 2023-10-26T20:09:18Z

A key point that came up in the discussions was that typically when listing the contents of a zarr hierarchy, you will most likely want more information than just whether it is a group or an array, such as the shapes and data types of any arrays, and would need to read the metadata anyway.

DennisHeimbigner · 2023-10-26T21:03:54Z

"When listing the contents of a zarr hierachy..." Agreed,
But one use case we have is lazy loading of arrays. In that case,
we need to locate where the arrays are located and then read
the zarr.info only when the array is accessed.

DennisHeimbigner · 2023-10-26T21:09:16Z

I am in the process of implementing V3, and unfortunately, the only way I have of really understanding
the spec and its consequences is through an implementation. Just the way I work.
As for being too late, surely there is room for changes. I see a number of PRs (even referenced in the V3 spec)
proposing significant changes: e.g. moving attributes to a separate object.

jbms · 2023-10-26T21:37:27Z

"When listing the contents of a zarr hierachy..." Agreed, But one use case we have is lazy loading of arrays. In that case, we need to locate where the arrays are located and then read the zarr.info only when the array is accessed.

Can you say a bit more what "lazy loading" means? Is the issue that your API provides a list of all of the arrays within a group (but no other information)?

DennisHeimbigner · 2023-10-26T21:42:40Z

Yes. Also with groups. We also do lazy attribute loading which is why I support storing attributes as a separate object.

jbms · 2023-10-26T22:15:35Z

Could you add a new API function that returns the list of all members, sub-groups and arrays, and recommend that users call that instead for greater efficiency when using zarr v3?

DennisHeimbigner · 2023-10-26T22:22:39Z

Not sure I follow. There seems -- to the client -- not a lot of difference in calling
two functions, one for groups, one for variables versus calling a single function.
Where does your proposed efficiency come from with respect to V3?

jbms · 2023-10-26T22:43:39Z

If there is one API function that lists both arrays and groups, and just returns a flat list of names without indicating which are arrays and which are groups, then you don't need to read the zarr.json files to handle calls to that API function.

DennisHeimbigner · 2023-10-26T23:37:42Z

I see. Our API does differentiate. And this is important because it is possible to walk the group tree
without having to read any objects (at least in NCZarr).

rabernat · 2023-10-27T13:25:31Z

As for being too late, surely there is room for changes.

Yes, I am very optimistic that we can incorporate your suggestions and input. FWIW, I think it would be better to evolve the spec incrementally towards a "final" version as we work on implementing it, rather than deciding on the full spec up front and approving it through a big vote. The way STAC has developed their spec seems to be more effective.

And this is important because it is possible to walk the group tree without having to read any objects

I see this as a very important goal. The algorithm for walking a hierarchy is considerably more complex if you can't identify whether a node is a group or array based on the node name. As some point, the decision was made that the cost of reading the metadata files was not significant. However, I'd like to see some quantitative analysis to confirm this for realistic datasets.

MSanKeys963 mentioned this issue Jan 23, 2024

Some reflections about how to improve the ZEP process zarr-developers/zeps#55

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

V3: node_type considered harmful #269

V3: node_type considered harmful #269

DennisHeimbigner commented Oct 26, 2023

rabernat commented Oct 26, 2023

jbms commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

jbms commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

jbms commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

jbms commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

rabernat commented Oct 27, 2023

V3: node_type considered harmful #269

V3: node_type considered harmful #269

Comments

DennisHeimbigner commented Oct 26, 2023

rabernat commented Oct 26, 2023

jbms commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

jbms commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

jbms commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

jbms commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

rabernat commented Oct 27, 2023