codec specification in v3 #293

Open
d-v-b opened this issue May 7, 2024 · 8 comments

Comments

@d-v-b
Contributor

d-v-b commented May 7, 2024

I will summarize a few concerns I have about the way codecs are handled in the v3 spec, and propose some changes that I think could improve this situation.

the codec problem space

We need Zarr implementations across multiple languages to agree on standard JSON serialization for different codecs. This protects users from fragmentation, e.g. a situation where we end up with multiple flavors of JSON serialization for the same popular codec. At the same time, we want to make it easy for users to experiment with and create new codecs; this enables users to get the most from Zarr.

Also, codecs are generally useful for users outside of Zarr. There are plenty of non-Zarr use cases for compressing / rearranging array data. So I think the codec standardization should support these non-Zarr use cases.

concerns with codecs in the v3 spec

  • The v3 spec explicitly states that it does not define a list of codecs, but it does define a list of codecs. We can't have blatant contradictions in the spec, so this needs to be sorted out at a minimum, regardless of whatever decisions we make. The contradiction between the text of the spec and the codec definitions was already a source of confusion in a pull request in zarr-python.
  • Suppose we resolve the above contradiction by stating that Zarr v3 does in fact define a fixed set of codecs, which are listed in the spec. This leads to two sub-problems:
    • How does someone design and use a new codec? We cannot require PRs against the spec for every new codec. If writing a new codec started with getting a PR accepted in zarr-specs, nobody would ever write a new codec.
    • What happens if an implementation does not support a codec from the standard list? There is no enforcement mechanism for the requirement that an implementation support that fixed set, so practically the requirement is toothless, which means it cannot be a requirement. Requirements in the spec should be restricted to essential features, but supporting the Gzip compressor is simply not essential for users who don't work with Gzip-compressed data. So any list of codecs should be a recommendation, not a requirement.
  • The v3 spec states that the unique identifier for a codec must be "... a URI that dereferences to a human-readable specification of the codec".
    Software cannot check if a URI dereferences to a human-readable document. If we want Zarr v3 hierarchies to be validated by software, we must remove this requirement.
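
To make the limit concrete, here is a minimal sketch of what software can check about such an identifier. The function name and the example identifiers are hypothetical and not taken from the spec. It only tests that the string parses as an absolute URI; whether the URI resolves to a human-readable document is exactly the part no validator can decide.

```python
from urllib.parse import urlparse


def is_wellformed_uri(name: str) -> bool:
    """Return True if `name` parses as an absolute URI with a scheme and a host.

    This is a purely syntactic check. It says nothing about whether the URI
    dereferences to a human-readable codec specification.
    """
    parts = urlparse(name)
    return bool(parts.scheme) and bool(parts.netloc)


# Hypothetical codec identifiers, for illustration only.
print(is_wellformed_uri("https://example.com/codecs/my-codec"))  # True
print(is_wellformed_uri("my-codec"))  # False
```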

how to resolve these concerns

I don't think naming a closed set of "official codecs" in the spec is realistic. There is no enforcement mechanism, and ultimately users don't care if an implementation doesn't support a codec they don't use. That is, if an implementation doesn't support codec X, and none of the users of that implementation use codec X, then IMO this is fine.

To express this differently, I think the Zarr spec should not enumerate the features / behavior an implementation must have. The Zarr spec should just describe the Zarr format, and we leave it to implementations to choose how they implement that format.

Extending this logic, the Zarr format is actually agnostic with respect to particular codecs. So specific codecs should not appear in the Zarr spec! I actually think codecs should be defined entirely in another spec, and we refer to this spec in the Zarr spec, e.g. "codecs is a JSON array of JSON objects that implement the Numcodecs spec (link to the numcodecs spec)" (we can choose a different name for the codecs spec, but it shouldn't refer to zarr).
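
As a rough illustration of that separation, the sketch below checks only the shape of a codecs array and knows nothing about any particular codec. It assumes each entry is an object with a string name and an optional configuration object, as in v3 array metadata; the function name and the sample metadata are made up for this example.

```python
import json


def check_codecs_shape(codecs) -> None:
    """Raise ValueError unless `codecs` is a list of codec-like JSON objects.

    Only the structure is checked: a string "name" and, if present, an
    object-valued "configuration". No specific codec is known here.
    """
    if not isinstance(codecs, list):
        raise ValueError("codecs must be a JSON array")
    for i, codec in enumerate(codecs):
        if not isinstance(codec, dict):
            raise ValueError(f"codecs[{i}] must be a JSON object")
        if not isinstance(codec.get("name"), str):
            raise ValueError(f"codecs[{i}] needs a string 'name'")
        if not isinstance(codec.get("configuration", {}), dict):
            raise ValueError(f"codecs[{i}] 'configuration' must be a JSON object")


# Sample metadata fragment, written by hand for illustration.
metadata = json.loads("""
{"codecs": [
  {"name": "bytes", "configuration": {"endian": "little"}},
  {"name": "gzip", "configuration": {"level": 5}}
]}
""")
check_codecs_shape(metadata["codecs"])  # passes silently
```

Everything codec-specific (which names exist, which configuration keys are valid) would then live in the separate codec spec.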

Recall that in Zarr v2, codecs were basically standardized by the behavior of the numcodecs Python library, which was a stand-alone library with no Zarr dependency. I think this illustrates the right relationship between codecs and the Zarr format, but we shouldn't rely on a Python library to define a standard for a cross-language concern. Zarr v3 tries to fix the latter problem by folding codec definition inside the spec itself, but as I have argued, this introduces a different set of problems. The solution is to define codecs separately, and make the Zarr spec depend on that codec spec. The codec specification can manage a registry of codecs, etc., thereby abstracting the current behavior of numcodecs in a language-agnostic way.

Another advantage of a separate spec for codecs is that this spec could be used by any project that wants to compress arrays in a standard way. There is nothing Zarr-specific about serializing GZip parameters to JSON, so let's reflect this in the structure of the specification document.

tl;dr: I think the list of codecs in v3 is trying to solve a problem (a language-agnostic list of codecs) that we can solve in a better way: by migrating the codec specification from Zarr v3 into its own spec.

is this too much churn in the spec

I know it sucks to hear complaints about the spec after it's been finalized. Sorry. But I want zarr v3 to be really good, and I think the way we do codecs in v3 right now is very problematic; if my concerns are valid, then we owe it to users to get this resolved as soon as possible.

@LDeakin

LDeakin commented May 7, 2024

I think one of the biggest shortfalls of Zarr V2 is the lack of codec standardisation. Numcodecs has many codecs, but they are not very useful if they are unsupported by other zarr implementations and data viewers.

A zarr implementation does not need to support every codec to be conformant, but spec'ing codecs and supporting them across more than just one implementation is essential to move forward and increase adoption. What better place to put zarr codec specs than alongside the zarr spec?

We cannot require PRs against the spec for every new codec. If writing a new codec started with getting a PR accepted in zarr-specs, nobody would ever write a new codec.

A codec does not have to start with a spec, it can start with an experimental implementation. That is basically what most of the codecs in numcodecs are. Similarly, I have multiple experimental Zarr V3 codecs implemented in zarrs that I plan to put forward once the new ZEP process has been figured out.

@d-v-b
Contributor Author

d-v-b commented May 7, 2024

I think one of the biggest shortfalls of Zarr V2 is the lack of codec standardisation. Numcodecs has many codecs, but they are not very useful if they are unsupported by other zarr implementations and data viewers.

I agree with this completely. My concern here is not whether we should standardize codecs; it's whether we should standardize codecs inside the Zarr specification document, or in a separate specification document.

What better place to put zarr codec specs than alongside the zarr spec?

I think outside the Zarr spec entirely is the best place to put the codec specs. The codecs don't depend on Zarr; instead, Zarr depends on them.

A codec does not have to start with a spec, it can start with an experimental implementation.

That's a good idea, but technically your codecs cannot start with an experimental implementation. According to the text of the spec, your experimental codec is only valid when it is defined in a separate specification, and you give your codec a URI that resolves to a human-readable specification of the codec. Personally I don't think this is a reasonable requirement for experimental codecs.

@normanrz
Contributor

normanrz commented May 7, 2024

Just copying my response from the zarr-python thread here:

I think it is useful to have a minimal set of codecs that we expect any zarr impl to support (e.g. bytes, transpose, blosc). Other codecs can be optional. I think the zarr specification is actually a good place to list available codec specs.

I feel quite strongly that non-standard codecs need to be labeled as such (e.g. through URI-style naming instead of short names). Having multiple codecs (even if the encoded format is only slightly different) with the same name would be a disaster. Perhaps zarr-python should even enforce that (i.e. don't allow short names for non-standard codecs).

@d-v-b
Contributor Author

d-v-b commented May 7, 2024

@normanrz could you elaborate on these points a bit? Do you think the spec should require or merely suggest that implementations support a fixed set of codecs? If you want this to be a requirement, how would we enforce it?

Given that the spec currently requires that all codecs have a specification, how do we formally distinguish "standard" from "non-standard" codecs? What is the process for converting a "non-standard" codec to a "standard codec", or vice versa?

@normanrz
Contributor

normanrz commented May 7, 2024

Do you think the spec should require or merely suggest that implementations support a fixed set of codecs?

Some codecs are essential to how Zarr works and should be required by all implementations. Most minimally, that is the bytes codec. Other codecs are so popular and general that all implementations should implement them, e.g. blosc, transpose, gzip, zstd, sharding_indexed. Then, there might be codecs that are only relevant for a subset of the community, such as image or segmentation compression codecs. These might be optional from a Zarr pov but required by a higher-level format (e.g. OME-Zarr).

If you want this to be a requirement, how would we enforce it?

I like to think that enforcement of the Zarr spec comes through validation from multiple implementations. When opening an array or group, implementations parse the metadata and therefore implicitly or explicitly validate the metadata.
If you only ever use your data with a single implementation, you might not get that validation. But then you also might not care about the interoperability that the spec provides.
Of course, we could (and maybe should) also provide validation tools alongside the spec (e.g. json schema).
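
A minimal sketch of that kind of implicit validation, assuming a hypothetical implementation that only knows two codecs. The names SUPPORTED, UnsupportedCodecError and parse_array_metadata are invented for this example and do not come from any real library.

```python
import json

# Codecs this hypothetical implementation can decode (an assumption).
SUPPORTED = {"bytes", "gzip"}


class UnsupportedCodecError(ValueError):
    """Raised when array metadata names a codec this implementation lacks."""


def parse_array_metadata(text: str) -> dict:
    """Parse array metadata JSON and reject codecs that are not supported.

    Opening an array therefore doubles as validation: metadata that another
    implementation wrote with an unknown codec fails here, at parse time.
    """
    metadata = json.loads(text)
    for codec in metadata.get("codecs", []):
        name = codec["name"]
        if name not in SUPPORTED:
            raise UnsupportedCodecError(f"unsupported codec: {name!r}")
    return metadata


ok = parse_array_metadata('{"codecs": [{"name": "bytes"}, {"name": "gzip"}]}')
print([c["name"] for c in ok["codecs"]])  # ['bytes', 'gzip']
```

Two implementations with different SUPPORTED sets would disagree on the same metadata, which is where a shared schema or validation tool would help.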

Given that the spec currently requires that all codecs have a specification, how do we formally distinguish "standard" from "non-standard" codecs?

"Standard" codecs get a short name assigned by the Zarr spec (e.g. bytes). "Non-standard" codecs have a URI-style name (e.g. https://zarr.dev/numcodecs/lz4). That way, we minimize the risk of non-standard codecs conflicting with each other. I think we can drop the requirement that the URI points to a human-readable codec spec. A unique name should suffice for my concerns.

What is the process for converting a "non-standard" codec to a "standard codec", or vice versa?

I think we can use the ZEP process for that. Implementations that support non-standard codecs might need to support both names once a codec becomes standardized.
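
Here is a rough sketch of how an implementation might handle the two naming styles and a later promotion. The short-name set is an illustrative subset, the helper names are hypothetical, and lz4 being standardized under the short name "lz4" is purely an assumption for the example.

```python
from urllib.parse import urlparse

# Illustrative short names only; the authoritative list would come from the spec.
STANDARD_NAMES = {"bytes", "transpose", "blosc", "gzip", "zstd", "sharding_indexed"}

# Old URI-style name -> short name, filled in as codecs get standardized.
ALIASES = {}


def classify_codec_name(name: str) -> str:
    """Return "standard", "non-standard", or "invalid" for a codec name."""
    if name in STANDARD_NAMES:
        return "standard"
    parts = urlparse(name)
    if parts.scheme and parts.netloc:
        return "non-standard"  # URI-style name
    return "invalid"  # a short name the spec did not assign


def standardize(uri_name: str, short_name: str) -> None:
    """Record that a non-standard codec was promoted to a short name."""
    STANDARD_NAMES.add(short_name)
    ALIASES[uri_name] = short_name


def canonical_codec_name(name: str) -> str:
    """Map a pre-standardization URI-style name to its short name, if any."""
    return ALIASES.get(name, name)


uri = "https://zarr.dev/numcodecs/lz4"
print(classify_codec_name(uri))  # non-standard
standardize(uri, "lz4")  # hypothetical promotion
print(canonical_codec_name(uri))  # lz4
print(classify_codec_name(canonical_codec_name(uri)))  # standard
```

Rejecting unassigned short names as "invalid" is one way to enforce the labeling rule; keeping the alias table lets older data that still uses the URI-style name stay readable.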

What better place to put zarr codec specs than alongside the zarr spec?

I think outside the Zarr spec entirely is the best place to put the codec specs. The codecs don't depend on Zarr; instead, Zarr depends on them.

From a theoretical pov, I can see that splitting the codec spec from Zarr might make sense. From a practical pov, I don't see how that would make anything easier or facilitate interoperability among the Zarr impls. I think it is best to keep the codec spec in the Zarr spec.

@d-v-b
Contributor Author

d-v-b commented May 8, 2024

I think it is best to keep the codec spec in the Zarr spec.

Is the current set of codecs inside the zarr spec? I think this is actually the root of my concern.

@d-v-b
Contributor Author

d-v-b commented May 8, 2024

given that the zarr v3 spec document itself says that it doesn't define a list of codecs (and this claim is internally consistent -- that document does not in fact define a list of codecs), what spec are the codec definitions part of?

@normanrz
Contributor

normanrz commented May 8, 2024

Is the current set of codecs inside the zarr spec?

I think they are.

given that the zarr v3 spec document itself says that it doesn't define a list of codecs (and this claim is internally consistent -- that document does not in fact define a list of codecs), what spec are the codec definitions part of?

I think it is unfortunate that the paragraph you cite did not get updated during the v3 spec process (a quick git blame shows that). I agree that it is inconsistent because the spec actually lists codecs. Most implementations have implemented this list of codecs. We should certainly revise this paragraph.

3 participants