Data Type Extension: single point definition #273

Open
DennisHeimbigner opened this issue Oct 29, 2023 · 25 comments

Comments

@DennisHeimbigner

It appears that a data type extension must be re-defined at every use (specifically in each array's metadata).
It would certainly be useful if a data type extension could be defined once somewhere and used where needed.

@jbms
Contributor

jbms commented Oct 29, 2023

I'm not sure I understand what you're asking.

A data type extension just means adding support to the spec or implementation for a new data type. It is referenced in the metadata by name, just like any existing data type.

@DennisHeimbigner
Author

Sorry, let me try to be clearer.
Consider this example from the spec:

"data_type": {
    "name": "datetime",
    "configuration": {
        "unit": "ns"
    }
},

which is defined in the metadata for an array.
The question is: suppose I want to have two arrays of type "datetime".
Do I need to repeat the above declaration in each array?

@d-v-b
Contributor

d-v-b commented Oct 30, 2023

> The question is: suppose I want to have two arrays of type "datetime". Do I need to repeat the above declaration in each array?

Yes, because in principle the arrays could have different data types. How else should the data type of an array be specified, if not in the metadata for the array?

@DennisHeimbigner
Author

But does each reference to "datetime" need to include the "configuration", or can
just the name "datetime" be used, similar to how I can just say "float64"?

@jbms
Contributor

jbms commented Oct 30, 2023

datetime hasn't been standardized so it is kind of hypothetical at this point. But the intent of configuration is to allow parameterized data types without having to encode all of the parameters as a single string. Until there actually are any parameterized data types, though, you might just ignore the possibility in your implementation. That is what I've done in tensorstore.

@DennisHeimbigner
Author

The specific example -- datetime -- is irrelevant,
but I was planning on adding a type "char" as an
extension in support of NCZarr, so I have an interest
in this issue.

@jbms
Contributor

jbms commented Oct 30, 2023

Is the idea that char will be a fixed-length string, and you would then use a configuration option to indicate the length in bytes, e.g. "data_type": {"name": "char", "configuration": {"length": 10}} ?

@jbms
Contributor

jbms commented Oct 30, 2023

See also zarr-developers/zeps#47 regarding a variable-length string proposal.

@DennisHeimbigner
Author

No, char will be an 8-bit unsigned integer whose purpose is to hold a single character in some encoding,
typically ASCII, an ISO 8859 (Latin) encoding, or similar. I do not want to use "uint8" because I need to
identify the type for NCZarr.

@jbms
Contributor

jbms commented Oct 30, 2023

> No, char will be an 8-bit unsigned integer whose purpose is to hold a single character in some encoding, typically ASCII, an ISO 8859 (Latin) encoding, or similar. I do not want to use "uint8" because I need to identify the type for NCZarr.

I see, in that case you would not need any configuration options, unless you want to use a configuration option to indicate the character encoding.

@DennisHeimbigner
Author

Yes, with the option of specifying the encoding.

@jbms
Contributor

jbms commented Oct 30, 2023

I think there is in general a question as to what should go into the data type configuration and what should go into a separate attribute to indicate "units".

For example, if we are storing a temperature in degrees C, we would probably not have a separate "degrees C stored as float64" data type. Instead, we would store it with a data type of "float64" and use some other attribute to indicate that the unit is "degrees C".

For datetime, a unit rather than a separate data type would also seem to me to be cleaner, but many data storage formats, including zarr v2 as implemented by zarr-python, do have a separate data type for datetime.

@DennisHeimbigner
Author

We seem to have strayed from my original question, namely: does the full data type definition, including
"configuration", need to be included in every array in which it is used?
I gather from the above discussion that there is currently no answer to this question,
and that an answer must wait until there is an actual use case.

@jbms
Contributor

jbms commented Oct 30, 2023

If the data type has no configuration options, then it can be specified as a plain string, {"data_type": "float64", ...}. If the data type has configuration options, like {"data_type": {"name": "char", "configuration": {"encoding": "ascii"}}, ...}, then indeed those configuration options must be specified each time the data type is specified. There isn't any way to specify "default" configuration options for a data type, like saying that any time we reference char we should assume a configuration of {"encoding": "ascii"}.
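As a minimal sketch of what this means for a reader of array metadata, the two forms can be normalized into a single (name, configuration) pair. The function name `normalize_data_type` is hypothetical and not part of any Zarr library; the "char" type and its "encoding" option are the hypothetical examples from this thread, not standardized types.

```python
from typing import Any, Dict, Tuple


def normalize_data_type(data_type: Any) -> Tuple[str, Dict[str, Any]]:
    """Return (name, configuration) for either form of the "data_type" field."""
    if isinstance(data_type, str):
        # Plain-string form: the type has no configuration options.
        return data_type, {}
    if isinstance(data_type, dict) and "name" in data_type:
        # Object form: the configuration is repeated at every use of the type;
        # there is no place to declare defaults for it.
        return data_type["name"], dict(data_type.get("configuration", {}))
    raise ValueError(f"unrecognized data_type: {data_type!r}")


plain = normalize_data_type("float64")
parameterized = normalize_data_type(
    {"name": "char", "configuration": {"encoding": "ascii"}}  # hypothetical type
)
print(plain)
print(parameterized)
```

Because no defaults exist, a bare "char" and a "char" with an explicit encoding normalize to different pairs in this sketch.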

@DennisHeimbigner
Author

An alternative is to reify the type and declare it once in, say, the zarr.info of a group, and then
allow arrays to reference it by name without the configuration.
The problem with repeating the whole type declaration at every use is well known in programming
language semantics as the structural type equality problem.
Consider two type declarations in two different arrays.
Just because the declarations look identical is no reason to assume they refer to the same type.

@jbms
Contributor

jbms commented Oct 30, 2023

> An alternative is to reify the type and declare it once in, say, the zarr.info of a group, and then allow arrays to reference it by name without the configuration. The problem with repeating the whole type declaration at every use is well known in programming language semantics as the structural type equality problem. Consider two type declarations in two different arrays. Just because the declarations look identical is no reason to assume they refer to the same type.

For the cases we've discussed so far, char and datetime, I don't think nominal vs structural typing would be an issue.
However, I can see that if we had a data type representing a record type containing multiple fields, that this might be more of an issue. In some ways the idea of nominal vs structural typing is also related to the issue of units --- you could imagine storing a "units" attribute for this record-typed array that indicates some unique identifier for the record type.

In general I think nominal typing is problematic because a given program may be working with arrays from more than one group. For example, suppose you are copying an array from local disk storage to s3. If the local disk array somehow references a data type defined in its parent group, and you compare data types nominally rather than structurally, there would be no way for the array stored on s3 to specify the same data type.
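To make the distinction concrete, here is a rough sketch of the two comparison rules. Everything in it is illustrative: the `DataType` class, the `defined_at` field (standing in for "the group where the type was declared"), and the example locations are assumptions, not anything defined by the spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataType:
    name: str
    configuration: tuple = ()  # sorted (key, value) pairs
    defined_at: str = ""       # hypothetical: location of a group-level definition


def make(name, configuration=None, defined_at=""):
    """Build a DataType from a name and an optional configuration dict."""
    return DataType(name, tuple(sorted((configuration or {}).items())), defined_at)


def structurally_equal(a, b):
    # Same name and same configuration, regardless of where they were declared.
    return (a.name, a.configuration) == (b.name, b.configuration)


def nominally_equal(a, b):
    # Same name declared at the same place; identical contents are not enough.
    return (a.name, a.defined_at) == (b.name, b.defined_at)


# Hypothetical locations for the source and destination of a copy.
local = make("datetime", {"unit": "ns"}, defined_at="file:///data/g")
remote = make("datetime", {"unit": "ns"}, defined_at="s3://bucket/g")

print(structurally_equal(local, remote))
print(nominally_equal(local, remote))
```

Under the nominal rule the copied array can never carry "the same" type as its source, because the definition site necessarily differs; under the structural rule the two compare equal.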

@DennisHeimbigner
Author

Sure there is. The standard approach is to use fully qualified names (FQNs).

@jbms
Contributor

jbms commented Oct 30, 2023

> Sure there is. The standard approach is to use fully qualified names (FQNs).

Can you give an example of what you have in mind?

I think we run into a problem if the "fully-qualified name" is both the unique identifier and the location of the definition. If the "fully-qualified name" is independent of the location of the definition, or the definition is always provided inline with the fully-qualified name, then it seems fine.

@DennisHeimbigner
Author

I don't follow your last paragraph.
But I would do something like this:

  1. Suppose we have groups /g and /g/h.
  2. In /g/zarr.info, we define a type {"name": "T", "configuration": {"param1": value1, "param2": value2}}.
  3. We could have an array /g/i with {..., "data_type": "T", ...}
  4. and another array /g/h/j with {..., "data_type": "/g/T", ...}.

If there is a problem with using, e.g., /g/T, then we could use a path separator other than '/'.
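A minimal sketch of how such a lookup might work, under stated assumptions: the `GROUP_TYPES` table is a hypothetical in-memory stand-in for the per-group type declarations, an unqualified name is looked up only in the array's containing group, and a name starting with '/' is treated as an FQN relative to the root group. None of this exists in the current spec.

```python
import posixpath

# Hypothetical stand-in for the type declarations stored with each group.
GROUP_TYPES = {
    "/g": {"T": {"name": "T", "configuration": {"param1": 1, "param2": 2}}},
    "/g/h": {},
}


def resolve_data_type(array_path, data_type):
    """Resolve an array's "data_type" value to a full definition."""
    if isinstance(data_type, dict):
        return data_type  # inline definition, nothing to resolve
    qualified = data_type.startswith("/")
    if qualified:
        # FQN: "/g/T" names type T declared in group /g.
        group, type_name = posixpath.split(data_type)
    else:
        # Unqualified: look in the group that contains the array.
        group, type_name = posixpath.dirname(array_path), data_type
    declared = GROUP_TYPES.get(group, {})
    if type_name in declared:
        return declared[type_name]
    if qualified:
        raise KeyError(f"no type {type_name!r} declared in group {group!r}")
    return data_type  # assume a core type name such as "float64"


print(resolve_data_type("/g/i", "T"))
print(resolve_data_type("/g/h/j", "/g/T"))
```

In this sketch both arrays from the example resolve to the single declaration in /g, while ordinary names such as "float64" pass through unchanged.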

@jbms
Contributor

jbms commented Oct 30, 2023

> I don't follow your last paragraph. But I would do something like this:
>
>   1. Suppose we have groups /g and /g/h.
>   2. In /g/zarr.info, we define a type {"name": "T", "configuration": {"param1": value1, "param2": value2}}.
>   3. We could have an array /g/i with {..., "data_type": "T", ...}
>   4. and another array /g/h/j with {..., "data_type": "/g/T", ...}.
>
> If there is a problem with using, e.g., /g/T, then we could use a path separator other than '/'.

Can you clarify how the name works? Using the datetime example, would T be "datetime"? But we might want to define two different datetime data types, e.g. one stored as seconds and the other as milliseconds. Do we need a separate group in order to have a different datetime data type?

@jbms
Contributor

jbms commented Oct 30, 2023

Regarding the "fully-qualified" name, suppose we are working with two arrays, one stored on the local disk and one stored on s3. How do we know that "/g/T" relative to some group on our local disk is supposed to be the same as "/g/T" relative to some group on s3? Even if they have identical base data type name and configuration parameters, per the idea of nominal typing we would not want to assume that they are identical.

@DennisHeimbigner
Author

Yes, T would be e.g. datetime.
If you want two types with different precision, you would have to have two types,
datetime_sec and datetime_ms, for example. But this seems to me analogous to float32 vs float64.
As for your second comment: in standard usage, the FQN is always relative to the root group.

@DennisHeimbigner
Author

> Regarding the "fully-qualified" name, suppose we are working with two arrays, one stored on the local disk and one stored on s3. How do we know that "/g/T" relative to some group on our local disk is supposed to be the same as "/g/T" relative to some group on s3? Even if they have identical base data type name and configuration parameters, per the idea of nominal typing we would not want to assume that they are identical.

I misread this comment. My assumption has always been that the metadata for a given file was self-contained,
so it was not possible to reference things in other files. Is this incorrect? Is the scope of the metadata of a file
described somewhere?

@jbms
Contributor

jbms commented Oct 30, 2023

> > Regarding the "fully-qualified" name, suppose we are working with two arrays, one stored on the local disk and one stored on s3. How do we know that "/g/T" relative to some group on our local disk is supposed to be the same as "/g/T" relative to some group on s3? Even if they have identical base data type name and configuration parameters, per the idea of nominal typing we would not want to assume that they are identical.
>
> I misread this comment. My assumption has always been that the metadata for a given file was self-contained, so it was not possible to reference things in other files. Is this incorrect? Is the scope of the metadata of a file described somewhere?

Currently there isn't really any type of "reference" to other arrays/groups/attributes/metadata fields of any kind anywhere in the spec.

@DennisHeimbigner
Author

Interesting. That validates my belief that array declarations are intended to be self-contained.
In any case, yes, I am proposing references to other semantic objects in the metadata tree of a file.


3 participants