Data Type Extension: single point definition #273

DennisHeimbigner · 2023-10-29T21:28:44Z

It appears that a data type extension must be re-defined at every use (specifically in each array's metadata).
It would certainly be useful if a data type extension could be defined once somewhere and used where needed.

jbms · 2023-10-29T21:49:01Z

I'm not sure I understand what you're asking.

A data type extension just means adding support to the spec or implementation for a new data type. It is referenced in the metadata by name, just like any existing data type.

DennisHeimbigner · 2023-10-29T22:41:19Z

Sorry, let me try to be clearer.
Consider this example from the spec:

"data_type": {
    "name": "datetime",
    "configuration": {
        "unit": "ns"
    }
},

which is defined in the metadata for an array.
The question is: suppose I want to have two arrays of type "datetime";
do I need to repeat the above declaration in each array?

d-v-b · 2023-10-30T13:46:01Z

> The question is: suppose I want to have two arrays of type "datetime";
> do I need to repeat the above declaration in each array?

Yes, because in principle the arrays could have different data types. How else should the data type of an array be specified, if not in the metadata for the array?

DennisHeimbigner · 2023-10-30T16:11:55Z

But does each reference to "datetime" need to include the "configuration", or can
just the name "datetime" be used, similar to how I can just say "float64"?

jbms · 2023-10-30T17:59:09Z

datetime hasn't been standardized so it is kind of hypothetical at this point. But the intent of configuration is to allow parameterized data types without having to encode all of the parameters as a single string. Until there actually are any parameterized data types, though, you might just ignore the possibility in your implementation. That is what I've done in tensorstore.

DennisHeimbigner · 2023-10-30T18:47:56Z

The specific example -- datetime -- is irrelevant,
but I was planning on adding a type "char" as an
extension in support of NCZarr, so I have an interest
in this issue.

jbms · 2023-10-30T18:54:33Z

Is the idea that char will be a fixed-length string, and you would then use a configuration option to indicate the length in bytes, e.g. "data_type": {"name": "char", "configuration": {"length": 10}} ?

jbms · 2023-10-30T18:55:27Z

See also zarr-developers/zeps#47 regarding a variable-length string proposal.

DennisHeimbigner · 2023-10-30T18:58:07Z

No, char will be an 8-bit unsigned integer whose purpose is to hold a single character in some encoding,
typically ASCII or ISO-LATIN-8859, or such. I do not want to use "uint8" because I need to identify the type
for NCZarr.

jbms · 2023-10-30T19:00:56Z

> No, char will be an 8-bit unsigned integer whose purpose is to hold a single character in some encoding, typically ASCII or ISO-LATIN-8859, or such. I do not want to use "uint8" because I need to identify the type for NCZarr.

I see, in that case you would not need any configuration options, unless you want to use a configuration option to indicate the character encoding.

DennisHeimbigner · 2023-10-30T19:03:35Z

Yes, with the option of specifying the encoding.

jbms · 2023-10-30T19:23:58Z

I think there is in general a question as to what should go into the data type configuration and what should go into a separate attribute to indicate "units".

For example, if we are storing a temperature in degrees C, we would probably not have a separate "degrees C stored as float64" data type. Instead, we would store it with a data type of "float64" and use some other attribute to indicate that the unit is "degrees C".

For datetime, a unit rather than a separate data type would also seem to me to be cleaner, but many data storage formats, including zarr v2 as implemented by zarr-python, do have a separate data type for datetime.

DennisHeimbigner · 2023-10-30T19:37:48Z

We seem to have strayed from my original question, namely: does the full data type definition, including
"configuration", need to be included in every array in which it is used?
I gather from the above discussion that there is currently no answer to this question,
and an answer must wait until there is an actual use case.

jbms · 2023-10-30T19:43:59Z

If the data type has no configuration options, then it can be specified as a plain string, {"data_type": "float64", ...}. If the data type has configuration options, like {"data_type": {"name": "char", "configuration": {"encoding": "ascii"}}, ...}, then indeed those configuration options must be specified each time the data type is specified. There isn't any way to specify "default" configuration options for a data type, like saying that any time we reference char we should assume a configuration of {"encoding": "ascii"}.
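The two metadata forms described above could be normalized by a small helper. This is only a minimal sketch, not code from any Zarr implementation: the function name `parse_data_type` is made up for illustration, and `char` with an `encoding` option is the hypothetical extension discussed in this thread, not a standardized data type.

```python
from typing import Any


def parse_data_type(value: Any) -> tuple[str, dict[str, Any]]:
    """Normalize a "data_type" metadata value to (name, configuration)."""
    if isinstance(value, str):
        # Plain-string form: a data type with no configuration options.
        return value, {}
    if isinstance(value, dict) and isinstance(value.get("name"), str):
        # Object form: the configuration is written out at every use of the type.
        return value["name"], dict(value.get("configuration", {}))
    raise ValueError(f"invalid data_type: {value!r}")


# Hypothetical array metadata fragments (the "char" type is an assumption).
array_a = {"data_type": "float64"}
array_b = {"data_type": {"name": "char", "configuration": {"encoding": "ascii"}}}
array_c = {"data_type": {"name": "char", "configuration": {"encoding": "ascii"}}}

print(parse_data_type(array_a["data_type"]))
# Each array repeats the full declaration; nothing is inherited or defaulted.
print(parse_data_type(array_b["data_type"]) == parse_data_type(array_c["data_type"]))
```

Note that a bare `"char"` would parse to an empty configuration here, which matches the point that there is no mechanism for default configuration options.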

DennisHeimbigner · 2023-10-30T20:11:51Z

An alternative is to reify the type and declare it once in, say, the zarr.info of a group, and then
allow arrays to reference it by name without the configuration.
The problem with repeating the whole type declaration is well known in programming
language semantics as the structural type equality problem.
Consider two type declarations in two different arrays.
Just because the declarations look identical is no reason to assume they refer to the same type.

jbms · 2023-10-30T20:27:10Z

> An alternative is to reify the type and declare it once in, say, the zarr.info of a group, and then allow arrays to reference it by name without the configuration. The problem with repeating the whole type declaration is well known in programming language semantics as the structural type equality problem. Consider two type declarations in two different arrays. Just because the declarations look identical is no reason to assume they refer to the same type.

For the cases we've discussed so far, char and datetime, I don't think nominal vs structural typing would be an issue.
However, I can see that if we had a data type representing a record type containing multiple fields, that this might be more of an issue. In some ways the idea of nominal vs structural typing is also related to the issue of units --- you could imagine storing a "units" attribute for this record-typed array that indicates some unique identifier for the record type.

In general I think nominal typing is problematic because a given program may be working with arrays from more than one group. For example, suppose you are copying an array from local disk storage to s3. If the local disk array somehow references a data type defined in its parent group, and you compare data types nominally rather than structurally, there would be no way for the array stored on s3 to specify the same data type.
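To make the copy scenario concrete, here is a minimal sketch contrasting the two comparison rules. Everything in it is an assumption for illustration: the `defined_in` field, the two helper functions, and the storage locations are hypothetical and appear nowhere in the spec.

```python
def structurally_equal(a: dict, b: dict) -> bool:
    """Compare by content only: base name plus configuration."""
    return (a["name"], a["configuration"]) == (b["name"], b["configuration"])


def nominally_equal(a: dict, b: dict) -> bool:
    """Compare by declaration identity: where the type is defined and its name."""
    return (a["defined_in"], a["name"]) == (b["defined_in"], b["name"])


# A type declared in a group on local disk (hypothetical location).
local = {"defined_in": "file:///data/g", "name": "T", "configuration": {"unit": "ns"}}
# The same declaration after copying the array and its group to S3.
remote = {"defined_in": "s3://bucket/g", "name": "T", "configuration": {"unit": "ns"}}

print(structurally_equal(local, remote))  # True
print(nominally_equal(local, remote))     # False
```

Under the nominal rule the copy can never compare equal to the original, because the location of the definition is part of the type's identity.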

DennisHeimbigner · 2023-10-30T20:35:00Z

Sure there is. The standard approach is to use fully qualified names (FQNs).

jbms · 2023-10-30T20:43:51Z

> Sure there is. The standard approach is to use fully qualified names (FQNs).

Can you give an example of what you have in mind?

I think we run into a problem if the "fully-qualified name" is both the unique identifier and the location of the definition. If the "fully-qualified name" is independent of the location of the definition, or the definition is always provided inline with the fully-qualified name, then it seems fine.

DennisHeimbigner · 2023-10-30T20:57:32Z

I don't follow your last paragraph.
But I would do something like this:

Suppose we have groups /g and /g/h.
In /g/zarr.info, we define a type {"name": "T", "configuration": {"param1": value1, "param2": value2}}.
We could have the following array /g/i with {..., "data_type": "T", ...}
and another array /g/h/j with {..., "data_type": "/g/T", ...}.
If there is a problem with using e.g. /g/T, then we could use another path separator besides '/'.
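A rough sketch of how such references might be resolved follows. It is purely illustrative of this proposal: the `GROUP_TYPES` table, the `resolve_type` function, and the integer values standing in for value1 and value2 are all assumptions, and no such lookup exists in the spec.

```python
import posixpath

# Hypothetical per-group type tables, keyed by group path from the root.
# The values 1 and 2 are stand-ins for value1 and value2.
GROUP_TYPES = {
    "/g": {"T": {"name": "T", "configuration": {"param1": 1, "param2": 2}}},
}


def resolve_type(ref: str, array_path: str) -> dict:
    """Resolve a type reference used by the array stored at array_path."""
    if ref.startswith("/"):
        # Fully qualified name: group path plus type name, relative to the root.
        group, type_name = posixpath.split(ref)
    else:
        # Bare name: look only in the group that directly contains the array.
        group, type_name = posixpath.dirname(array_path), ref
    try:
        return GROUP_TYPES[group][type_name]
    except KeyError:
        raise LookupError(f"no type {ref!r} visible from {array_path!r}") from None


# /g/i uses the bare name; /g/h/j needs the fully qualified name.
print(resolve_type("T", "/g/i") is resolve_type("/g/T", "/g/h/j"))
```

In this sketch a bare name does not search parent groups, which is why /g/h/j has to spell out /g/T; whether lookup should walk up the hierarchy is a design question the proposal leaves open.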

jbms · 2023-10-30T21:03:21Z

> I don't follow your last paragraph. But I would do something like this:
>
> Suppose we have groups /g and /g/h.
>
> In /g/zarr.info, we define a type {"name": "T", "configuration": {"param1": value1, "param2": value2}}.
>
> We could have the following array /g/i with {..., "data_type": "T", ...}
>
> and another array /g/h/j with {..., "data_type": "/g/T", ...}.
> If there is a problem with using e.g. /g/T, then we could use another path separator besides '/'.

Can you clarify how the name works? Using the datetime example, would T be "datetime"? But we might want to define two different datetime data types, e.g. one stored as seconds and the other as milliseconds. Do we need a separate group in order to have a different datetime data type?

jbms · 2023-10-30T21:05:50Z

Regarding the "fully-qualified" name, suppose we are working with two arrays, one stored on the local disk and one stored on s3. How do we know that "/g/T" relative to some group on our local disk is supposed to be the same as "/g/T" relative to some group on s3? Even if they have identical base data type name and configuration parameters, per the idea of nominal typing we would not want to assume that they are identical.

DennisHeimbigner · 2023-10-30T21:09:57Z

Yes, T would be e.g. datetime.
If you want two types with different precision, you would have to have two types,
datetime_sec and datetime_ms, for example. But this seems to me analogous to float32 vs float64.
For your second comment: in standard usage, the FQN is always relative to the root group.

DennisHeimbigner · 2023-10-30T21:16:31Z

> Regarding the "fully-qualified" name, suppose we are working with two arrays, one stored on the local disk and one stored on s3. How do we know that "/g/T" relative to some group on our local disk is supposed to be the same as "/g/T" relative to some group on s3? Even if they have identical base data type name and configuration parameters, per the idea of nominal typing we would not want to assume that they are identical.

I misread this comment. My assumption has always been that the metadata for a given file was self-contained,
so it was not possible to reference things in other files. Is this incorrect? Is the scope of the metadata of a file
described somewhere?

jbms · 2023-10-30T21:19:33Z

> > Regarding the "fully-qualified" name, suppose we are working with two arrays, one stored on the local disk and one stored on s3. How do we know that "/g/T" relative to some group on our local disk is supposed to be the same as "/g/T" relative to some group on s3? Even if they have identical base data type name and configuration parameters, per the idea of nominal typing we would not want to assume that they are identical.
>
> I misread this comment. My assumption has always been that the metadata for a given file was self-contained, so it was not possible to reference things in other files. Is this incorrect? Is the scope of the metadata of a file described somewhere?

Currently there isn't really any type of "reference" to other arrays/groups/attributes/metadata fields of any kind anywhere in the spec.

DennisHeimbigner · 2023-10-30T21:32:27Z

Interesting. That validates my belief that array declarations are intended to be self-contained.
In any case, yes, I am proposing references to other semantic objects in the metadata tree of a file.

DennisHeimbigner mentioned this issue Oct 30, 2023

Self contained arrays #274

Open
