Data Structure 0.2

Ben Stabler edited this page Mar 18, 2017 · 7 revisions

The OMX data structure is a catalog of required and optional information that must appear in a well-formed matrix. For a more complete data structure that might one day be implemented (or to reflect on what needs to be in a fully developed matrix data structure), check out Data Structure of the Future below.

The OMX data structure is a container ("the OMX") that consists of a list of named two-dimensional data tables, and an optional list of named one-dimensional map vectors (or "lookups", see below) used to associate an extra piece of information (such as a label or name) with each index value on a data table dimension. In addition, the OMX may have arbitrary amounts of additional cruft metadata attached in the form of a set of key:value pairs. The only restriction on what is placed in attributes in version 0.2 is that a conforming API must be able to convert the keys and values into something that can be stored as an HDF5 attribute.

The data table is understood to contain cells, which have a datatype (see below) and are located by specifying two index values (one for each of the two dimensions in the table).

Each data table in the OMX must have the same shape, specified for the entire OMX as a pair of integers defining the size of the data table in the corresponding dimensions ("first" and "second").
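The container described so far can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: `OMXSketch`, its fields, and `add_table` are hypothetical names, not part of any OMX API, and nested Python lists stand in for HDF5 datasets.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class OMXSketch:
    """Hypothetical in-memory stand-in for an OMX container (not a real API)."""
    shape: Tuple[int, int]                                  # one shape for the whole OMX
    tables: Dict[str, List[List[Any]]] = field(default_factory=dict)
    lookups: Dict[str, List[Any]] = field(default_factory=dict)
    attributes: Dict[str, Any] = field(default_factory=dict)

    def add_table(self, name: str, rows: List[List[Any]]) -> None:
        # Every table must match the single shape declared for the OMX.
        actual = (len(rows), len(rows[0]) if rows else 0)
        if actual != self.shape or any(len(r) != self.shape[1] for r in rows):
            raise ValueError(f"table {name!r} has shape {actual}, expected {self.shape}")
        self.tables[name] = rows


omx = OMXSketch(shape=(2, 3))
omx.add_table("distance", [[0.0, 1.5, 2.0], [1.5, 0.0, 0.7]])
omx.attributes["scenario year"] = 2017   # arbitrary key:value metadata
```

Adding a table of any other shape, for example a 2 x 2 table, raises `ValueError` in this sketch.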

Note on indexing into a data table: Dimensions of the data table may be interpreted as either 0-based (in which case the shape contains a value one more than the maximum acceptable index value) or 1-based (in which case the shape contains a value equal to the maximum acceptable index value), or indeed N-based where N is any integer. An individual data point in a data table (a "cell") is accessed via a 2-tuple of index values, each ranging from the minimum value on its dimension to the maximum, where the minimum is understood as the "base" of the index values ("N"), and the maximum is one less than the minimum plus the length of that dimension encoded in the shape element (that is, N + length - 1).
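The index arithmetic in this note can be checked with a short sketch. The helper names `index_bounds` and `to_offset` are hypothetical and chosen only for illustration.

```python
def index_bounds(base: int, length: int) -> tuple:
    """Return the (minimum, maximum) valid index for one dimension."""
    return base, base + length - 1


def to_offset(index: int, base: int, length: int) -> int:
    """Convert an N-based index into a 0-based storage offset."""
    lo, hi = index_bounds(base, length)
    if not lo <= index <= hi:
        raise IndexError(f"index {index} outside {lo}..{hi}")
    return index - base


zero_based = index_bounds(0, 5)   # (0, 4): shape is one more than the maximum
one_based = index_bounds(1, 5)    # (1, 5): shape equals the maximum
```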

Each data table in the OMX has a "well-known" datatype for its cells (conceptually, an integer, floating point, or string value) that can be readily converted by an API into a native datatype. The storage format specifies what types are available (and the 0.2 format is implemented in HDF5), and the API provides translation between the storage format datatype and equivalent native datatypes. The datatype may vary from one table to another in the OMX (but each data table must have the same shape).

Each data table in the OMX may have an optional "NA" (Not Available) value consistent with the datatype and the application of the data in the data table. For floating point values, this can be an Infinity or NaN ("Not a Number") value. For integer or string datatypes, it must be set to some value that is "out of range" for valid elements stored in the table (e.g. a specific (large) positive or negative value). Selecting an NA value (and whether to have one at all) is determined by the originator of the OMX.
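A consumer of a table has to treat the NA marker specially, and NaN needs extra care because it never compares equal to itself. The following is a minimal sketch; `is_na`, `mean_ignoring_na`, and the `-999` sentinel are assumptions made for illustration.

```python
import math


def is_na(value, na_value):
    """True if `value` matches the table's NA marker (NaN is tested explicitly)."""
    if isinstance(na_value, float) and math.isnan(na_value):
        return isinstance(value, float) and math.isnan(value)
    return value == na_value


def mean_ignoring_na(cells, na_value):
    """Average the valid cells; return None when every cell is NA."""
    valid = [c for c in cells if not is_na(c, na_value)]
    return sum(valid) / len(valid) if valid else None


float_mean = mean_ignoring_na([1.0, float("nan"), 3.0], float("nan"))
int_mean = mean_ignoring_na([4, -999, 6], -999)   # -999 is an out-of-range sentinel
```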

Each data table may have arbitrary amounts of additional cruft metadata attached in the form of a set of key:value pairs. The only restriction on what is placed in attributes in version 0.2 is that a conforming API must be able to convert the keys and values into something that can be stored as an HDF5 attribute.

In addition to data tables, the OMX may also contain lookups. A lookup is a vector the same length as one of the data table shape dimensions. The datatype of the vector is arbitrary but will typically be a string or integer. Each element of the vector contains a (possibly duplicated) "lookup value" that associates that element's index position in the lookup vector with the corresponding index position along a compatible (same length) data table dimension. The lookup vector is required to have a name.

For a compatible (same length) dimension, the OMX conceptually permits using a value found in the lookup vector to identify all index locations in which that value appears in the vector (returning one or more integer indices that may be used to identify locations along the corresponding data table dimension). To apply a lookup, the user selects the lookup, identifies the dimension to which it applies, and supplies a lookup value (a key); the index value (or values) on the data table are the index positions of the vector elements containing that lookup value. This sounds complicated, but it is simply a reverse dictionary search (retrieving the keys/indexes at which the lookup value is found).
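The reverse search can be sketched in a few lines. The lookup contents, the table values, and the function name `find_indices` are all made up for illustration, and 0-based positions are assumed.

```python
def find_indices(lookup, value):
    """Return every position in the lookup vector that holds `value`."""
    return [i for i, v in enumerate(lookup) if v == value]


# Hypothetical lookup for a dimension of length 3; values may repeat.
county = ["Alpha", "Beta", "Alpha"]
table = [[0, 4, 7],
         [4, 0, 2],
         [7, 2, 0]]

rows = find_indices(county, "Alpha")       # positions along the first dimension
selected = [table[i] for i in rows]        # the rows those positions identify
```

A value that does not occur in the lookup yields an empty list in this sketch; how an API should report that case is left open by the text.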

Each lookup vector may have arbitrary amounts of additional cruft metadata attached in the form of a set of key:value pairs. The only restriction on what is placed in attributes in version 0.2 is that a conforming API must be able to convert the keys and values into something that can be stored as an HDF5 attribute.

Data Structure of the Future

The OMX data structure is a catalog of required and optional information that must appear in a well-formed matrix. Here is a proposed Data Structure:

Overall Matrix Metadata (key/value pairs)

  • Data Storage Format (StorageFormat); optional label; if not provided, the matrix storage is presumed to be the “default” (HDF5?); could also indicate the compression scheme (if that cannot be determined by inspection). Note that the storage format should be discoverable by direct inspection without software. For example:
      • Stored as a "magic number" in the first bytes of the storage medium, which can be inspected by opening the medium and reading the first few bytes (as is done, for example, with file magic numbers by the Unix/Linux file(1) command); unlike other metadata, a “magic number” need not be a key/value pair
      • Determined from the file extension
      • Determined by attempting to open the medium with a particular data storage library (HDF5, NetCDF)
  • Also note that if a format (HDF5?) has its own magic number identifier that can serve to identify the data format, an OMX-compliant implementation will open the storage format and then seek the OMX version identifier in its standard metadata location according to the specification for that storage format
  • The key requirement here is that you can figure out the storage format without having to make assumptions or use any particular specialized software to do so
  • Version Identifier (Version); required label (specifies OMX version against which this matrix is encoded). Ideally, the version can also be ascertained by direct inspection
  • Data Type; required label (per OMX version description of supported types, to be documented elsewhere - e.g. float, int). All cells in the OMX have this data type
  • Number of Dimensions (NDim); optional integer (default: 2; NDim >= 2)
  • Dimension descriptions for any dimension (assuming 0-based indexing and using Python-like range notation -- exact indexing scheme is an implementation detail of either an API or Storage Format):
      • Each dimension is defined as a physical range (0:M), where M is the dimension length
      • The dimension length is the absolute minimum information required to specify a dimension
      • Each dimension has an optional label (e.g. "Census Geography")
      • Each dimension has one or more optional mappings
          • A mapping is a vector of range 0:T, where T is the number of locations in that mapping (1 <= T <= M), and each item in the mapping vector is an index in the range 0:M
          • Mappings are identified by a required unique (but arbitrary with respect to content) label (e.g. "County", "TAZ", "Block Group")
          • At a minimum, there is an implicit idempotent map from a set of integers 0:M to itself (the physical matrix index)
          • If labels are to be provided for the physical matrix index, the user will need to specify an explicit mapping to which to attach the labels
          • Each mapping can have an optional labeling scheme (which associates an arbitrary unique label one-to-one with an index in the corresponding mapping range 0:T, so you can take discontinuous TAZ labels and pack them into a continuous integral dimension)
      • An index on a dimension must specify a mapping (default: physical mapping) and select a location in the mapping either by integer index or (if available) by label
  • If a single dimension description is provided for the OMX, then it is implicitly used for both of the required two dimensions when defining a matrix. Thus, matrix dimensions can be fully specified with a single integer
  • The user can attach arbitrary key:value pairs containing descriptive metadata (using intuitive keys like "name" or "scenario year") to either of these: 1) The overall OMX file and 2) Individual dimensions (but NOT to the 2D blocks -- the descriptive metadata for those is constructed by assembling labels or metadata from the higher dimensions)
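One possible reading of the dimension descriptions above is sketched here. The class names `Mapping` and `Dimension`, their fields, and the TAZ labels are hypothetical; the sketch assumes 0-based indexing and string labels, and it is not a proposed API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Mapping:
    """A named vector of T physical indices, optionally labeled one-to-one."""
    name: str
    targets: List[int]
    labels: Optional[List[str]] = None

    def resolve(self, key: Union[int, str]) -> int:
        """Turn a mapping position (int) or label (str) into a physical index."""
        if isinstance(key, str):
            if self.labels is None:
                raise KeyError(f"mapping {self.name!r} has no labels")
            key = self.labels.index(key)  # ValueError if the label is unknown
        return self.targets[key]


@dataclass
class Dimension:
    length: int                      # M: the only required piece of information
    label: Optional[str] = None
    mappings: Dict[str, Mapping] = field(default_factory=dict)

    def add_mapping(self, mapping: Mapping) -> None:
        t = len(mapping.targets)
        if not 1 <= t <= self.length:
            raise ValueError("mapping length T must satisfy 1 <= T <= M")
        if any(not 0 <= i < self.length for i in mapping.targets):
            raise ValueError("mapping items must be indices in 0:M")
        if mapping.labels is not None:
            if len(mapping.labels) != t or len(set(mapping.labels)) != t:
                raise ValueError("labels must be unique, one per mapping position")
        self.mappings[mapping.name] = mapping

    def index(self, key: Union[int, str], mapping: Optional[str] = None) -> int:
        """Resolve a key to a physical index; no mapping means the physical one."""
        if mapping is None:
            if not isinstance(key, int) or not 0 <= key < self.length:
                raise IndexError(f"physical index {key!r} outside 0:{self.length}")
            return key
        return self.mappings[mapping].resolve(key)


dim = Dimension(length=4, label="Census Geography")
dim.add_mapping(Mapping("TAZ", targets=[0, 1, 3], labels=["101", "205", "990"]))
by_label = dim.index("990", mapping="TAZ")   # label -> mapping position 2 -> physical 3
by_position = dim.index(2, mapping="TAZ")    # mapping position 2 -> physical 3
physical = dim.index(2)                      # implicit physical mapping
```

Here the discontinuous labels "101", "205", "990" are packed into mapping positions 0, 1, 2, which in turn point at physical indices 0, 1, 3.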

Indexed Data Structure

  • The matrix logically consists of cell locations identified by specifying an index value for each of the dimensions
  • Index values for a dimension D must be in the range 1:DimD.Length, where DimD is the Dimension Descriptor and Length is the length element of that descriptor
  • "Slices" of the matrix can be retrieved by specifying a vector of indices in one or more dimensions (e.g. to retrieve a set of columns from a row, specify a vector consisting of a subset of [1,2,3,…,Dim2.Length] as the column index), where each element of the vector is in the range 1:DimD.Length
  • The Data Storage Format determines how those indexes map to physical locations within the stored block of matrix data
  • Each individual language API determines how the data structure is mapped to efficient language-specific data structures (and there may be different APIs developed for the same language to favor different types of processing), and other performance-related issues such as whether the entire matrix is read into memory or whether smaller matrix slices can be loaded
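Slice retrieval by index vectors might look like the following sketch. It follows the 1:DimD.Length range used in the bullets above, keeps the matrix as nested lists, and uses a hypothetical function name `get_slice`; a real API would map these indices onto whatever the storage format provides.

```python
def get_slice(matrix, rows, cols):
    """Return the sub-matrix at the given 1-based row and column indices."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    for i in rows:
        if not 1 <= i <= n_rows:
            raise IndexError(f"row index {i} outside 1:{n_rows}")
    for j in cols:
        if not 1 <= j <= n_cols:
            raise IndexError(f"column index {j} outside 1:{n_cols}")
    # Shift each index by one to reach the 0-based Python lists.
    return [[matrix[i - 1][j - 1] for j in cols] for i in rows]


matrix = [[11, 12, 13],
          [21, 22, 23]]
row_one_cols = get_slice(matrix, [1], [1, 3])   # columns 1 and 3 of row 1
```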

Recommended API operations

  • Open/Close a matrix in a storage format
  • Create and populate a matrix in a storage format based on required metadata (matrix may be blank, all “nulls” or “zeroes”; or initial values can be supplied from language-specific data structures)
  • Read structural metadata (dimensions, dimension descriptors, data type)
  • Read/Write/Update optional matrix metadata, or dimension descriptor metadata
  • Read/Write/Update indexed matrix cells or slices with values in a language-specific data structure
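The recommended operations could be written down as an abstract interface that each storage format implements. This is a sketch only: `MatrixStore` and every method name and signature below are assumptions made for illustration, not an existing OMX API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Sequence, Tuple


class MatrixStore(ABC):
    """Hypothetical interface covering the recommended operations."""

    @abstractmethod
    def open(self, path: str, mode: str = "r") -> None:
        """Open an existing matrix in this storage format."""

    @abstractmethod
    def close(self) -> None:
        """Release the underlying storage."""

    @abstractmethod
    def create(self, path: str, shape: Tuple[int, ...], dtype: str,
               initial: Optional[Sequence[Sequence[Any]]] = None) -> None:
        """Create a matrix from required metadata, optionally populated."""

    @abstractmethod
    def structure(self) -> Dict[str, Any]:
        """Return dimensions, dimension descriptors, and data type."""

    @abstractmethod
    def get_metadata(self, key: str) -> Any:
        """Read one optional metadata value."""

    @abstractmethod
    def set_metadata(self, key: str, value: Any) -> None:
        """Write or update one optional metadata value."""

    @abstractmethod
    def read_cells(self, rows: Sequence[int], cols: Sequence[int]) -> list:
        """Read cells or a slice into a language-specific structure."""

    @abstractmethod
    def write_cells(self, rows: Sequence[int], cols: Sequence[int],
                    values: Sequence[Sequence[Any]]) -> None:
        """Write or update cells or a slice."""
```

Keeping the interface abstract leaves performance choices, such as reading the whole matrix versus loading slices, to each implementation.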