Open matrix requirements

Ben Stabler edited this page Feb 27, 2016 · 4 revisions

Here is a tentative list of more specific requirements for the open matrix data format.

Storage Format

  • The storage format is open, so that anyone can access the data without depending on proprietary technology or licenses.
  • The storage format physically represents the data structure (i.e. any physical format will suffice provided the necessary structural elements are present)
  • The format must have an on-disk (persistent) specification. This does not rule out in-memory, streaming, or other protocols, but on-disk is a primary requirement. It may be desirable to support two native storage formats (netCDF does something like this): 1) one (e.g. HDF5) oriented toward efficient read/write/size performance, and 2) another oriented toward "zero software" interchange (e.g. a textual format that is 'better than CSV')
  • Some level of compression and/or efficient storage (for non-dense or structured data) should be supported
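
The "zero software" textual format and the compression requirement above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the JSON layout, the field names `shape`, `dtype` and `values`, and the function names `dump_text` and `load_text` are hypothetical and are not part of any agreed open matrix specification. The point is only that a self-describing text document carries its own dimensions and data type, which plain CSV does not.

```python
import gzip
import json


def dump_text(shape, dtype, values, compress=False):
    """Serialize a dense, row-major matrix to a self-describing text document.

    Hypothetical layout: a JSON object with 'shape', 'dtype' and 'values'.
    """
    rows, cols = shape
    if len(values) != rows * cols:
        raise ValueError("number of values does not match the declared shape")
    doc = {"shape": [rows, cols], "dtype": dtype, "values": list(values)}
    data = json.dumps(doc).encode("utf-8")
    # Compression is optional; the uncompressed form stays human-readable.
    return gzip.compress(data) if compress else data


def load_text(data, compressed=False):
    """Read back a document written by dump_text."""
    raw = gzip.decompress(data) if compressed else data
    doc = json.loads(raw.decode("utf-8"))
    return tuple(doc["shape"]), doc["dtype"], doc["values"]


# Round trip of a 2 x 3 matrix stored in row-major order.
blob = dump_text((2, 3), "float64", [1.0, 2.5, 0.0, 0.0, 0.0, 4.0], compress=True)
shape, dtype, values = load_text(blob, compressed=True)
```

A binary container such as HDF5 would cover the performance-oriented side of the requirement; this sketch addresses only the interchange side.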

Data Structure

  • The data structure consists of a block of storage whose structure is described by formal metadata (some of it required, some optional)
  • Metadata
      • Minimal metadata includes the dimensions of the matrix and the data type (e.g. floating point or integer)
      • The format should support additional descriptive metadata about the overall matrix, the matrix dimensions, and the tables within it (e.g. descriptions of how the data was developed, its intended use, its geographical representation, etc.)
  • Dimensions
      • Two-dimensional matrix (rectangular array) data is stored (a "Table")
      • If multiple tables exist with the same shape/dimensionality and data type, then a single open matrix data structure should be able to store them all together and index them efficiently. This requirement is a special case of "N-dimensional data" as described in the next bullet and may not be "core"
      • N-dimensional data (same data type) is a requested enhancement, but perhaps not a core requirement. This could be supported either directly, or indirectly via collections of 2-dimensional matrices, as long as the logical ability exists to index efficiently along the additional dimensions
      • Dimensions should be described as a logical structure with one required and several recommended elements (logically, these are metadata about the dimension)
          • The dimension descriptor must at a minimum specify the "length" of the dimension (the only required element)
          • The dimension descriptor should allow for index labels (dimensional index value aliases)
          • Restrictions on the index label values (types, contiguousness) should be minimized
          • The dimension descriptor can contain information on sub-matrices, nested geographies and matrix permutations
          • The dimension descriptor should allow for additional arbitrary metadata
  • Data Types
      • At least real (double/float) and integer data formats should be available
      • Structured data of any type is a requested enhancement, but not a core requirement
      • All cells must have the same data type
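
The data structure requirements above can be sketched as a small object model. This is only one possible reading of the list, not a proposed specification: the class names `Dimension` and `OpenMatrix`, their field names, and the choice of nested Python lists for cell storage are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Dimension:
    """Dimension descriptor: 'length' is the only required element."""
    length: int
    labels: Optional[List[Any]] = None          # optional index label aliases
    metadata: Dict[str, Any] = field(default_factory=dict)  # arbitrary extras

    def __post_init__(self):
        if self.length < 0:
            raise ValueError("length must be non-negative")
        if self.labels is not None and len(self.labels) != self.length:
            raise ValueError("expected one label per index position")

    def index_of(self, label):
        """Translate a label into a zero-based position."""
        if self.labels is None:
            raise KeyError("this dimension has no labels")
        return self.labels.index(label)  # raises ValueError if absent


@dataclass
class OpenMatrix:
    """A collection of named tables sharing one shape and one data type."""
    rows: Dimension
    cols: Dimension
    dtype: type                                  # e.g. float or int
    tables: Dict[str, List[List[Any]]] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)

    def add_table(self, name, data):
        # Every table must match the shared dimensions...
        if len(data) != self.rows.length or any(
            len(row) != self.cols.length for row in data
        ):
            raise ValueError("table shape must match the matrix dimensions")
        # ...and every cell must have the shared data type.
        if any(type(v) is not self.dtype for row in data for v in row):
            raise TypeError("all cells must have the matrix data type")
        self.tables[name] = [list(row) for row in data]


zones = Dimension(2, labels=["A", "B"])
matrix = OpenMatrix(rows=zones, cols=zones, dtype=float)
matrix.add_table("distance", [[0.0, 1.5], [1.5, 0.0]])
```

Labels are kept separate from positions so that restrictions on label values stay minimal: any Python object can serve as a label in this sketch.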

Programmatic Access

  • The data format should cleanly separate (decouple) the storage format, the data structure and the API
      • The data structure explains the logical content of the data
      • The data storage format shows how that content is represented in a block of storage
      • An API for a supported language describes how to manipulate a block of storage in the idiom of that language
  • The storage format(s) should be described clearly and illustrated through free, working code examples that can be integrated into applications intended to manipulate data stored in this format
  • The data structure should be expressed as an object model
      • The object model is described abstractly and illustrated through implementations of APIs in popular programming and scripting languages
  • Each language API should include essential operations on the format, expressed in the idiom of that language
      • Creating and populating a matrix (or table) from suitable language-specific data structures
      • Deleting a matrix (or table)
      • Efficient data access (read/write) by index
          • Entire matrix
          • One table in the matrix (N-dimensional variant)
          • One row or column
          • Single cell value
      • Manipulating matrix, table or dimension metadata
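
The essential operations listed above might look as follows in Python. This is a minimal in-memory sketch under stated assumptions: the class `MatrixStore` and all of its method names are hypothetical, and a real implementation would read from and write to a storage format rather than hold nested lists.

```python
class MatrixStore:
    """In-memory stand-in for a matrix file holding same-shaped tables."""

    def __init__(self, n_rows, n_cols):
        self.shape = (n_rows, n_cols)
        self.metadata = {}      # matrix-level metadata, freely editable
        self._tables = {}

    def create_table(self, name, rows):
        """Create and populate a table from a list of row lists."""
        rows = [list(r) for r in rows]
        if len(rows) != self.shape[0] or any(
            len(r) != self.shape[1] for r in rows
        ):
            raise ValueError("table does not match the matrix shape")
        self._tables[name] = rows

    def delete_table(self, name):
        del self._tables[name]

    def all_tables(self):
        """Entire matrix: every table, keyed by name (copies)."""
        return {name: self.table(name) for name in self._tables}

    def table(self, name):
        """One table, returned as a copy."""
        return [list(r) for r in self._tables[name]]

    def row(self, name, i):
        return list(self._tables[name][i])

    def column(self, name, j):
        return [r[j] for r in self._tables[name]]

    def cell(self, name, i, j):
        return self._tables[name][i][j]

    def set_cell(self, name, i, j, value):
        self._tables[name][i][j] = value


store = MatrixStore(2, 3)
store.metadata["description"] = "example travel times"
store.create_table("time", [[1, 2, 3], [4, 5, 6]])
second_row = store.row("time", 1)
third_column = store.column("time", 2)
store.set_cell("time", 0, 0, 10)
```

Read methods return copies so that callers cannot bypass `set_cell`; an implementation tuned for very large matrices would more likely return views onto the stored data.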

Format Evolution

  • Each iteration of the format should include its own version identification, so that anyone receiving a data set in this format can recognize from the data itself when a different specification is needed, and can retrieve the appropriate one (ideally along with the supporting code and applications)
  • The format should be transparently extensible by allowing for future enhancements and modifications such as additional storage formats, additional dimensions, additional supported data types, etc.
  • The format should be scalable and allow for very large data sets (many GB, and even into the TB range)
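
The version identification requirement above can be sketched as a check that a reader performs before touching the data. The attribute name `format_version`, the "major.minor" numbering scheme, and the rule that only a change in major version breaks readers are all assumptions made for this example; the requirements themselves do not prescribe any of them.

```python
SUPPORTED_MAJOR = 1  # hypothetical: the major version this reader understands


def major_version(version_text):
    """Extract the major number from a 'major.minor' style string."""
    return int(version_text.split(".", 1)[0])


def is_readable(metadata, supported_major=SUPPORTED_MAJOR):
    """Decide from the data's own metadata whether this reader applies.

    Assumes minor versions only add optional elements, so any file with
    the same major version can be read.
    """
    version_text = metadata.get("format_version")
    if version_text is None:
        raise ValueError("data set carries no version identification")
    return major_version(version_text) == supported_major


same_major = is_readable({"format_version": "1.3"})
newer_major = is_readable({"format_version": "2.0"})
```

When `is_readable` returns false, the version string tells the recipient which specification to look up.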