
Existing Solutions


HDF5

From the HDF Group's FAQ, The HDF5 technology suite includes:

  • A versatile data model that can represent very complex data objects and a wide variety of metadata.
  • A completely portable file format with no limit on the number or size of data objects in the collection.
  • A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
  • A rich set of integrated performance features that allow for access time and storage space optimizations.
  • Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.

More discussion about HDF5 is here
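
As a rough illustration of what using HDF5 looks like in practice, here is a minimal sketch using the h5py Python bindings; the file name, dataset name, and attribute are placeholders rather than any agreed convention.

        import numpy as np
        import h5py

        matrix = np.random.rand(1000, 1000)

        # Write the matrix as a compressed dataset with a metadata attribute.
        with h5py.File("example.h5", "w") as f:
            dset = f.create_dataset("trips", data=matrix, compression="gzip")
            dset.attrs["units"] = "person trips"

        # Read back a single row; HDF5 supports partial reads without loading everything.
        with h5py.File("example.h5", "r") as f:
            first_row = f["trips"][0, :]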

NetCDF v3

From the NetCDF FAQ:

NetCDF (network Common Data Form) is a set of interfaces for array-oriented data access and a freely distributed collection of data access libraries for C, Fortran, C++, Java, and other languages. The netCDF libraries support a machine-independent format for representing scientific data. Together, the interfaces, libraries, and format support the creation, access, and sharing of scientific data.

NetCDF data is:

  • Self-Describing. A netCDF file includes information about the data it contains.
  • Portable. A netCDF file can be accessed by computers with different ways of storing integers, characters, and floating-point numbers.
  • Scalable. A small subset of a large dataset may be accessed efficiently.
  • Appendable. Data may be appended to a properly structured netCDF file without copying the dataset or redefining its structure.
  • Sharable. One writer and multiple readers may simultaneously access the same netCDF file.
  • Archivable. Access to all earlier forms of netCDF data will be supported by current and future versions of the software.

NetCDF v4 offers roughly the same feature set as v3, but notably implements the Common Data Form on top of HDF5, providing an alternative implementation for reading/writing most HDF5 datasets.
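
For comparison, below is a minimal sketch of writing and reading a matrix with the netCDF4 Python library; the dimension and variable names are illustrative only.

        import numpy as np
        from netCDF4 import Dataset

        matrix = np.random.rand(100, 100)

        # Write: named dimensions, variables, and attributes make the file self-describing.
        with Dataset("example.nc", "w", format="NETCDF4") as nc:
            nc.createDimension("origin", matrix.shape[0])
            nc.createDimension("destination", matrix.shape[1])
            trips = nc.createVariable("trips", "f8", ("origin", "destination"), zlib=True)
            trips[:, :] = matrix
            trips.units = "person trips"

        # Read: a small subset can be accessed without reading the whole variable.
        with Dataset("example.nc", "r") as nc:
            subset = nc.variables["trips"][0:10, 0:10]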

SQLite

From the website (http://sqlite.org): SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.

SQL is the standard access mechanism for relational data and can naturally handle row and column indexing with equal efficiency. It is not an "out of the box" solution (though neither are other solutions such as HDF5 or NetCDF).
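
A minimal sketch of the relational approach, using Python's built-in sqlite3 module and an assumed (origin, destination, value) table layout:

        import sqlite3

        conn = sqlite3.connect("matrices.db")
        conn.execute("""CREATE TABLE IF NOT EXISTS trips (
                            origin INTEGER, destination INTEGER, value REAL,
                            PRIMARY KEY (origin, destination))""")
        # A secondary index makes column (destination) lookups as cheap as row lookups.
        conn.execute("CREATE INDEX IF NOT EXISTS idx_dest ON trips (destination)")

        cells = [(1, 1, 22.0), (1, 2, 13.0), (2, 1, 16.0)]
        conn.executemany("INSERT OR REPLACE INTO trips VALUES (?, ?, ?)", cells)
        conn.commit()

        row = conn.execute("SELECT destination, value FROM trips WHERE origin = 1").fetchall()
        col = conn.execute("SELECT origin, value FROM trips WHERE destination = 1").fetchall()
        conn.close()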

BSON

From the BSON Web Page (http://bsonspec.org/):

BSON [bee · sahn], short for Binary JSON, is a binary-encoded serialization of JSON-like documents. Like JSON, BSON supports the embedding of documents and arrays within other documents and arrays. BSON also contains extensions that allow representation of data types that are not part of the JSON spec. For example, BSON has a Date type and a BinData type.

BSON can be compared to binary interchange formats, like Protocol Buffers. BSON is more "schema-less" than Protocol Buffers, which can give it an advantage in flexibility but also a slight disadvantage in space efficiency (BSON has overhead for field names within the serialized data).

BSON was designed to have the following three characteristics:

  • Lightweight - Keeping spatial overhead to a minimum is important for any data representation format, especially when used over the network.
  • Traversable - BSON is designed to be traversed easily. This is a vital property in its role as the primary data representation for MongoDB.
  • Efficient - Encoding data to BSON and decoding from BSON can be performed very quickly in most languages due to the use of C data types.
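
A minimal sketch of BSON round-tripping, assuming a pymongo installation whose bson module exposes bson.encode and bson.decode (true of recent versions); the document layout is illustrative only.

        import datetime
        import bson  # the BSON codec shipped with the pymongo package

        doc = {
            "origin": 1,
            "values": [22.0, 13.0, 27.0],
            "created": datetime.datetime.utcnow(),  # a Date type, which plain JSON lacks
        }
        data = bson.encode(doc)        # bytes
        roundtrip = bson.decode(data)  # back to a Python dict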

Zip Matrix

Zip matrix (or ZMX) is a file format developed by Parsons Brinckerhoff for storing matrices in compressed form. The file is easy to read and write and requires only a library that understands zip files. Zip Tensor was a proposed replacement for Zip Matrix, also developed by Parsons Brinckerhoff. It never gained much use, but it had a few interesting features, summarized briefly as follows:

  • Like Zip Matrix, it uses the zip file format as a container. Rather than disassembling the various component elements, data is zipped in full chunks. This means fine-grained access for things like matrix rows is not available directly from the file. The intent was that generally available computer memory and speed make unzipping whole chunks easy, so over-complicating the format was not worth the (possibly negligible) performance benefits; see the sketch after this list for an illustration of the container approach.
  • Multiple tensors (generalized matrices with 0 to n dimensions) can be held in each zip tensor, so long as they have the same shape. The tensors in a given zip tensor may be of different data types.
  • Indices, which provide aliases for dimensional indexes, can be stored in a zip tensor. These indices may reshape (subset or rearrange) the tensors.
  • Tensor groups may be specified, which allow sets of tensors to be logically grouped with names.
  • Each element in the zip tensor can have any amount of metadata associated with it, in the form of (String) key->value pairs.
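
Neither format is specified in detail here, but the container idea can be illustrated with standard tools. The sketch below is an approximation, not the actual ZMX or Zip Tensor layout: it stores a whole matrix as one compressed entry in an ordinary zip archive, with metadata carried in a separate JSON entry.

        import io
        import json
        import zipfile
        import numpy as np

        # NOTE: entry names, file extension, and metadata layout here are illustrative,
        # not the real ZMX / Zip Tensor schema.
        matrix = np.arange(9, dtype="f8").reshape(3, 3)

        # Write: each tensor is compressed as one whole entry; metadata rides along as JSON.
        with zipfile.ZipFile("example.zipt", "w", zipfile.ZIP_DEFLATED) as zf:
            buf = io.BytesIO()
            np.save(buf, matrix)
            zf.writestr("tensors/trips.npy", buf.getvalue())
            zf.writestr("metadata.json", json.dumps({"tensors/trips.npy": {"units": "person trips"}}))

        # Read: any zip-aware library can recover the data, but only whole entries at a time.
        with zipfile.ZipFile("example.zipt", "r") as zf:
            trips = np.load(io.BytesIO(zf.read("tensors/trips.npy")))
            meta = json.loads(zf.read("metadata.json"))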

Vendor Binary Formats

The major transportation modeling software packages have their own vendor specific binary formats as well. There are four major transportation modeling vendors and each has a binary matrix format (or multiple ones). All of the vendor tools can read/write simple text format matrices.

  • Cube - Cube stores matrices in binary format and matrix I/O is done through a C++ library stored in a DLL. One matrix file has multiple matrices stored by name and number. A license is required for binary-level matrix I/O.

  • EMME - With the release of EMME 4, EMME matrices are stored in EMX files inside an emmemat folder. Each EMX file stores one matrix. Matrix I/O is done through the EMME Modeller Python API, through which row and column names (i.e. zone names/numbers) can be referenced. An example of matrix I/O can be found in the case studies. A license is required for binary-level matrix I/O. EMME also recently added its own simple binary matrix format, which stores one matrix per file. The Modeller Python API is still required to get/set the matrix into EMX format for use with EMME.

  • TransCAD - TransCAD stores matrices in binary format and matrix I/O is done through a C++ library stored in a DLL. One matrix file has multiple matrices stored by name and number. A license is required for binary-level matrix I/O.

  • VISUM - VISUM's compressed binary matrix format, called $BI or $BK, stores one matrix per file. The compression library used is zlib, and the matrices can be read/written using the VisumPy Python libraries. A license is not required for binary-level matrix I/O.

Text Formats

To represent a matrix as text, the matrix dimensions and cells (and possibly other metadata) are encoded as ASCII or Unicode. The formats described on this page illustrate various techniques for representing a matrix in regular text formats.

Spreadsheet-like formats (e.g. CSV or “Comma-Separated Values”)

The most basic representation of a matrix is a table of data organized as rows and columns. The format permutations are almost endless, which makes tabular formats rather hard to deal with. A good comprehensive introduction to the issues that arise with textual data can be found in the Data Import/Export manual from the R statistical environment (here and here). The discussion below describes known strategies for encoding such data.

Implicit metadata formats

The number of rows and columns need not be explicitly stated if the structural characteristics of the matrix are physically encoded in the textual format. In that case, the full matrix dimensions are deduced from the observed number of rows and columns (or from the maximum row and column indices). Without explicit metadata, it is less efficient to encode a sparse matrix (one with many cells that are missing or that repeat the same value), and simple compression schemes may fail because there is no automatic way to tell whether the matrix ended because the maximum dimensions were reached or because the last entries were missing or damaged.

Commonly used formats with implicit metadata include:

One line of text per matrix row, with columns either separated or fixed width

        V11 V12 V13
        V21 V22 V23
        V31 V32 V33
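
A file in this dense layout can be read with a generic whitespace-delimited table reader; for example, a short numpy sketch (assuming purely numeric, whitespace-separated values and an assumed file name):

        import numpy as np

        # Dimensions are deduced from the number of lines and whitespace-separated columns.
        matrix = np.loadtxt("matrix.txt")
        rows, cols = matrix.shape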

Row and Column indices appearing as columns, one cell value per row

        R1 C1 V11
        R1 C2 V12
        …
        R3 C2 V32
        R3 C3 V33
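
A sketch of reading this coordinate layout with numpy, deducing the matrix dimensions from the maximum row and column indices (assuming numeric values, 1-based indices, and an assumed file name):

        import numpy as np

        triplets = np.loadtxt("matrix_coo.txt")   # columns: row index, column index, value
        rows = int(triplets[:, 0].max())          # dimensions deduced from the maximum indices
        cols = int(triplets[:, 1].max())
        matrix = np.zeros((rows, cols))
        matrix[triplets[:, 0].astype(int) - 1,
               triplets[:, 1].astype(int) - 1] = triplets[:, 2]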

Row and Column indices appearing as columns, multiple cell values per row (Values in this case are X, Y and Z – see the FAF Case Study for a practical example of this format)

        R1 C1 X11 Y11 Z11
        R1 C2 X12 Y12 Z12
        …

Row and Column indices appearing as columns, followed by cell values for subsequent columns out to some maximum line length: This format stores sparse matrix values more efficiently, as in the example below, where cells at R2 C1, R3 C1, and R3 C2 are implicitly empty:

        R1 C1 V11 V12 V13
        R2 C2 V22 V23
        R3 C3 V33
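
A sketch of reading this ragged layout in Python, again deducing the dimensions from the data itself (assuming a square matrix, 1-based indices, and an assumed file name):

        import numpy as np

        entries = []
        with open("matrix_ragged.txt") as f:
            for line in f:
                fields = line.split()
                if not fields:
                    continue
                r, c = int(fields[0]), int(fields[1])
                entries.append((r, c, [float(v) for v in fields[2:]]))

        # The matrix size is deduced from the data; unlisted cells stay empty (zero).
        n = max(max(r, c + len(values) - 1) for r, c, values in entries)
        matrix = np.zeros((n, n))
        for r, c, values in entries:
            matrix[r - 1, c - 1:c - 1 + len(values)] = values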

Explicit metadata

Most matrices that are stored as text and used for data interchange have a documented format that includes additional metadata. Here are some examples:

TNTP Format

This format is used by Hillel Bar-Gera on a set of pages he maintains for sample networks and flow data for the “Transportation Network Test Problem” (TNTP). The TNTP format is not explicitly documented, but is easy to decipher.

In the TNTP format, metadata occurs at the beginning of the file and specific metadata items are enclosed in angle brackets. The essential metadata consists of <NUMBER OF ZONES> and <TOTAL OD FLOW> (the latter being the sum of all values in the matrix). Metadata ends with the <END OF METADATA> tag.

The remainder of the file stores cell data arranged by origin zone (line ends and spaces are used to separate items but have no other significance). Each row (origin) of data starts with the text “Origin N” followed by whitespace, where N is the number of that origin zone. Each cell consists of the column (destination) zone, a colon, and the value of the O/D cell. Each cell entry is terminated by a semi-colon. Here is a sample set of data:

        Origin 1   1 : 22.0; 2 : 13.0; 3 : 27.0; Origin 2 1 : 16.0; 2: 28.0; 3: 11.0;
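
A sketch of a reader for this layout follows; the metadata tag names used here follow the usual TNTP convention but should be verified against the actual files.

        import re
        import numpy as np

        def read_tntp_od(path):
            text = open(path).read()
            # Tag names assumed to match the common TNTP convention.
            zones = int(re.search(r"<NUMBER OF ZONES>\s*(\d+)", text).group(1))
            body = text.split("<END OF METADATA>", 1)[1]
            matrix = np.zeros((zones, zones))
            # Each block is "Origin N" followed by "destination : value ;" pairs.
            for block in re.split(r"Origin", body)[1:]:
                origin = int(block.split()[0])
                for dest, value in re.findall(r"(\d+)\s*:\s*([-\d.eE+]+)\s*;", block):
                    matrix[origin - 1, int(dest) - 1] = float(value)
            return matrix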

NIST Matrix Market

The National Institute of Standards and Technology maintains what it calls the “Matrix Market”, described as:

A visual repository of test data for use in comparative studies of algorithms for numerical linear algebra, featuring nearly 500 sparse matrices from a variety of applications, as well as matrix generation tools and services.

The data formats in the Matrix Market are described here.

Two variants of the native Matrix Market format are provided: the coordinate format, where the row and column of each entry are explicitly identified, and the array format, where every entry is listed in a fixed (column-major) order so that positions are implicit. The coordinate format is more efficient for sparse matrices (where the matrix contains many empty values), and the array format is more efficient for dense matrices (where each cell has a non-empty value). The formats include optional comments (metadata intended for a human reader rather than for the software that parses the file), as well as structured metadata indicating how the rest of the data is formatted, how many rows and columns exist, and whether the matrix is presumed symmetric (so that only the lower triangle of data need be represented) or skew-symmetric (where the diagonal values are additionally all zero). Other variants, such as Hermitian matrices, are not relevant to transportation modeling applications.

Matrix Market data is typically compressed using the gzip algorithm for transmission efficiency. Software libraries to read and write the format are available in certain popular languages (C, Fortran, Matlab, and Python).
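
For example, the scipy.io module can write and read the coordinate variant directly; a minimal sketch, with an illustrative file name and comment:

        import numpy as np
        from scipy.io import mmread, mmwrite
        from scipy.sparse import coo_matrix

        # Write a sparse matrix in the coordinate variant, with a human-readable comment.
        trips = coo_matrix(np.array([[22.0, 0.0], [0.0, 11.0]]))
        mmwrite("trips.mtx", trips, comment="example OD matrix")

        # Read it back; coordinate files come back as a sparse matrix.
        roundtrip = mmread("trips.mtx").toarray()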

The most popular matrix market format used for the available datasets, however, is not the native one. It is the Harwell-Boeing format which is structured around the row-oriented 80-character line format that was the physical standard in the days of computer punch cards. Software libraries are available for Fortran and Matlab.

Economic I/O Format

The US Government Bureau of Economic Analysis releases economic input/output tables where the rows and columns indicate economic flows between, for example, different industrial sectors (http://www.bea.gov/itable/). These data are distributed in Excel workbooks with multiple tables in each spreadsheet. As with many Excel-based data distribution formats, transfer to other systems can be tricky since the spreadsheet includes row and column header information that can be difficult to parse automatically, and the multiple-worksheet format does not have an easy analog in other textual formats.

While Excel is widely used and is readable by open-source tools such as LibreOffice, the format is arcane, requires special library code to read, and encourages analysis directly within the spreadsheet environment, which is highly problematic from the standpoint of code transparency, verification and validation, and reuse.
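
A rough sketch of pulling all worksheets out of such a workbook with pandas; the file name and the number of header rows to skip are assumptions that vary from table to table:

        import pandas as pd

        # sheet_name=None loads every worksheet into a dict of DataFrames; the number of
        # header rows to skip (here 5) is an assumption and differs from table to table.
        tables = pd.read_excel("io_use_table.xlsx", sheet_name=None, skiprows=5)
        for name, frame in tables.items():
            print(name, frame.shape)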

Census Journey-to-Work Data

The journey-to-work data is part of the Census Transportation Planning Products (for example, here -- login required to access the data). The format used for distribution resembles the BEA Economic I/O format and is distributed in CSV (text) or XLS (Excel) form. Header information can be difficult to parse automatically (it is intended for human readers). The format resembles the explicit row/column matrix format, with each row corresponding to a “cell” that represents the flow from one census geographic location to another.