
HDF5 is an open data format that is widely used in advanced scientific fields such as particle physics and bioinformatics. It is very feature-rich, well-supported, and quite complex due to its flexibility.

The HDF5 Format

HDF5 is shepherded by The HDF Group; the HDF format originated in the late 1980s to handle very large, complex datasets and their metadata. The file format is carefully specified for developers working on the format itself, but end users are definitely not expected to implement it themselves; instead, they typically use the HDF5 API to create, modify, and read HDF5 files.

To use HDF5 effectively, then, your application must have access to the HDF5 API. Most major languages are supported, including excellent support for C, C++, and Python (via pytables and h5py). Since the core API is written in C, Java and any other language that can call an external C API can use it as well once a wrapper (e.g. JNI) is in place. An object-oriented Java wrapper is maintained by The HDF Group, and a third party has duplicated this functionality for .Net.

Many matrix computing and data analysis packages can easily access HDF5 files, including R, Pandas (via pytables), and MATLAB/Octave.
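For instance, a Pandas one-liner can pull a table straight into a DataFrame (the file and key names below are invented, and the table is assumed to have been written with Pandas' own to_hdf/HDFStore machinery):

```python
import pandas as pd

# Pandas uses pytables under the hood for HDF5 access.
# Reads the node stored under the key "trips" into a DataFrame.
trips = pd.read_hdf("diary_data.h5", key="trips")
print(trips.head())
```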

Coding to the API is not very pretty. The API allows for extreme flexibility in data layout, compression levels, memory usage, and beyond... which makes getting up to speed on writing HDF5 code a bit arduous. An official specification for OMF would need to deliver high-quality sample code and converter utilities to help people get started.

The HDF5 source code is open source and compiles fairly easily on any major platform. (Personally, I've built it on Linux and Windows, for both 32-bit and 64-bit.) Microsoft build tools are recommended on Windows; the GNU MinGW toolchain is not officially supported and is not recommended.

HDF5 files can be stored raw or compressed natively; raw files are quite large compared to a Cube matrix containing the same data. Turning on native compression produces files roughly the same size as Cube's, depending on contents, but compression adds significant processing time to the write step.
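As a rough illustration of that trade-off, here is a minimal h5py sketch (the file names, dataset name, and matrix size are made up) that writes the same matrix raw and then with gzip compression:

```python
import numpy as np
import h5py

# A made-up 3000-zone matrix, purely for illustration.
trips = np.random.random((3000, 3000))

# Raw storage: fast to write, but the file is roughly rows x cols x 8 bytes.
with h5py.File("demand_raw.h5", "w") as f:
    f.create_dataset("trips", data=trips)

# Native (gzip) compression: a much smaller file, but a noticeably slower write.
with h5py.File("demand_gzip.h5", "w") as f:
    f.create_dataset("trips", data=trips, compression="gzip", compression_opts=4)
```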

There is a nice GUI viewer called "ViTables", built on pytables' HDF5 support. It is available at least on Linux and Windows, is extremely fast, and is very useful for browsing an HDF5 file's tree hierarchy, metadata, and actual data. It's awesome.

Using HDF5 for Transportation Matrices

The HDF5 format is essentially an empty container into which you can add data tables, hierarchically. Each node has a name, and can have unlimited metadata attached to it. Each node may also have subnodes underneath it. The API makes it easy to traverse the structure and find out what's in a file, or you can just fetch a table essentially by its path, e.g. /root/table1.
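For example, with h5py (the file and node names here are hypothetical), you can walk the tree, read a node's metadata, or fetch a table directly by its path:

```python
import h5py

with h5py.File("model_data.h5", "r") as f:
    # Walk the whole hierarchy and print every group/dataset name.
    f.visit(print)

    # Fetch a table directly by its path, as a NumPy array.
    table1 = f["/root/table1"][:]

    # Inspect the metadata ("attributes") attached to that node.
    for key, value in f["/root/table1"].attrs.items():
        print(key, value)
```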

For HDF5 to work as a common modeling format, this flexibility needs to be reined in: something closer to a flat list of tables is probably more appropriate, with a common set of metadata tags such as rows and columns, and perhaps some standard flags such as P/A vs. O/D.
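A sketch of what such a constrained layout might look like, written with h5py (the group name, attribute names, and flag values below are hypothetical, not part of any adopted standard):

```python
import numpy as np
import h5py

zones = 3000
work_trips = np.zeros((zones, zones))  # placeholder demand matrix

with h5py.File("omf_example.h5", "w") as f:
    # One flat group of matrices instead of arbitrary nesting.
    ds = f.create_dataset("/matrices/work_trips", data=work_trips)
    # A small, predictable set of metadata tags.
    ds.attrs["rows"] = zones
    ds.attrs["columns"] = zones
    ds.attrs["format"] = "PA"  # hypothetical flag: P/A vs. O/D
    ds.attrs["description"] = "Home-based work person trips"
```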

Extensibility: As long as an HDF5 modeling file conforms to the standard and has tables and metadata in the right places, any other user of HDF5 would easily be able to read and make sense of it. Furthermore, the file structure is inherently extensible: an agency could add additional tables or metadata into a file as long as that data doesn't conflict with the standard layout. In other words, you can add extra stuff to a standard-format file, and it would not affect the standard usage of the file at all.

HDF5 in Practice

HDF5 is already used in several transportation applications:

As used at SFCTA

The SFCTA's model (SF-CHAMP) uses the HDF5 format for all of its matrix storage. Since road and transit pathbuilding is still performed using Cube scripts, the Cube-format matrices are converted to and from HDF5 using a small executable converter program called "mat2h5.exe". This does add some extra processing time and storage. It would be much more efficient if Cube could read and write HDF5 natively once the standard is formalized. Similar converters could easily be written for other proprietary packages.

Diary-type data is also stored using HDF5; this "row-wise" data is similar in form to a database table, with dozens of columns and millions of rows. Countless Python scripts perform the post-processing analysis on these files using pytables.
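Reading and filtering one of these row-wise tables with pytables looks roughly like this (the file, node, and column names are invented for illustration):

```python
import tables

# Open a (hypothetical) diary file and grab one table node.
with tables.open_file("household_diaries.h5", mode="r") as h5:
    trips = h5.get_node("/trips")

    # In-kernel query: only the matching rows are pulled from disk.
    transit_trips = trips.read_where("(mode == 3) & (purpose == 1)")

    print(len(transit_trips), "transit trips")
```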

As used at PSRC

PSRC's intent is to use HDF5 as an interchange format for all model systems. In addition, the ability to store data in a single container has made it an attractive option for archiving scenario-specific model system inputs.

PSRC has begun to use HDF5 in conjunction with the EMME/4 software via Python. EMME/4 supports most operations from a Python-based API. Reading and writing matrices between HDF5 and in-memory NumPy arrays has proved nearly trivial, thanks to good support in the h5py module, and EMME matrices can be moved into and out of those NumPy arrays quickly and easily.
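A sketch of that round trip with h5py follows; the EMME Modeller calls themselves are skipped here, so the matrix below is just a placeholder NumPy array:

```python
import numpy as np
import h5py

# Pretend this array came out of an EMME matrix via the Modeller API.
auto_time = np.random.random((3000, 3000))

# Write the matrix to HDF5...
with h5py.File("skims.h5", "w") as f:
    f.create_dataset("auto_time", data=auto_time)

# ...and read it back into a NumPy array, ready to push back into EMME.
with h5py.File("skims.h5", "r") as f:
    auto_time_back = f["auto_time"][:]

assert np.allclose(auto_time, auto_time_back)
```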

To aid land use modeling, PSRC has added support for HDF5 to the UrbanSim / OPUS environment. That environment is also written in Python and makes use of the h5py module.

PSRC's DaySim activity-based model (which is being written in C#/.Net) must also exchange data with EMME. Support for HDF5 is currently being added by our consultant team at RSG.

In addition, an assortment of other tools is slowly being migrated to HDF5, including a benefit-cost analysis tool and various R-based visualizations. The BCA tool is currently written in Python and needs significant refactoring due to runtime performance issues; HDF5 support will be added during that refactoring. PSRC is also moving more of its ad hoc analyses and visualizations from Excel and similar tools to R, and some work has been done to read and write HDF5 data via the Bioconductor HDF5 package.

As used in Crowbar

Crowbar is an initiative to create service-oriented-architecture access to various modeling data formats. Currently Crowbar can read and write Cube, TransCAD, and HDF5 matrices. It requires a valid license for Cube or TransCAD matrices, but can freely read and write HDF5 matrices.

The HDF5 files that Crowbar reads and writes mimic the Cube file format: Each container just has a set of matrices under the root node, named however the user specifies, with standard metadata for "rows" and "columns". That's it! Crowbar includes its own API to read/write a matrix either in its entirety or row-by-row.
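Reproducing that access pattern with plain h5py is straightforward; in the sketch below the file and matrix names are made up, and the "rows" attribute is assumed to follow the layout described above:

```python
import h5py

with h5py.File("crowbar_style.h5", "r") as f:
    # Read an entire matrix at once.
    ivt = f["IVT"][:]

    # Or stream a matrix row-by-row without loading it all into memory.
    ds = f["BRDPEN"]
    n_rows = int(ds.attrs["rows"])
    for i in range(n_rows):
        row = ds[i, :]            # one row as a 1-D NumPy array
        row_total = row.sum()     # e.g. accumulate per-row statistics
```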

Pros and Cons

Pros

  • HDF5 is very well-known, very well-supported, essentially bug-free, and requires zero support from our team.
  • While the format itself is complex, that complexity is irrelevant as long as we define a standard layout and standard metadata for "OMF" style HDF5 files.
  • C, C++, Fortran, and Python are all natively supported by The HDF Group via the HDF5 API. Java and .NET wrappers exist, and implementations for R, MATLAB, and Pandas are also available. Other languages can call the C API through an external wrapper (such as JNI for Java).
  • A nice GUI viewer for HDF5 files already exists
  • Standard OMF files can be easily extended for special needs by end-users without compromising the ability to read the standard file components, as long as the official OMF layout and metadata are not modified.
  • A simple converter utility already exists for moving Cube matrices in and out of HDF5, and other converters can easily be written. (The existing converter would of course need to be updated to match the final OMF specification)
  • As a technology platform, HDF5 is well-suited to other data formats beyond matrix data, once we move on to tackle other modeling data types. It's nice to use the same tools for similar problems.

Cons

  • The HDF5 binary format itself is complex and requires use of the official API to do anything. (This is not necessarily a "con" but it's worth putting here for discussion)
  • Coding to the HDF5 API has a learning curve; you need to be careful about memory "slices" and other low-level details that should really just be hidden from end-users. (The excellent Python implementation hides most of this)
  • Writing HDF5 files with compression enabled is slower than writing native Cube files
  • HDF5 files written without compression are very large.