
Case Studies


Freight Analysis Framework

The file FAF Case Study.odt contains a brief overview of the data structure of the Freight Analysis Framework Version 3 (FAF3) database. FAF3 contains information on freight flows between states, metropolitan areas, and (where relevant) foreign regions. The data are not stored as a matrix, but can be readily transformed into one, as the sketch below illustrates. Additional FAF case study files accompany this overview.
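For illustration, here is a minimal sketch of that transformation in Python/pandas. The file name and column names ("dms_orig", "dms_dest", "tons") are assumptions, not the actual FAF3 schema, and should be replaced with the field names in the extract being used.

```python
# Minimal sketch: pivot FAF-style origin-destination flow records into a matrix.
# The file name and column names are assumptions, not the actual FAF3 schema.
import pandas as pd

flows = pd.read_csv("faf3_flows.csv")           # hypothetical extract of FAF3 records
matrix = flows.pivot_table(index="dms_orig",    # origin region
                           columns="dms_dest",  # destination region
                           values="tons",       # flow quantity
                           aggfunc="sum",
                           fill_value=0.0)
print(matrix.shape)                             # regions x regions
```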

PSRC and EMME

In the past, our approach was to crack the monolithic Emme data store and read out the matrix information we needed. This proved problematic for several reasons including:

  • It was labor-intensive to build binary parsers for each environment in which we wanted to read Emme matrices.
  • Existing reader/writer code frequently assumed that there were no holes in the zone numbering system.
  • Inro moved to a new external matrix format that was going to require modification of our various binary parsers.
  • In discussions with Inro, they stressed that they were API-stable, but that the binary format of the external matrices would likely change without warning, since using these matrices directly was not supported.
  • Any given forecast product may span several software systems, meaning that data in multiple formats made it cumbersome to archive inputs to integrated runs.

Feeding data back to Emme during runs also proved difficult because the primary means of data transfer into Emme was serialization to an ASCII-based file. In large runs, the time to read these CSV/TSV files grew very long.

In recent versions of Emme, Inro provides a Python-based API, which has opened up new options for reading and writing matrix data. We found that there were many existing serialization formats that we could easily write from Python: everything from delimited ASCII files to XML, to SQL, to the array storage containers used in other scientific disciplines.
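As a rough illustration of what becomes easy once matrix data is available as a NumPy array in Python (the Emme Modeller API extraction step itself is not shown), the following sketch writes the same placeholder matrix to delimited ASCII and to an HDF5 container. The file and dataset names are placeholders.

```python
# Sketch only: serialize a matrix (assumed to have been pulled from Emme via
# the Modeller API, which is not shown) to two of the formats mentioned above.
import numpy as np
import h5py

mat = np.random.rand(100, 100)              # placeholder for an Emme matrix

np.savetxt("mf01.csv", mat, delimiter=",")  # delimited ASCII

with h5py.File("skims.h5", "w") as f:       # array storage container (HDF5)
    f.create_dataset("mf01", data=mat, compression="gzip")
```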

I think I'll stop there with the problem description, rather than going into how we've been proceeding with HDF5, unless people think there is value in doing so. I see that as something of a different story.

Reading and Writing Matrices with EMME

In EMME 2 and 3, matrix data I/O was done directly at the binary level. With EMME 4, matrix data I/O is via the new Modeller Python API, since the binary format of the EMME database may be revised without notice. For one of our activity-based model development projects, we (Parsons Brinckerhoff) needed to pass hundreds of large matrices from EMME 4 to our Java-based software. We implemented a simple Python module to get EMME EMX matrices from the database and write the matrices in an open format called ZMX. We also implemented the reverse: reading from ZMX back into the EMME database. The ZMX readers and writers are here.
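For readers unfamiliar with the approach, the sketch below shows the general shape of such a writer: a full matrix is pulled into a NumPy array and stored in a zip archive with one binary entry per row. The entry names, byte order, and metadata shown are assumptions rather than the actual ZMX layout; the linked readers and writers define the real format.

```python
# Illustrative sketch only: NOT the actual ZMX format. It shows the general
# pattern of a zip archive holding simple metadata entries plus one binary
# entry per matrix row. Entry names and the big-endian float encoding are assumptions.
import zipfile
import numpy as np

def write_zip_matrix(filename, name, mat):
    mat = np.asarray(mat, dtype=">f4")                  # assumed: 4-byte big-endian floats
    with zipfile.ZipFile(filename, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("_name", name)                       # assumed metadata entries
        z.writestr("_rows", str(mat.shape[0]))
        z.writestr("_columns", str(mat.shape[1]))
        for i, row in enumerate(mat, start=1):
            z.writestr("row_%d" % i, row.tobytes())     # assumed per-row entry naming

write_zip_matrix("mf01.zmx", "mf01", np.zeros((3000, 3000)))
```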

We decided against using the new EMME binary matrix format since our Java-based software could already read ZMX files.

This solution works OK. It would be better if EMME wrote an open matrix format in the first place, so the data would not have to be copied. For this project, there are about 500 GB of matrices for one model run, and it takes time to copy that much data.

In the future, we'd like support for multiple matrices in one file as well.

Reading and Writing TransCAD Matrices with Java

TransCAD provides a simple API for accessing (reading and writing) TransCAD proprietary matrices from Java. It is basically a Java wrapper over JNI calls that access the matrices via TransCAD DLLs. To use the API, three conditions must be met:

  1. A jar library (included with every TransCAD installation) must be on the application's classpath.
  2. The TransCAD installation directory must be in the system path (alternatively, the application must be run from the TransCAD installation directory).
  3. A valid TransCAD key must be plugged into the machine running the application.

The API is simple and (relatively) quick, and we have developed libraries to interface with it. Typically, the API is used to read a TransCAD result matrix into memory; the matrix is then used internally and/or written out in a non-proprietary format. A similar workflow is used for writing matrices. In short, the TransCAD API (and TransCAD matrices in general) is used only at the edge of the application, where it interfaces with TransCAD itself. Put another way, TransCAD matrices are used only because TransCAD (effectively) requires them.

The aforementioned three conditions for using the API have created a number of friction points in our experience:

  • Including the TransCAD interfacing library in our generic code creates issues for clients that do not have TransCAD installed or are not using the TransCAD API. Our interfacing library references classes from the API, so compiling the code requires the API jar file. The solution has been to build "stub" (non-functioning) versions of the API classes to allow the code to compile (they essentially act like a C++ header file). If the actual API is to be used, the TransCAD jar library is used in their stead by placing it on the classpath ahead of our library's jar file.
  • Condition (2) is not well documented and is the source of numerous issues with the library. If the TransCAD installation directory is included only in the Java search path (rather than the system path), the initial API calls will succeed (they will find the DLL methods), but the Java search path is not inherited by the DLL, so its own internal calls will fail when it tries to use other DLLs from the TransCAD installation. This leads to confusing error messages which (wrongly) imply that the TransCAD Java interface will not work on a particular computer.
  • Condition (2) can also be onerous, as many clients will not (or will not want to) have the TransCAD installation directory in their system path. The easiest solution is to create a wrapper batch file to run the application, which temporarily modifies the path so that the condition is met.
  • The requirement of a TransCAD key creates complications when a model is run across multiple machines. Generally, having a key on each computer is not economically feasible. Instead, a single computer (with a TransCAD key) accesses the TransCAD matrices and disseminates them (in a non-proprietary format) to the other computers. While this is annoying, it meshes well with the fact that TransCAD functions (e.g. traffic assignment) can only be run on a computer with a TransCAD key. When large numbers of skim matrices are produced and need to be accessed, though, this policy makes the TransCAD API a bottleneck for the process.

One alternative used in the past is to have TransCAD write its matrices out in a "fixed-format binary" format. This format holds only a single matrix, does not provide easy random access, and is not compressed; however, unlike the proprietary format, it is well documented. The issue with using it is that it is slow, unwieldy, harder to read back into TransCAD, and takes up much more disk space (of increasing concern as models with increasingly large zonal systems use larger numbers of input/output matrices).

JEMnR Travel Demand Model

JEMnR (Jointly Estimated Model in R) is a model platform that ODOT uses for metropolitan area models. It is a four-step model that uses a destination choice model for trip distribution. The model was estimated by Metro, and ODOT (Ben Stabler) implemented it in R. The model uses binary R objects for storing matrices.

When the model was being developed, memory size was a significant issue, so the implementation uses disk storage extensively to manage data. Hundreds of matrices are created, stored, and read at various times during model execution. The matrices are organized using a directory structure and naming conventions: the R code composes the appropriate path and file name and uses it to save or fetch the desired matrix.
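To illustrate the pattern only (the actual JEMnR code is in R and stores R binary objects; this Python sketch is an analogy, and the directory layout and names are hypothetical):

```python
# Analogous sketch in Python of the "compose a path from naming conventions"
# pattern described above. The actual JEMnR implementation is in R and uses
# R binary objects; the directory layout here is hypothetical.
import os
import numpy as np

def matrix_path(root, purpose, period, name):
    return os.path.join(root, purpose, period, name + ".npy")

def save_matrix(root, purpose, period, name, mat):
    path = matrix_path(root, purpose, period, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    np.save(path, mat)

def fetch_matrix(root, purpose, period, name):
    return np.load(matrix_path(root, purpose, period, name))
```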

Although the approach uses a complicated directory structure and produces hundreds of files, it was relatively simple to implement and has worked reliably for the several JEMnR models that have been built and for the many model runs that have been carried out. The data challenges posed by JEMnR are not in operation, but in using and sharing the data. The various utility matrices and other matrices are useful for explaining model run results and for producing performance measures (e.g. a travel cost index and an auto dependence index; see http://www.oregon.gov/ODOT/td/tp_res/docs/reports/planningperformancemeasures.pdf). This is where the JEMnR approach complicates matters. Producing the desired measures, maps, etc. often requires loading a number of matrices, and scripts need to be written with an understanding of the directory structure and of how file names are built in order to load the files. Moreover, since the files are R binary objects, they are only accessible using R. This is not a problem in our unit, but it does limit what others who are not familiar with R can do.

If the data were stored in multi-dimensional arrays that were "file addressable" (don't know the right term to use here), it would be more straightforward to extract the data needed. In addition, if the arrays used a standard file format, it would be possible to use other languages to extract and use the model data.
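One way such file-addressable storage could look, sketched with HDF5; the container, group, and matrix names are hypothetical:

```python
# Illustrative only: a single container file in which each matrix is addressed
# by name and only the needed slice is read. File and matrix names are hypothetical.
import h5py

with h5py.File("jemnr_outputs.h5", "r") as f:
    util = f["hbw/utility"]        # address a matrix by name...
    block = util[0:100, 0:100]     # ...and read just the slice that is needed
```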

Comparison of LODES and MACA Data Access

LODES is the acronym for the LEHD Origin-Destination Employment Statistics data that is available online from the U.S. Census Bureau. The LEHD (Longitudinal Employer-Household Dynamics) program matches employer statistics collected from state employment departments with other data from the Census and the IRS to produce a detailed picture of commute patterns. The Census Bureau applies some geographic jittering to the origins and destinations to ensure confidentiality. Data is available by origin-destination pair, and by residence area characteristics and workplace area characteristics. A web-based application (OnTheMap) is available for simple querying and visualization.

MACA (Multivariate Adaptive Constructed Analogs) is statistically downscaled climate model data for the western United States. The datasets cover 14 climate models at a 4-km grid spatial resolution for the period from 1950 to 2100 in one-day increments. A file is typically about 3 GB in size and includes the data for one variable (e.g. precipitation) for a 10-year period for the entire grid.

The LODES data comes as zipped CSV files. The user chooses a state and a data type (origin-destination, residence area characteristics, workplace area characteristics) and is presented with a large number of files to download. Separate files cover different years and variables, and file naming conventions identify the data stored in each file. After downloading the data, accessing it in a sensible way requires some scripting to iterate through the files, open them, convert the data into a suitable data structure, and combine the results of several files. This can take a fair amount of programming.
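A sketch of the kind of scripting involved follows; the file naming pattern shown is an assumption about a particular origin-destination download.

```python
# Sketch of the scripting the LODES workflow requires: iterate over downloaded
# files, infer the year from each file name, and stack the results. The naming
# pattern (e.g. "or_od_main_JT00_2014.csv.gz") is an assumption about the
# particular files downloaded.
import glob
import re
import pandas as pd

frames = []
for path in glob.glob("lodes/or_od_main_JT00_*.csv.gz"):
    year = int(re.search(r"_(\d{4})\.csv\.gz$", path).group(1))
    df = pd.read_csv(path)
    df["year"] = year
    frames.append(df)

od = pd.concat(frames, ignore_index=True)   # one table spanning all years
```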

The MACA data is stored in NetCDF format. One dataset stores the data for one model and one variable for a (usually) 10-year period for the western United States in one-day increments. The file includes all of the metadata, such as the dimensions of the array (e.g. latitudes and longitudes of grid cells, the reference date for days) and the measurement units. Once the file is opened, a portion of the array can be extracted: for example, the daily precipitation for 1960 for the grid cells in a cutout covering Oregon.
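A sketch of that kind of extraction with the netCDF4 Python library follows; the file, variable, and dimension names are assumptions, since the file's own metadata lists the real ones.

```python
# Sketch of pulling a subset from a MACA-style NetCDF file. The file name and
# the variable/dimension names ("precipitation", "lat", "lon") are assumptions;
# the file's own metadata describes the actual contents.
from netCDF4 import Dataset
import numpy as np

nc = Dataset("macav2_pr_1950_1959.nc")   # hypothetical file name
lat = nc.variables["lat"][:]
lon = nc.variables["lon"][:]

# index ranges for a rough Oregon cutout
lat_idx = np.where((lat >= 42.0) & (lat <= 46.3))[0]
lon_idx = np.where((lon >= -124.6) & (lon <= -116.5))[0]

# daily values for the selected cells over the first year stored in the file
precip = nc.variables["precipitation"][0:365,
                                       lat_idx.min():lat_idx.max() + 1,
                                       lon_idx.min():lon_idx.max() + 1]
nc.close()
```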

It would be nice if the LODES data were available in a form like the MACA data. The data would be self-documenting, and the user could pull the desired portions from one dataset rather than having to build a script to download multiple text files, convert them to an appropriate structure, and combine them into the desired dataset.