Skip to content
Liqi Xu edited this page Oct 14, 2017 · 5 revisions

OrpheusDB: Bolt-On Versioning for Relational Databases

OrpheusDB is a hosted system that supports relational dataset version management. OrpheusDB is built on top of standard relational databases (currently PostgreSQL), thus it inherits much of the same benefits of relational databases, while also compactly storing, tracking, and recreating versions on demand, all very efficiently.

Our current version supports advanced querying capabilities, using both the git-style version control commands, as well as SQL queries on one or more dataset versions. The paper describing the design, functionality, optimization, and performance evaluation can be found at this link.

OrpheusDB is a multi-year project, supported by the National Science Foundation via award number 1513407. It shares the vision of the DataHub project in supporting collaborative data analytics.

Key Design Innovations

  • OrpheusDB is built on top of a traditional relational database, thus it inherits all of the standard benefits of relational database systems "for free"
  • OrpheusDB supports advanced querying and versioning capabilities, via both SQL queries and git-style version control commands.
  • OrpheusDB uses a sophisticated data model, coupled with partition optimization algorithms1, to provide efficient version control performance over large-scale datasets.

Dataset Version Control in OrpheusDB

The fundamental unit of storage within OrpheusDB is a collaborative versioned dataset (CVD) to which one or more users can contribute, representing a collection of versions of a single relational dataset, with a fixed schema. There is a many-to-many relationship between records in the relation and versions that are captured within the CVD: each record can belong to many versions, and each version can contain many records.

Users can operate on CVDs much like they would with source code version control. The checkout command allows users to materialize one or more specific versions of a CVD as a newly created regular table within a relational database or as a csv file; the commit command allows users to add a new version to a CVD by making the local changes made by the user on their materialized table or on their exported csv file visible to others. Other git-style commands we support include init, create_user, config, whoami, ls, drop, and optimize.

Users can also execute SQL queries on one or more relational dataset versions within a CVD via the command line using the run command, without requiring the corresponding dataset versions to be materialized. Beyond executing queries on a small number of versions, users can also apply aggregation grouped by version ids, or identify versions that satisfy some property.

Data Model

Each CVD in OrpheusDB corresponds to three underlying relational tables: the data table, the index table, and the version table. To capture dataset versions, we represent the records of a dataset in the data table and mapping between versions and records in the index table. Finally, we store version-level provenance information in the version table, including attributes such as author, num_records, parent, children, create_time, commit_time, and commit_msg.

Our experimental evaluation demonstrates that, compared to other alternative data models, our data model, coupled with the partition optimizer results in 10x less storage consumption, 1000x less time for commit and comparable query performance for the checkout command. In other words, OrpheusDB achieves an efficient balance between storage consumption and query latencies.