Skip to content

RhizomeDB/spec

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

4 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

The Postmodern Database (PomoDB) Specification v0.2.0

Editors

Authors

Dependencies

Appendices

Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Abstract

PomoDB is a content-addressable database. It enables far-edge deployments (such as IoT and consumer devices) by synchronizing heterogenous sets of end-to-end encrypted data between peers in an eventually consistent manner.

1. Introduction

A concept is a brick. It can be used to build a courthouse of reason. Or it can be thrown through the window.

โ€• Deleuze & Guattari, A Thousand Plateaus: Capitalism and Schizophrenia

1.1 Motivation

Databases increasingly treat network partitions as unavoidable disturbances to uptime. Many mitigate these situations under a model that either assumes interruptions are bounded in length (e.g. a consensus round), halting operation of disconnected nodes, or accepting data loss of conflicting writes on reconnection with a leader node. Such techniques are unsuitable for far-edge and local-first applications, where disorder is the norm, network partitions are ubiquitous and unbounded, and there's an uncompromising need for availability.

Modern environments increasingly involve unstable dynamic network topologies of heterogenous peers for which common definitions of consistency often do not apply. In this context, continued operation under network partition or when loading data into another application is not only desirable, but further, enforcing continuous consistency between all peers is a meaningless proposition entirely. This begs the question: what if there was no concept of "primary site" and all data were considered subjective relative to the viewer.

1.2 Approach

No single theory ever agrees with all the facts in its domain.

โ€” Paul Feyerabend, Against Method

PomoDB addresses the above constraints with query semantics that guarantee eventual consistency between all peers with access to the relevant data, but never requires nor assumes full synchronization between clients. This enables PomoDB to act as a sound foundation for globally distributed data with an indeterminate number of transient peers, with varied access patterns, and ever changing access to data.

Historically databases have assumed an objective worldview: there is a single, correct, objective way to look at data. This can be convenient, but is also overly simplistic for many cases. As application state becomes increasingly distributed, partition-tolerant, private, and remixable, modeling data as subjective makes increasing sense. This is a postmodern worldview, which recognizes that there is no one true way to look at things, but rather multiple interpretations that are useful for different purposes (including mathematics and science). The only objective facts in PomoDB are that the quads themselves were asserted by their author, and their claimed causal ordering.

PomoDB's approach enables one to treat access control, latency, and partitions as equivalent, and embrace (not just tolerate) all three. Conceptually there is one, unchanging, universal database, and various observers merely come to learn of parts of the database (new data, new interpretations) over time. However, like in physics, this is not an instantaneous or evenly distributed process: some observers will learn of data before others, and may interact with asymmetric information. An observer doesn't need to learn of the entire universe to use the database: a subjective view is sufficient for the majority of cases. When stronger consistency is desired, the relevant observers can synchronize the relevant subset of context.

1.3 Environments

PomoDB is designed to:

  • Run anywhere
  • Be transport agnostic
  • Operate when disconnected from the network

Of particular interest are operating in browsers, mobile applications, and IoT devices.

1.4 Classification

PomoDB can be seen as a maximally-sharded schemaless database, a meta-circular CRDT framework, or both simultaneously. It decouples the raw data from interpretation, and expresses CRDTs as database queries. Other than being able to benefit from the last 50+ years of database research, this flexibility makes the database itself user extensible.

2. Design

Simple things should be simple and complex things should be possible.

โ€” Alan Kay

2.1 Relation

A relation is a set of tuples, where each component of the tuple is called an attribute, and can be referenced by an attribute name.

An n-ary relation contains n-tuples, which can each be written: $\langle a_1: v_1, a_2: v_2, \dots , a_n: v_n \rangle$, where $a_1, a_2, \dots , a_n$ give the names of each attribute, and $v_1, v_2, \dots , v_n$ are their values.

2.2 Time

Cause and effect are two sides of one fact.

โ€” Ralph Waldo Emerson, Circles

PomoDB does not make direct use of wall clocks, and instead MUST use causal consistency based on hashes as its first-class ordering mechanism. Facts MUST include any direct "happens after" or "caused by" relationships, which naturally forms a (partition tolerant) partial order on all facts.

2.3 Content Addressing

As PomoDB is intended for use in distributed and decentralized deployments, it is important ensure the use of collision resistant identifiers when referring to tuples. For this purpose, a content addressing scheme is used. Tuples are associated with a CID computed from their structure. The details behind this computation are available in the section on facts.

The choice of CIDs as primary key โ€” rather than more familiar choices such as auto incrementing IDs or UUIDs โ€” reflects PomoDB's goals in targeting distributed and decentralized environments, where coordination around the allocation of IDs can't be guaranteed, and where resilience against malicious and byzantine actors is required.

Since the content addressing schemes used by PomoDB MUST be backed by cryptographically secure hash functions, their use here prevents forgery of IDs by attackers, and guarantees that CID-based dependencies between tuples will always be acyclic.

These properties further enable byzantine-fault tolerant CRDTs.

2.4 Query Engine

PomoDB has no specified high-level query language. An intermediate representation based on Datalog is defined instead (PomoLogic). Implementations MAY define their own user-facing query language, but they are RECOMMENDED to treat PomoLogic as a common compilation target for all such languages.

An OPTIONAL relational runtime that MAY be compiled from PomoLogic is described in PomoRA. This can be further optimized with a dataflow runtime for incrementalization.

2.5 Sources

Sources act as ingress points for a PomoDB query, and introduce tuples from the outside world, such as by loading them from a local persistence layer, or by querying them from a remote data source such as IPFS.

Sources can be queried as if they were relations. For example, scoping a query to facts from a certain source is possible.

Implementations MAY define their own sources, but sources SHOULD be non-blocking, and are RECOMMENDED to perform any blocking or IO-intensive operations asynchronously.

Sources MAY emit deltas of tuples, if a runtime able to take advantage of incremental computation is being used, like PomoFlow

Implementations MAY also support user defined sources, such as to facilitate the integration of PomoDB into external systems for persistence or communication.

2.6 Sinks

Sinks act as egress points for a PomoDB query, and emit tuples to the outside world for further processing or storage.

Sinks can be inserted into as if they were relations.

Implementations MAY define their own sinks, but sinks SHOULD be non-blocking, and are RECOMMENDED to perform any blocking or IO-intensive operations asynchronously.

Implementations MAY also support user defined sinks, such as to facilitate the integration of PomoDB into external systems for persistence or communication.

3 Components

PomoDB is composed of several cleanly separated layers. There is a distinction between the hard technical dependencies between these components and the way that this spec has decided to compose them for various pragmatic reasons.

3.1 Components

The raw dependencies between components in the systems stack as follows:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                                                         โ”‚
โ”‚                         PomoDB                          โ”‚
โ”‚                           SDK                           โ”‚
โ”‚                                                         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                 โ”‚ โ”‚                 โ”‚ โ”‚                 โ”‚
โ”‚   Query DSL 1   โ”‚ โ”‚   Query DSL 2   โ”‚ โ”‚   Query DSL 3   โ”‚
โ”‚                 โ”‚ โ”‚                 โ”‚ โ”‚                 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                                                         โ”‚
โ”‚                        Compiler                         โ”‚
โ”‚                                                         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                 โ”‚ โ”‚                 โ”‚ โ”‚                 โ”‚
โ”‚     Datalog     โ”‚ โ”‚    Relational   โ”‚ โ”‚  Differential   โ”‚
โ”‚                 โ”‚ โ”‚     Algebra     โ”‚ โ”‚    Dataflow     โ”‚
โ”‚                 โ”‚ โ”‚                 โ”‚ โ”‚                 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                                                         โ”‚
โ”‚                        Pomo Store                       โ”‚
โ”‚                        EDB & IDB                        โ”‚
โ”‚                                                         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                                                         โ”‚
โ”‚                Abstract Object Store                    โ”‚
โ”‚                Content Addressed Data                   โ”‚
โ”‚                                                         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

PomoDB chooses to implement this as a the following stack:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                                                         โ”‚
โ”‚                         PomoDB                          โ”‚
โ”‚                           SDK                           โ”‚
โ”‚                                                         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                 โ”‚ โ”‚                 โ”‚ โ”‚                 โ”‚
โ”‚   Query DSL 1   โ”‚ โ”‚   Query DSL 2   โ”‚ โ”‚   Query DSL 3   โ”‚
โ”‚                 โ”‚ โ”‚                 โ”‚ โ”‚                 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                                                         โ”‚
โ”‚                       Pomo Logic                        โ”‚
โ”‚                       Datalog IR                        โ”‚
โ”‚                                                         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                                                         โ”‚
โ”‚                        Pomo RA                          โ”‚
โ”‚                     Query Planner                       โ”‚
โ”‚                                                         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                                                         โ”‚
โ”‚                        Pomo Flow                        โ”‚
โ”‚              Differential Dataflow Engine               โ”‚
โ”‚                                                         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                                                         โ”‚
โ”‚                        Pomo Store                       โ”‚
โ”‚                Content Addressed Tuples                 โ”‚
โ”‚                                                         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                           โ”‚ โ”‚                           โ”‚
โ”‚   WebNative File System   โ”‚ โ”‚      In-Memory Store      โ”‚
โ”‚        Synced Store       โ”‚ โ”‚       Ephemeral Data      โ”‚
โ”‚                           โ”‚ โ”‚                           โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                 โ”‚  โ”‚                                    โ”‚
โ”‚  Remote Store   โ”‚  โ”‚             Local Store            โ”‚
โ”‚                 โ”‚  โ”‚                                    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

3.1.1 Pomo Logic

PomoDB is query language agnostic, and implementations MAY define arbitrary query frontends. PomoLogic is a variant of Datalog, designed for recursive query processing.

3.1.2 Pomo Flow

PomoFlow is a dataflow runtime for PomoDB. This design is especially suited for incrementalizing programs to efficiently compute over deltas to the input EDB.

3.1.3 Pomo RA

Relational algebra is the theory underpinning relational databases and the query languages against them. It provides a small core language of relational operators that serve as simple building blocks for more complex forms of query processing.

This specification describes PomoRA, a representation of the relational algebra intended for use as an intermediate representation for PomoDB query languages.

4 Extensional Database

The data layout stored to disk is the "extensional database" (EDB). This is the "ground truth" database, from which all other knowledge is derived via queries.

4.1 Types

4.1.1 Primitive Types

The primitive types supported by PomoDB MUST be as below. Note that this is at the layer of PomoDB, and the concrete persisted encoding (e.g. JSON, CBOR, Protobuf) MAY be different.

Name Size Description
Boolean One bit A boolean value.
Integer 64-bit A signed integer.
Double 64-bit Double-precision floating-point number, as described in IEEE 754-2019. NaNs SHOULD NOT be considered valid.
String Unbounded A UTF8 string. Empty strings MUST be considered valid.
Bytes Unbounded Zero or more binary octets.
CID Unbounded A CID. Avoiding the identity hash is RECOMMENDED.

Note that there is no primitive unit or null. Emulating these constructs in data modeling is NOT RECOMMENDED, as they are semantically anemic and alternative modelings are nearly always available. For example, a common use case for using null is to unset a field. CRDTs SHOULD use history-preserving tombstoning for this purpose instead.

4.1.2 Fact

The extensional database MUST only store facts as quads (4-tuples) in EAVC1 format. All fields are REQUIRED, though the causal set MAY be empty.

Field Type Description
Entity ID Bytes Entity ID
Attribute Attribute The name of the value's relationship to the entity
Value Value The (primitive) value being associated
Causes Set CID Any direct causal links
type Fact = {
  eid: Bytes;
  attr: Attribute;
  val: Value;
  causal: Set<Cid>;
}

The first three fields (entity, attribute, and value) are analogous to a subject-predicate-object statement. For example, "the sky is blue" MAY be represented as $\langle \textsf{skyEID}, \textsf{color}, \textsf{blue} \rangle$.

4.1.2.1 Implicit CID

Each fact has an implied CID. This behaves as an index on all facts. Being derived from the hash of the fact means that the CID can always be rederived.

type CidIndex = {[Cid]: Fact}

As described in the section on time, causal relationships are one way of representing order. This is the RECOMMENDED ordering mechanism since including hashes a priori implies a happened-after relationship.2

Using the "sky is blue" example above, how would that be updated for the evening?

$$ \begin{align} \textsf{bafy...noon} &= \langle \textsf{skyEID}, \textsf{color}, \textsf{blue}, \emptyset \rangle\\ \textsf{bafy...sunset} &= \langle \textsf{skyEID}, \textsf{color}, \textsf{orange}, { \textsf{bafy...noon} } \rangle\\ \textsf{bafy...midnight} &= \langle \textsf{skyEID}, \textsf{color}, \textsf{black}, { \textsf{bafy...sunset} } \rangle \end{align} $$

flowchart RL
    sunset   -- after --> noon
    midnight -- after --> sunset

4.1.2.2 Example Graph

CID Entity ID Attribute Value Cause
bafy-abc 789 home Entity 456 []
bafy-def 246 home Entity 357 []
bafy-jkl 456 is city []
bafy-mno 456 name Vancouver []
bafy-pqr 357 name Calgary []
bafy-stu 246 last_name Zelenka []
bafy-vwx 357 is city []
bafy-yzA 789 first_name Boris []
bafy-BCD 246 first_name Brooklyn []
bafy-EFG 246 home Entity 456 [bafy-def]
bafy-GHI 789 last_name Mann []
flowchart TB
    subgraph Legend
      direction TB

      entity{{Entity}} --- att([Attribute]):::attr ----> Value
      
      %% Layout hack
      att ~~~ cause ~~~ att2
      entity ~~~ cause

      att2 -.- cause>after/cause]:::attr -.-> att
      att2  --> AnotherValue
      entity --- att2([AnotherAttribute]):::attr
    end

    Legend ~~~ entityBZ

    entityBZ{{246}} --- bzFN([first name]):::attr --> Brooklyn
    entityBZ --- bzLN([last name]):::attr --> Zelenka

    entityBZ --- bzOldCity([home]):::attr --> entityCgy
    entityBZ --- bzNewCity([home]):::attr --> entityVC

    bzNewCity -.- after>after/cause]:::attr .-> bzOldCity

    entityBM --- bmCity([home]):::attr --> entityVC
    entityBM{{789}} --- bmFN([first name]):::attr --> Boris
    entityBM --- bmLN([last name]):::attr --> Mann

    entityCgy --- calName([name]):::attr --> Calgary
    entityCgy{{357}} --- calIs([is]):::attr --> city

    entityVC{{456}} --- vcIs([is]):::attr --> city
    entityVC --- vcName([name]):::attr --> Vancouver

    style entityBM fill:orange
    style entityBZ fill:purple
    style entityVC fill:blue
    style entityCgy fill:red

    classDef attr fill:lightgrey,color:grey,stroke:none;

4.1.3 Entity ID

An "entity" MUST represent some subject in the database that can have an attribute. Since names are not unique (and are in fact an attribute), each entity needs a unique identity. Using a random number of at least 128-bits when generating a fresh entity is is RECOMMENDED.

type EntityID = Bytes

Especially since PomoDB allows for creation IDs without coordination, a real-world entity MAY be represented by multiple entity IDs. This case SHOULD be handed at the query layer by selecting for all relevant entity IDs.

4.1.4 Attribute

Attributes MUST be an integer, double-precision float, UTF8 string, or binary:

type Attribute
  = Integer // e.g. Normal indices
  | Double  // e.g. Fractional indices
  | Utf8
  | Bytes

4.1.5 Value

The EDB MUST support the following primitive value types:

type Value
  = Boolean
  | Integer
  | Double
  | Utf8
  | Cid
  | Bytes

4.1.6 Cause

Causal links MUST be expressed as an unordered set of CIDs, where those CIDs are other PomoDB fact CIDs. CIDs MUST be deduplicated at read time (though deduplicating and sorting at write time is RECOMMENDED).

type Cause = Set<Cid>

4.2 Graph Topologies

All relations in PomoDB MAY be expressed in several ways, but common terminology around graphs is RECOMMENDED as it is very clear and pictorial.

4.2.1 Grouping: Set & Stars

Grouping by entity ID, attribute, or value all produce sets. In graph terms, this is expressed as a "star" or "hub and spoke" topology.

Role Description
Hub The center of the relation / the item being "grouped by"
Spoke The thing being related
flowchart RL
    sun(("โ˜€๏ธ"))

    earth(("๐ŸŒ"))
    saturn(("๐Ÿช"))
    alien(("๐Ÿ‘ฝ"))
    uap(("๐Ÿ›ธ"))
    meteor(("โ˜„๏ธ"))

    earth --- sun
    sun --- saturn
    alien --- sun
    sun --- uap
    meteor --- sun

Multiple related hubs are possible.

flowchart RL
    sun(("โ˜€๏ธ"))

    earth(("๐ŸŒ"))
    saturn(("๐Ÿช"))
    alien(("๐Ÿ‘ฝ"))
    uap(("๐Ÿ›ธ"))
    meteor(("โ˜„๏ธ"))

    earth ------ sun
    sun --- saturn
    alien --- sun
    sun --- uap
    meteor --- sun

    moon(("๐ŸŒ–")) --- earth
    satelite(("๐Ÿ›ฐ๏ธ")) --- earth
    earth --- astro(("๐Ÿ‘ฉโ€๐Ÿš€"))

4.2.2 Ordering & Causality

An important order (or "sort") in PomoDB is causal order. This builds up a graph of pointers in the Cause field.

Below is a example causal graph from three writers. The grouping in this diagram is not important, but is presented this way here for explanatory purposes. In reality, if two writers commit the same fact to the database with the same causal history, they would be be deduplicated since they would have the same CID.

flowchart RL
    classDef genesis fill: blue;
    classDef head    fill: green;

    subgraph Alice
        almond[bafy...almond]:::genesis
        apple[bafy...apple]
        adobo[bafy...adobo]
        agave[bafy...agave]
        ambrosia[bafy...ambrosia]
        avocado[bafy...avocado]
        asiago[bafy...asiago]
    end

    subgraph Bob
        bacon[bafy...bacon]:::genesis
        bagel[bafy...bagel]
        banana[bafy..banana]
        bean[bafy...bean]
        bun[bafy...bun]
        brie[bafy...brie]
        butter[bafy...butter]
        brine[bafy...brine]
        baklava[bafy...baklava]
        berry[bafy...berry]:::head
    end

    subgraph Carol
      cake[bafy...cake]
      cinnamon[bafy...cinnamon]
      cherry[bafy...cherry]
      calamari[bafy...calamari]
      carrot[bafy...carrot]
      chocolate[bafy...chocolate]
      coffee[bafy...coffee]:::head
    end

    apple --> almond
    adobo --> apple
    agave --> bagel
    agave --> adobo
    bagel --> bacon
    banana --> bagel
    avocado ----> ambrosia
    cake --> banana
    cinnamon ==> cake
    cherry --> cinnamon
    calamari --> cherry
    carrot ===> calamari
    chocolate ---> carrot
    chocolate ==> avocado
    calamari ==> bean
    berry --> brie
    bun ==> carrot
    butter --> bun
    brine --> butter
    baklava --> brine
    brie --> baklava
    brie --> asiago
    asiago ----> avocado
    bean -----> banana
    ambrosia ==> cinnamon
    cake ==> agave
    avocado --> calamari
    coffee -----> chocolate
    baklava ==> chocolate

    %% Transitive Path Styles
        %% baklava -> chocolate -> avodcado
           linkStyle 13 stroke: DodgerBlue;
           linkStyle 28 stroke: DodgerBlue;

        %% calamari -> bean -> adobo
            linkStyle 11 stroke: orange;
            linkStyle 14 stroke: orange;
            linkStyle 16 stroke: orange;

        %% ambrosia -> cinnamon -> cake -> agave
            linkStyle 8  stroke: deeppink;
            linkStyle 24 stroke: deeppink;
            linkStyle 25 stroke: deeppink;

Note the direction of the arrows: they point from an event to their antecedent cause (i.e. they are the directed cause pointers in EVAC quads). This is sometimes counterintuitive when first encountered, since we normally reason from cause to effect. However, all facts in PomoDB are by their nature "in the causal past". Items in a node's causal history are called its "ancestors". Nodes that depend on a fact are called its "descendants". Graphs MAY be rooted.

4.2.2.1 Genesis Nodes

The "earliest" facts in the graph above are bafy...almond and bafy...bacon (highlighted in blue), as they have no asserted causes. These are called "genesis nodes" (or simply "geneses"). In formal terms, these are "causal sinks". These nodes MAY be treated as objectively the earliest in the graph: causal information is asserted at write-time, so there cannot be earlier, unlisted facts added or discovered later.

There MUST be at least one genesis node in an inhabited graph (even if it is a single-node graph). There MAY be an unbounded number of concurrent genesis nodes.

4.2.2.2 Head Nodes

The "latest" facts in the above graph are bafy...berry, and bafy...coffee (highlighted in green). These are called "head nodes", as they have no known descendants at read-time. In formal terms, these are "causal sources". These nodes are merely subjectively the most recent fact: it is possible that some other facts were written elsewhere, but simply have not arrived at the reader yet. Since more facts MAY be appended to the history at any time, the head MAY be updated, or more concurrent heads added. Despite this, the specific lineage of a particular fact are immutable.

There MUST be at least one head node in an inhabited graph (even if it is a single-node graph). There MAY be an unbounded number of concurrent head nodes.

4.2.2.3 Concurrent Nodes

When comparing any two nodes, if one does not appear in the causal history of the other, then they are said to be "concurrent" or "sibling" nodes. As any two causally disjoint graphs have (by definition) no dependencies on each other, most facts are concurrent.

4.2.2.4 Parent Nodes

Facts listed in the Cause field of another fact are said to be that fact's "parent nodes". A fact MUST have zero or more parents.

flowchart BT
    parentA
    parentB
    
    fact --> parentA
    fact --> parentB

4.2.2.5 Child Nodes

Facts that reference another fact in their Cause field are said to be that fact's "child nodes". A fact MUST have zero or more children.

flowchart BT
    childA --> fact
    childB --> fact

4.2.2.6 Ancestors & Provenance

A complete causal history is built up by recursively following parent edges, from the node being investigated back to its geneses. As long as there is an unbroken path from one node to another, it is said to be the "descendant" of its "ancestor". For example, in the above graph, bafy...avocado is an ancestor of bafy...baklava along the blue path.

As a sensible default, only direct parents SHOULD be listed in a Cause field, as the complete history is intact transitively. For example, in the graph above, bafy...ambrosia has no direct link to bafy...agave and bafy...bun has no direct link to bafy...bean because indirect, transitive histories exists (shown in pink and orange respectively). The fact that this path crosses writers or stores is immaterial.

5 Prior Art

There is a large amount of prior art in this area. Below are a few resources that were either direct influences, or frequently brought up as comparisons to PomoDB. They are presented here alphabetically:

Automerge is a nested CRDT that is capable of representing JSON. The Automerge team have developed many optimizations and demo applications for CRDTs in Local-First applications.

The BOOM project has explored Datalogs for various distributed systems use cases for well over a decade. "Disorderly programming", the CALM theorem, and Dedalus have been major inspirations for PomoDB.

Datasette is a tool for exploration, publishing, and managing relational data. A common use case for it is combining data from multiple sources and exploring it in rich interfaces.

Datomic was the first datalog experience for a lot of people. Datahike replicates a subset of the Datomic APIs, but is also able to run in the browser.

5.5 Mentat

Mentat is a client-side Datalog that strongly embraces schemaless design (though domain modeling is still encouraged).

One might say that Mentatโ€™s question is: โ€œWhat if an SQLite database could store arbitrary relations, for arbitrary consumers, without them having to coordinate an up-front storage-level schema?โ€

Cambria is a research project from Ink & Switch that attempts to solve JSON schema drift between different applications or versions. While Cambria uses lenses directly on JSON, PomoDB takes a lot of inspiration and expresses similar functionality as datalog rules.

5.7 RDF

The Resource Description Framework (or "RDF") is a mature data modeling and data store specification from the W3C. RDF has a very rich ecosystem of tools, and PomoDB draws from a lot of research based on RDF. It has been said that "PomoDB is RDF without formal ontologies and more CRDTs".

At time of writing, Soufflรฉ is one of โ€” if not "the" โ€” premier extended Datalogs. It is very featureful, with numerous additions to overcome limitations in "classical" Datalog that are otherwise limitations for richly modeling data, and working with large static data sets.

Footnotes

  1. This is sometimes called SPOC (๐Ÿ––) for "subject-predicate-object-context" โ†ฉ

  2. Assuming a cryptographic hash function. โ†ฉ

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published