Skip to content

ActivitySchema/ActivitySchema

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

The Activity Schema data model is a standard designed for modern data warehouses. It is designed to make data modeling and analysis substantially simpler, faster, and more reliable than existing methodologies (e.g. a star schema)

The core approach is to separate modeling from querying. Models should be fully independent: asking new data questions should not require creating new models. Instead, queries can combine any number of existing models to build anything they need.

This is done by modeling all data in the warehouse as a single time series table.

For more info also see the home page or read the spec.


How it Works

At its core an activity schema consists of transforming raw tables into a single, time series table called an activity stream. All downstream plots, tables, materialized views, etc used for BI are built directly from that single table, with no other dependencies.


Traditional Data Modeling - many custom-structured tables

image



Activity Schema - one standard table

image

Business concepts are represented as entity doing an activity ('a customer completed an order') instead of facts or nouns (orders, products). Activities are built directly from source tables, store only their own data, and are the single source of truth for each concept.

All queries run against an activity stream table to assemble data for analysis, BI, and reporting. Instead of traditional foreign key joins, queries combine activities using relationships in time (e.g. all customers who completed an order and submitted a support ticket before their next completed order).



Advantages

Maintainable data models

  • fewer models — one (and only one) business concept per activity means fewer models to manage, understand, and maintain
  • easier to build - no joins between models means no need to tie disparate source systems together.
  • easily accommodate changes to source data — only need to update a single activity
  • simple data lineage — a single data layer makes tracing data provenance and debugging far easier
  • faster updates — time-series modeling means incremental updates (rather than full rebuilds) by default
  • lower data latency - no dependencies means data is available upon insertion, rather than waiting for cascading rebuilds
  • no data dictionaries — fewer models, with one concept each, makes them vastly easier to document

Faster analysis and querying

  • single source of truth — because activity represents a single concept (like a 'page view' or 'completed order'), it's always clear which activity to use
  • query across many source systems — time-based joins means any activity can be queried and combined with another without defining foreign keys
  • reusable analyses — a standard data model means that any analysis can be reused across companies. This means a customer acquisition cost calculation for one company can be shared with another
  • autogenerated queries — a standard data model means that queries don't have to be written by hand
  • true ad-hoc querying — because all activities are related in time, swapping one activity for another requires no structural changes to queries.
  • high performance — queries run substantially faster against an activity stream table, which has fewer columns, requires fewer joins, and can be easily partitioned / indexed by time.


Spec

Learn more by reading the full activity schema specification.


Resources

The activity schema home page has more info on the approach and its benefits


Building an activity schema

The implementation page covers how to build an activity schema in production.

A new activity schema dbt package has some helpful macros for building activity stream tables.


Known Implementations

Narrator provides a full implementation of the activity schema as a service. The Narrator team directly supports the growth of the activity schema, and is always happy to provide suggestions on how to build one yourself.


Community

The dbt Slack has a #modeling-activity-schema channel for discussion about the activity schema.