Architectural Overview

The Recidiviz data platform consists of three major parts: ingest, calculate, and report.

  • Ingest takes in raw criminal justice data sources and transforms them into a common, standardized data layer.
  • Calculate operates atop this common data layer, tracking the key performance indicators and baseline metrics of the criminal justice system, within and across jurisdictional boundaries. Deeper analytics build upon these calculations to produce outcomes-based measurement and evaluation.
  • Report consists of a variety of tools and apps to provide this evaluation to the right person at the right time, driving interventions and reform efforts.

The platform documented herein is focused largely on ingest and calculate, while reporting tools tend to exist outside of this platform.

Cloud-Based

Our system is built atop and deployed to Google Cloud Platform. We leverage fully managed services for speed and effectiveness so that development can remain focused on how to reduce incarceration quickly and safely, instead of how to keep servers running. Our service layout is roughly as follows:

The service layout chart is non-exhaustive:

  • The ingest box in the top left is focused on direct ingest, i.e., the direct upload of data records from source systems into our platform (a minimal trigger sketch follows this list). Scrape-based ingest also exists and makes use of most of the same services.
  • The reporting box in the top right is a non-comprehensive example of how one reporting tool, our Dashboard web app, can consume from the platform. Other kinds of tools can access data through other pathways.
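
To make the direct ingest entryway concrete, here is a minimal sketch of a push-based trigger: a Cloud Function that fires when a source system uploads a raw file to a GCS bucket and enqueues it for asynchronous processing. The project, queue, and endpoint names are hypothetical, not the platform's actual identifiers.

```python
# Sketch of a push-based direct ingest trigger, assuming a Cloud Function
# wired to google.storage.object.finalize events on an upload bucket.
# Project, queue, and endpoint names are hypothetical.
from google.cloud import tasks_v2

PROJECT = "my-gcp-project"  # hypothetical
QUEUE = tasks_v2.CloudTasksClient.queue_path(
    PROJECT, "us-east1", "direct-ingest-queue"
)

def on_raw_file_uploaded(event, context):
    """Fires when a source system drops a raw file into the bucket."""
    file_path = f"gs://{event['bucket']}/{event['name']}"

    # Hand the file to the ingest pipeline via a Cloud Task so that the
    # transformation work runs asynchronously, with retries on failure.
    client = tasks_v2.CloudTasksClient()
    client.create_task(
        parent=QUEUE,
        task={
            "app_engine_http_request": {
                "http_method": tasks_v2.HttpMethod.POST,
                "relative_uri": "/direct/process_file",  # hypothetical endpoint
                "body": file_path.encode(),
            }
        },
    )
```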

End-to-end, briefly

End-to-end, the following sequence of events happens automatically with every new ingest of source data:

  1. Multiple entryways exist to bring source data into the ingest pipeline, both push- and pull-based.
  2. Once triggered, the ingest pipeline funnels data through a series of transformations until the data has been organized into our common schema and validated.
  3. This organized data is persisted into our production database, either one-by-one or in batch.
  4. The production database is exported in full to our data warehouse every night (see the load sketch after this list).
  5. From there, both batch-oriented and query-based processing generate an extensible set of metrics. Queries can reference other query results or even the results of batch jobs, as in the second sketch below.
  6. Data scientists and researchers can consume and build atop both of these directly within the data warehouse to add new metrics.
  7. Finally, APIs make these metrics and analyses available to downstream applications.
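
As a concrete illustration of step 4, the sketch below loads one table of the nightly export into the warehouse, assuming the export job has already written a CSV dump to Cloud Storage. The bucket, dataset, and table names are hypothetical.

```python
# Sketch: load a nightly CSV export from GCS into BigQuery, replacing
# the warehouse table wholesale. All names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    # The export is a full snapshot, so overwrite rather than append.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

client.load_table_from_uri(
    "gs://my-export-bucket/state_person.csv",  # hypothetical export path
    "my-gcp-project.state.state_person",       # hypothetical warehouse table
    job_config=job_config,
).result()  # block until the load completes
```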
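
For step 5, queries referencing other query results can be expressed as layered BigQuery views: the metric view below reads from an intermediate table that might itself be produced by a batch job. Dataset, view, and column names are illustrative only.

```python
# Sketch: an extensible query-based metric defined on top of another
# query's (or batch job's) output. All identifiers are illustrative.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical

client.query(
    """
    CREATE OR REPLACE VIEW `my-gcp-project.metrics.reincarceration_rate_by_district` AS
    SELECT
      district,
      AVG(IF(reincarcerated_within_3y, 1, 0)) AS reincarceration_rate
    FROM `my-gcp-project.metrics.releases_with_outcomes`  -- another query's output
    GROUP BY district
    """
).result()
```

Because each metric is just another view, new metrics can be added by layering queries without touching the upstream pipeline.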

Network Architecture

Our system is deployed into two separate, functionally identical environments: staging and production. Staging is used for all forms of development and testing, while production is used exclusively for user-facing operations.

Our general networking approach is to place as many of our internal services as possible behind Google's Identity-Aware Proxy (IAP), which ensures that both authentication and authorization have occurred before any request reaches one of our services. This is based on IAM permissions, which we configure with a set of custom roles for different kinds of internal users.

The primary components that live outside the boundaries of IAP are those required for user-facing functionality to operate, such as the Metric API servers that back our Pulse Dashboard application. We also have a few Compute Engine VMs that are explicitly whitelisted for access to certain data stores for research and operations purposes; even so, these are restricted by IAM permissions.
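
To make the IAP boundary concrete, the sketch below shows how a service behind the proxy could verify the signed JWT that IAP attaches to every proxied request, using Google's published IAP verification key endpoint. The audience string is a placeholder, and this is a minimal sketch rather than the exact check our services perform.

```python
# Sketch: verify the JWT that IAP attaches to proxied requests in the
# x-goog-iap-jwt-assertion header. The audience value is a placeholder.
from google.auth.transport import requests as google_requests
from google.oauth2 import id_token

IAP_PUBLIC_KEYS_URL = "https://www.gstatic.com/iap/verify/public_key"
EXPECTED_AUDIENCE = "/projects/123456789/apps/my-gcp-project"  # placeholder

def validate_iap_jwt(iap_jwt: str) -> str:
    """Returns the authenticated user's email, raising if the token is invalid."""
    decoded = id_token.verify_token(
        iap_jwt,
        google_requests.Request(),
        audience=EXPECTED_AUDIENCE,
        certs_url=IAP_PUBLIC_KEYS_URL,
    )
    return decoded["email"]
```

IAP enforces access before a request ever reaches the service, so an in-service check like this acts as defense in depth rather than the primary gate.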