Data Ingest

Criminal Justice Data

Criminal justice data is notoriously difficult to manage, even more so to put into action to power reform efforts. Getting data that’s fresh and granular enough to change policy and practice is a herculean task. Products that could drive behavior change and improve outcomes are rare because the data to power them isn’t in a usable state. Several key barriers inhibit access, understanding, and utility of criminal justice data, including friction in obtaining access, insufficient sharing and tracking across systems, infrequent and irregular collection, lack of standards in collection and measurement, and impenetrable reporting.

We ingest data from fragmented criminal justice data silos, including jails, probation, prisons, and parole. In each jurisdiction, we establish an automated process that brings in new, raw data nightly or weekly and runs it through a pipeline that transforms it into a standardized, universal format for analysis.

The ingest pipeline is designed to take lossy, uneven data from many silos, each with its own organizational structure and assumptions, and produce a unified, generalized dataset.
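
To make that concrete, here is a minimal Python sketch of the kind of per-jurisdiction normalization step the pipeline performs. All names here (`StandardizedBooking`, `normalize_us_xx_booking`, the raw column names) are hypothetical illustrations, not the actual pipeline code.

```python
# Hypothetical sketch: map one jurisdiction's raw roster row into a
# standardized record. Field and column names are invented for illustration.
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class StandardizedBooking:
    """Universal representation of a booking, regardless of source silo."""
    region: str
    external_id: str
    admission_date: Optional[date]
    facility: Optional[str]


def normalize_us_xx_booking(raw: dict) -> StandardizedBooking:
    """Transform one raw row from a hypothetical 'US_XX' jail roster."""
    admission = raw.get("BOOK_DT")
    return StandardizedBooking(
        region="us_xx",
        external_id=str(raw["INMATE_NO"]),
        admission_date=(
            datetime.strptime(admission, "%m/%d/%Y").date() if admission else None
        ),
        facility=raw.get("FACILITY_NAME") or None,
    )


# Example raw row as it might arrive from a county source system.
print(normalize_us_xx_booking(
    {"INMATE_NO": "12345", "BOOK_DT": "10/01/2019", "FACILITY_NAME": "County Jail"}
))
```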

Ingest Channels

There are currently three types of channels for ingest: scrape-based individual-level ingest, scrape-based aggregate-level ingest, and direct ingest. These are described in their respective wiki pages.

  • Direct Ingest - broadly speaking, this is the direct exchange of data from justice agencies to the Recidiviz platform. This can take a number of forms, either push-based or pull-based. Either way, we create one or more YAML files to control the mapping of their data into our schema and extend the appropriate controller logic (see the sketch after this list).
    • For push-based ingest, we create a secured upload location to which only the agency is permitted to write, and provide an encryption key and a simple upload command to agency IT staff, assisting them in integration testing. They set up a cron schedule to execute this command with their existing tools.
    • For pull-based ingest, we schedule a call to our controller for this region, which makes the appropriate request to the agency's system with whichever authentication scheme is available. We need only provide the desired credential information to the agency IT staff and/or ensure they open any firewall exceptions, as required.
  • Scrape-Based Individual-level Ingest - many correctional systems, court systems, and other agencies publish information about individuals interacting with the system to publicly available websites, either as a public benefit or a matter of compliance with applicable law.
    • We identify those websites which we are permitted to scrape and pull in their information on a nightly basis, navigating their page structure and normalizing their data entities and fields into our own schema.
  • Scrape-Based Aggregate-level Ingest - many governments publish pre-aggregated reports about the performance of their agencies and justice systems. Sometimes these are rolled up to the state level, and sometimes these are broken down at lower levels, such as the county, court, or facility.
    • Regardless of whether this information overlaps with the individual-level scrapers we have for the same jurisdictions, we want to gather and standardize this information to assist in analysis and validation of our own efforts. We scrape the sites where these reports are published to identify new reports, parse the aggregates out of the report structure, and save them to our data warehouse.
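
As a companion to the Direct Ingest item above, the sketch below shows the general idea of a YAML-controlled mapping from an agency's raw columns to our schema fields. The file tag, keys, and column names are invented for illustration and do not reflect the real configuration format.

```python
# Hypothetical sketch of YAML-driven direct ingest mapping (not the actual
# Recidiviz configuration format). Requires PyYAML.
import csv
import io

import yaml

MAPPING_YAML = """
file_tag: us_xx_offender
primary_key: OFFENDER_ID
columns:
  OFFENDER_ID: external_id
  LAST_NM: surname
  BIRTH_DT: birthdate
"""


def ingest_csv(raw_csv, mapping_yaml):
    """Apply a column-to-field mapping to a raw CSV dump from an agency."""
    mapping = yaml.safe_load(mapping_yaml)["columns"]
    return [
        {field: raw_row[column] for column, field in mapping.items()}
        for raw_row in csv.DictReader(io.StringIO(raw_csv))
    ]


raw = "OFFENDER_ID,LAST_NM,BIRTH_DT\n123,DOE,1980-01-01\n"
print(ingest_csv(raw, MAPPING_YAML))
# [{'external_id': '123', 'surname': 'DOE', 'birthdate': '1980-01-01'}]
```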

Protocols

Regardless of ingest channel, we can parse information provided in many different protocols. At present, this includes at least:

  • Flat files - CSV, JSON, HTML, and XML via our data extractors. Extending support to other flat file protocols is simply a matter of building a new implementation of the data extractor interface.
  • API requests - requests directly to APIs that expose data with structured responses, typically as JSON or XML. Once the response is received, it is passed into the relevant data extractor as described above.
  • Database dumps - for either push- or pull-based direct ingest, we are able to take in a database dump of a source data system, save it to a dedicated, isolated instance in our Cloud SQL cluster, and then query result sets out of that dump for processing.
  • PDF files - via Tabula and Pandas, we can parse PDFs into structured tables of data as key-value pairs (see the sketch below). This is a somewhat static setup, as it relies on an understanding of how the PDFs are structured ahead of time. These PDFs are parsed via our PDF Parsing service hosted on Google App Engine.
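
The sketch below illustrates the PDF case with tabula-py and pandas. The report path and column names are hypothetical, and the real PDF Parsing service does considerably more than this; it only shows the core extraction step.

```python
# Hypothetical sketch: pull aggregate figures out of a published PDF report
# using tabula-py (requires a Java runtime) and pandas.
import tabula

# Extract every table on every page of the report into pandas DataFrames.
tables = tabula.read_pdf(
    "reports/us_xx_jail_report.pdf", pages="all", multiple_tables=True
)

# Turn the first table into simple key-value pairs (facility -> population).
report = tables[0]
facility_populations = dict(zip(report["Facility"], report["Population"]))
print(facility_populations)
```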