Aggregate Ingest

In addition to ingesting individual-level data via scraping, we also regularly scrape agency-published aggregate reports. We have aggregate-based scraping for approximately 10 states. Where there is overlap, those aggregate numbers are stitched together with aggregates rolled up from individual-level ingest for counties in the same states.

General scraper structure for aggregates

The state aggregate reports follow a different flow from our scrape-based individual-level ingest because they have a simpler informational structure: they tend to be PDFs containing tables of mostly consistently structured numbers. We persist these into dedicated aggregate tables, which are exported to our data warehouse alongside the individual-level tables.

  1. Scrape the site where the aggregate reports are published and return links to all published reports
  2. Check each link against the Cloud Storage bucket that contains already processed aggregate reports
  3. If a report has not already been processed, download it to the Cloud Storage bucket that temporarily holds reports to be ingested
  4. Each new download triggers a Cloud Function, which calls an endpoint in our PDF Parsing service (hosted on Google App Engine)
  5. The endpoint downloads the file to local memory and passes the file to the Java-based Tabula process (installed alongside our Python app in the Docker container)
  6. Via Tabula and Pandas, the file is converted into a dataframe containing key-value pairs for the ingested numbers, plus contextual information to aid in downstream analysis and stitching, such as the publish timestamp and report granularity (e.g. weekly, monthly); see the sketch after this list
  7. Once processing is complete, the file is moved to the "processed" bucket so it is not downloaded again by future ingest pipelines
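
As a rough sketch of steps 5 and 6, the snippet below shows how a report PDF might be turned into a long-format dataframe using the tabula-py wrapper around the Java Tabula process, plus Pandas. The function name, column names, and melt-based reshaping are illustrative assumptions, not the actual production schema.

```python
import datetime

import pandas as pd
import tabula  # tabula-py wrapper around the Java Tabula process


def parse_aggregate_report(pdf_path: str,
                           publish_date: datetime.date,
                           granularity: str) -> pd.DataFrame:
    """Converts one aggregate report PDF into a long-format dataframe (sketch)."""
    # Tabula returns one dataframe per table it detects in the PDF.
    tables = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)

    # Flatten each table into key-value pairs and attach the contextual
    # fields used downstream for stitching and analysis.
    frames = []
    for table in tables:
        melted = table.melt(var_name="key", value_name="value")
        melted["publish_date"] = publish_date
        melted["report_granularity"] = granularity  # e.g. "weekly", "monthly"
        frames.append(melted)

    return pd.concat(frames, ignore_index=True)
```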

First-time runs for new aggregate jurisdictions

The first time a new aggregate scraper is run, the call to the endpoint may time out because it has to download many PDFs. If this happens, just run it again: the process is idempotent, and each subsequent attempt will have fewer PDFs left to download. One important note: if a report fails to parse and you don't intend to fix it, you must move the file to the "processed" bucket so that the pipeline doesn't re-download it every night.
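
A minimal sketch of manually moving such a report into the "processed" bucket, assuming the google-cloud-storage client library; the bucket and object names here are illustrative.

```python
from google.cloud import storage

client = storage.Client()
to_ingest_bucket = client.bucket("my-project-aggregate-to-ingest")
processed_bucket = client.bucket("my-project-aggregate-processed")

# Copy the report into the "processed" bucket, then delete the original,
# so the nightly scrape treats it as already handled and skips it.
blob = to_ingest_bucket.blob("ca/report_2019_09.pdf")
to_ingest_bucket.copy_blob(blob, processed_bucket, new_name=blob.name)
blob.delete()
```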