
Scrape Based Ingest


Scraping

Most correctional systems in America provide search engines that can be used to look up someone's location within the system, when they might be released, why they are there, and so forth. These vary in function and in how you query them -- some require just a last name, others require a full first and last name. Some show people who have been released, others don't. Some provide a flat list of people who matched a particular query, along with their birth dates, while others require navigating a page-tree structure.

Our scraping platform provides a configuration-driven way to declare how a particular jurisdiction's data systems are structured and to pull that information into our system at scale: we scrape corrections data from nearly 1,000 counties and counting, every night. Most of these scrapers require little code beyond boilerplate -- just YAML configuration.
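
To make that concrete, here is a minimal sketch of what configuration-driven field mapping can look like, assuming a hypothetical YAML file and key names (not the actual Recidiviz schema):

```python
# Minimal sketch: a hypothetical YAML mapping from scraped page labels to
# structured ingest fields, applied by shared Python code.
import yaml  # PyYAML

EXAMPLE_YAML = """
# hypothetical us_xx_county.yaml
key_mappings:
  "Inmate Name": person.full_name
  "Date of Birth": person.birthdate
  "Facility": booking.facility
"""

config = yaml.safe_load(EXAMPLE_YAML)

def map_fields(scraped_row: dict, key_mappings: dict) -> dict:
    """Translate raw scraped labels into structured ingest fields."""
    return {
        target: scraped_row[label]
        for label, target in key_mappings.items()
        if label in scraped_row
    }

print(map_fields({"Inmate Name": "DOE, JANE", "Facility": "Main Jail"},
                 config["key_mappings"]))
```

The real mapping files are richer than this, but the principle is the same: the region contributes declarations, and shared code does the work.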

General scraper structure

Every individual-level scraper has a corresponding data system that it scrapes. The flow for a scraper looks like this (a sketch of a single scrape task follows the list):

  1. A nightly cron job (typically 9pm in the region's local time) calls the start command on a scraper.
  2. Upon receiving start, the region opens a new scrape session and adds its first task to the region's task queue.
  3. Every task makes a request to a page and either adds more work onto the queue, or scrapes data from the page and puts it into the structured ingest_info object.
    1. For more information on the task types and what the individual level scrapers need to implement, see Create a Scraper.
  4. Each outbound request goes through a proxy so as to protect the Google Cloud Platform IP reputation and reduce the odds of accidentally triggering DDoS protection.
  5. Every task that scrapes data meant to be persisted (as opposed to merely navigating page structure) publishes its ingest_info to a Pub/Sub topic for batch persistence.
  6. Once a scrape session is complete, a batch persistence process is triggered which reads all of the messages from the topic and persists them in one transaction. Every person's entity graph goes through entity matching, data conversion, and validation. The output is a standardized set of related objects that are written to the database.
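
The sketch below shows the shape of a single scrape task covering steps 3-5, assuming hypothetical helpers (fetch_via_proxy, parse_page, enqueue_task) and placeholder project and topic names; it is not the exact platform code:

```python
# Sketch of one scrape task: make a proxied request, enqueue follow-up work,
# and publish any persistable data. fetch_via_proxy, parse_page, and
# enqueue_task are hypothetical helpers standing in for platform code.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
TOPIC = publisher.topic_path("my-gcp-project", "batch-persistence")  # placeholder names

def handle_scrape_task(task: dict) -> None:
    html = fetch_via_proxy(task["endpoint"])    # step 4: every request goes through a proxy
    result = parse_page(html, task)             # region/vendor-specific parsing

    for next_task in result.next_tasks:         # navigational work goes back on the queue
        enqueue_task(next_task)

    if result.ingest_info is not None:          # step 5: data meant to be persisted
        payload = json.dumps(result.ingest_info).encode("utf-8")
        publisher.publish(TOPIC, data=payload)
```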

Vendor or Region

Scraper implementations can be either vendor scrapers or region scrapers.

  • Vendor scraper - this is a scraper built for a particular vendor that provides the website for the jurisdiction we are scraping, e.g. a jail roster website provided to county corrections systems. Some vendors have large numbers of clients and their client websites are consistent, so building a scraper for the vendor provides the logic necessary to scrape all of the client websites. In these cases, the vendor scraper is extended by a boilerplate region scraper for each such client, and the manifest.yaml for that region declares the vendor.
    • Most vendors provide complete consistency between different client regions, but some do have minor variations. Where there is variation, the vendor scraper tends to use abstract methods that the region scrapers override, as minimally as possible (see the sketch after this list).
  • Region scraper - this is a scraper built specifically for a particular region's website, e.g. a single county jail roster website. The actual scraper logic should still be fairly thin as we have pushed much of the navigation logic up to the orchestration layer and virtually all of the ingest logic down to the configuration-driven pipeline. But some websites do prove complex or quirky and require a bit more up-front effort to get right. Fortunately, these websites change incredibly infrequently and ongoing maintenance is almost never required -- anecdotally, we are aware of only a couple of the hundreds of websites that we scrape that have changed since we began.
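
A minimal sketch of that vendor/region split, using hypothetical class and method names (the real interface lives in scraper.py and the vendor base classes):

```python
# Sketch of the vendor/region split: the vendor scraper holds the shared
# logic, and each client region overrides only what varies. All names and
# URLs here are hypothetical.
from abc import ABC, abstractmethod

class ExampleVendorScraper(ABC):
    """Shared logic for every county site hosted by a hypothetical vendor."""

    def get_roster_url(self) -> str:
        # Consistent across all of the vendor's client sites.
        return f"{self.get_region_base_url()}/roster.aspx"

    @abstractmethod
    def get_region_base_url(self) -> str:
        """The only piece that varies per client region."""

class UsXxCountyScraper(ExampleVendorScraper):
    """Thin, boilerplate region scraper that fills in the vendor's blanks."""

    def get_region_base_url(self) -> str:
        return "https://inmates.example-county.us"  # hypothetical URL
```

The region scraper stays boilerplate: it declares the vendor in manifest.yaml and fills in only the pieces the vendor scraper leaves abstract.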

Task Queues

As noted above, the scraping system is built atop a task queue system, specifically Google Cloud Tasks. For the most part, every jurisdiction that we scrape has its own dedicated queue, though some of the major vendors behind sites we scrape have multi-tenant infrastructure (i.e. a backend that hosts the data systems for many jurisdictions simultaneously) and we create a shared queue for all sites backed by such vendors, to ensure global rate limiting for all requests to those backends.

Each new request generated by a scraper, whether to obtain information to write to the database or to simply navigate the page tree structure of a given site, is given its own task. Tasks propagate context to tasks that they spawn, as needed -- for some sites, this involves sharing session information that's required to navigate through search results; for others, this involves passing partially built up ingest results when multiple requests are required to retrieve the data for a single person. Tasks return HTTP status codes on completion to signal success or failure, and the orchestration logic (in our main platform application layer, hosted in Google App Engine) handles retry or launch of new downstream tasks as necessary.
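
As an illustration, enqueueing a follow-up task with propagated context might look roughly like the sketch below; the project, location, queue name, and handler path are placeholders, not the platform's actual values:

```python
# Sketch of enqueueing a follow-up scrape task with propagated context via
# the Cloud Tasks client; project, queue, and handler path are placeholders.
import json
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-gcp-project", "us-east1", "us-xx-county-scraper")

def enqueue_scrape_task(endpoint: str, context: dict) -> None:
    """Context (e.g. session variables, partial ingest results) rides along in the body."""
    task = {
        "app_engine_http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "relative_uri": "/scraper/work",   # assumed worker handler path
            "body": json.dumps({"endpoint": endpoint, "context": context}).encode(),
        }
    }
    client.create_task(parent=parent, task=task)
```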

Orchestration

Cloud Tasks automatically manages everything in the queue orchestration layer: retrying failed tasks, rate limiting, releasing tasks from queues to the designated worker request handler, providing delivery semantics approaching exactly-once, and so forth. In our application layer, we implement the proper request handlers to ensure the correct work is done at the right time, and we provide the scraper interface to make it easy to define navigation structure, share context between tasks, and compose results into entity graphs for the ingest pipeline.

A few internal API calls are available to control scraper function, including starting new sessions, pausing existing sessions, and stopping sessions entirely. The top-level scraper.py abstract class has the primary logic for responding to these requests, with a few abstract methods that delegate navigational structure to the specific scraper, be it a vendor or a region scraper.
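
A minimal sketch of how those control calls might route through the abstract class, with hypothetical method names rather than the actual scraper.py interface:

```python
# Sketch of the control flow behind start/stop, with hypothetical method
# names; the real logic lives in the top-level scraper.py abstract class.
from abc import ABC, abstractmethod

class Scraper(ABC):
    """Shared session control; navigational structure is delegated to subclasses."""

    def start_scrape(self) -> None:
        first_task = self.get_initial_task()   # subclass decides where scraping begins
        self.add_task(first_task)              # hand off to the region's Cloud Tasks queue

    def stop_scrape(self) -> None:
        # In the platform this would purge outstanding tasks and close the session.
        ...

    def add_task(self, task) -> None:
        # In the platform this enqueues the task onto the region's queue.
        ...

    @abstractmethod
    def get_initial_task(self):
        """Return the first navigational task for this region or vendor."""
```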

Example workflow

As an example, a scraper may be created for a jurisdiction with a page tree structure that needs to be navigated, and a search form that will return all results when the name fields are empty. Its work might look like this:

  1. Enqueue a task to request the main search page of the person search system, in order to get session variables
  2. Queue releases task from #1 to a scraper worker, which begins to execute
  3. Scraper sends request to website
    1. Scraper receives search page
    2. Scraper parses search page for session variables
    3. Scraper enqueues task to submit a query using those session variables with empty values for the name fields
  4. Queue releases task from #3.3 to a scraper worker, which begins to execute
  5. Scraper sends request
    1. Scraper receives results page
    2. Scraper parses the results page, which contains a list of links that each lead to a page with information about a single person, plus a "Next page" button that goes to the next page of person links
    3. Scraper enqueues a separate scraping task to follow each link in search results
    4. Scraper enqueues a task to follow the "Next page" button
  6. Queue releases one of the person-link tasks from #5.3 to a scraper worker, which begins to execute
    1. Scraper receives the person's detail page
    2. Scraper parses the page, composing an IngestInfo entity graph for the person on the page (sketched after this workflow)
    3. Having parsed the page successfully, the scraper publishes the IngestInfo to the batch persistence Pub/Sub topic, to be saved to the database once the session is complete and validated
  7. ...
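
The entity graph composed in step 6.2 can be approximated with the self-contained sketch below; the real IngestInfo API and field names differ, so treat these classes and keys as hypothetical:

```python
# Self-contained approximation of composing a person -> booking -> charge
# entity graph from one parsed detail page; not the real IngestInfo API.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Charge:
    name: Optional[str] = None

@dataclass
class Booking:
    admission_date: Optional[str] = None
    facility: Optional[str] = None
    charges: List[Charge] = field(default_factory=list)

@dataclass
class Person:
    full_name: Optional[str] = None
    birthdate: Optional[str] = None
    bookings: List[Booking] = field(default_factory=list)

def parse_person_page(parsed_fields: dict) -> Person:
    """Build the entity graph for the single person described on the page."""
    person = Person(full_name=parsed_fields.get("name"),
                    birthdate=parsed_fields.get("dob"))
    booking = Booking(admission_date=parsed_fields.get("admitted"),
                      facility=parsed_fields.get("facility"))
    booking.charges = [Charge(name=c) for c in parsed_fields.get("charges", [])]
    person.bookings.append(booking)
    return person

print(parse_person_page({"name": "DOE, JANE", "dob": "1980-01-01",
                         "admitted": "2019-09-30", "charges": ["THEFT"]}))
```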

Authentication

All scraper actions are made through server requests to our main platform application layer in Google App Engine, with options passed as query parameters. All request handlers within the Recidiviz platform require authentication, so they can be called only from platform services themselves (e.g. a cron job or a scraping task) or via an explicit API request by an authenticated user.

This is trivial to configure by adding the @authenticate_request decorator to request handler functions.
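
Purely as an illustration of what such a decorator guards against (not the platform's actual implementation), a Flask handler could check the headers App Engine attaches to cron and task-queue requests:

```python
# Illustrative sketch of an authentication decorator; App Engine sets the
# X-Appengine-Cron and X-AppEngine-QueueName headers on its own requests and
# strips them from external traffic.
from functools import wraps
from flask import Flask, request

app = Flask(__name__)

def authenticate_request(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        is_cron = request.headers.get("X-Appengine-Cron") == "true"
        is_task = "X-AppEngine-QueueName" in request.headers
        if not (is_cron or is_task):
            return "Forbidden", 403   # an authenticated-user check would also go here
        return func(*args, **kwargs)
    return wrapper

@app.route("/scraper/start")
@authenticate_request
def start_scraper():
    return "Scrape session started", 200
```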

Respecting other corrections system users

Since these data systems are intended to benefit the public and agency staff, it's absolutely critical that we work in good faith to preserve the responsiveness of these systems for other users. We strive to be good citizens.

Policy restrictions

Corrections systems often state specific restrictions on automated use in either a robots.txt file directly under the top-level domain, or in the site's Terms and Conditions. Whenever we add a new scraper, we verify that it adheres to the latest version of both documents.
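
The robots.txt side of that check can be automated with the standard library, as in the sketch below (the URLs and user agent token are placeholders; the Terms and Conditions review remains a manual step):

```python
# Sketch of checking robots.txt before adding a scraper; URLs are placeholders.
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("https://inmates.example-county.us/robots.txt")  # hypothetical site
parser.read()

roster_url = "https://inmates.example-county.us/roster.aspx"
if parser.can_fetch("recidiviz-scraper", roster_url):
    print("robots.txt permits automated access to", roster_url)
else:
    print("robots.txt disallows this path; do not scrape it")
```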

Rate limiting

By default, we restrict our queries to a highly conservative rate of 5 queries per minute.
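
With Cloud Tasks, that limit can be pinned on the queue itself; the sketch below uses placeholder project, location, and queue names:

```python
# Sketch of capping a scraper queue at roughly 5 dispatches per minute via
# the Cloud Tasks API; names are placeholders.
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
queue = tasks_v2.Queue(
    name=client.queue_path("my-gcp-project", "us-east1", "us-xx-county-scraper"),
    rate_limits=tasks_v2.RateLimits(
        max_dispatches_per_second=5 / 60,   # ~5 queries per minute
        max_concurrent_dispatches=1,        # never hit the site in parallel
    ),
)
client.update_queue(queue=queue)
```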

We also schedule scraping for times of day when other users of these data systems are less likely to be using them (i.e. starting at 9pm local time and stopping by 9am local time if the session is not yet complete), to further protect the systems' other intended user groups from any potential impact.

Contact info

We provide a contact e-mail address in the scraper's user agent string to allow administrators of a search system to reach out to us if they notice any problem. It's important for us to be accessible to these groups in case issues develop that we aren't aware of.
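
For example, the user agent on outbound requests might look like the sketch below; the address and version string are placeholders:

```python
# Sketch of advertising a contact address in the user agent string.
import requests

HEADERS = {
    "User-Agent": "recidiviz-scraper/1.0 (contact: scrapers@example.org)"  # placeholder
}

response = requests.get("https://inmates.example-county.us/roster.aspx",
                        headers=HEADERS, timeout=30)
```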

Proxying

As mentioned under General scraper structure above, we pass each outbound request through a proxy to ensure we don't damage the IP reputation of Google Cloud Platform, and avoid accidentally triggering DDoS protection and generating unnecessary late-night pages.
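
With the requests library, routing through a proxy is a per-request setting, roughly as sketched below; the proxy host and credentials are placeholders:

```python
# Sketch of sending an outbound scrape request through a proxy.
import requests

PROXIES = {
    "http": "http://user:password@proxy.example.com:8080",   # placeholder proxy
    "https": "http://user:password@proxy.example.com:8080",
}

response = requests.get("https://inmates.example-county.us/roster.aspx",
                        proxies=PROXIES, timeout=30)
```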