Note: As a part of the migration to python 3
, this code no longer actually fetches records or writes out metrics. The logic to do the calculation still exists, but it is left in a non-working state until it is adapted to our new schema.
This package contains a recidivism calculation pipeline built on Google App Engine’s MapReduce library.
This calculates both event- and offender-based recidivism metrics across a variety of dimensions, including location, sentence history, offense characteristics, and more. All metrics are normalized by release cohort, the calendar year in which a subset of the person population is released from prison, and by follow-up period, the number of years after release of prison during which recidivism was measured. For example, this will calculate the overall recidivism rate for the 2005 release cohort over a 10-year follow-up period, or recidivism rates over a 10-year follow-up period for white females aged 30-34, released from Sing Sing, who were arrested for drug possession.
All calculated metrics are stored in Cloud Datastore for later use.
More on the precise calculation methodology can be read here.
The calculation pipeline will fail at startup (runtime) on your local machine if you do not uncomment the two
flagged lines in app.yaml
: module: default
and version:default
. There is an outstanding item to make this
configuration dynamic by environment. For now, remember to uncomment these when running locally.
The calculation pipeline runs in three steps.
First, it maps
each Person
and their personal set of Records
(instances of incarceration) into an array an array of tuples whose first element is a "metric combination",
a set of key-value pairs that describe that particular record (i.e. release cohort, follow-up period,
methodology, and other optional characteristics), and whose second element is a "recidivism value" that will
be counted toward recidivism rates across all records that share that metric combination. Because we track
all possible combinations of relevant characteristics for a particular record, the number of metric combinations
is much higher than the total number of records counted. If recidivism did occur, then the recidivism value is
a 1 for event-based measurements and 1 / k for offender-based measurements, where k is the number of releases
from prison for that person in that particular follow-up period.
Then, automatically, it shuffles
all metric combinations to group equal ones. This produces an array of key-value
pairs where each key is a unique metric combination and each value is an array of recidivism values that share that
combination. This is, essentially, the set of instances of recidivism or not-recidivism for each metric we intend to
calculate.
Finally, it reduces
each key-value pair into a RecidivismMetric
, including the characteristics in the metric
combination and the recidivism rate taken by dividing the sum of all recidivism values over the total number of
instances being counted, i.e. the total number of releases for that combination. Each RecidivismMetric
is written
to Cloud Datastore where it can be read and searched for later.
See the top-level readme for more information on setting up Google App Engine locally and running the Recidiviz app on the local development server.
After launching the app, launch a calculation pipeline by navigating to localhost:8080/calculator_pipeline
in
your browser. More generally, the API for launching calculation is:
GET /calculator_pipeline[?region={region}&include_conditional_violations={true|false}]
...
302/Redirect
There is no command for pausing a pipeline in progress and no particular command for stopping a pipeline, besides
killing the app server process in some way, e.g. CMD+C
.
Upon successfully starting the pipeline, the server redirects the client to http://localhost:8080/mapreduce/pipeline/status?root={pipeline_id}
,
a page which shows the running status of the various steps in the pipeline. This page shows basic operational
statistics, such as success versus failure, IO metrics, and runtime, the current work on each provisioned shard,
and any counters we are manually tracking, among other metadata.
The status of all past and current pipelines can be found in a table at http://localhost:8080/mapreduce/status
,
including the individual map, shuffle, and reduce steps for each pipeline.
To see the recidivism metrics produced by a particular pipeline, copy the id (from the "ID" column) from the row for
the reduce step in that pipeline on the previously mentioned table, and search the RecidivismMetric
entity in
Cloud Datastore where execution_id
equals that reduce step id.
Counters being tracked in each pipeline include:
total_people_mapped
(map step): the total number of people passed through the map step. This should equal the number of person entities in Datastore.total_metric_combinations_mapped
(map step): the total number of metric combinations created by the map step.unique_metric_keys_reduced
(reduce step): the number of unique metric keys passed from the shuffle step to the reduce step. This should be strictly less thantotal_metric_combinations_mapped
as it effectively de-duplicates that array of results.total_records_reduced
(reduce step): the number of tallied records that were passed to the reduce step. This should be equal tototal_metric_combinations_mapped
as it is a transformation from all metric combinations into the sum of instances of each unique metric combination.total_recidivisms_reduced
(reduce step): the number of tallied records that resulted in recidivism that were passed to the reduce step. Dividing this bytotal_records_reduced
will not give any particularly meaningful approximation of overall recidivism. But the number can provide a sanity check: it should be some reasonable percentage oftotal_records_reduced
, maybe somewhere around 15%-40% depending on region.
Logs are printed via stdout by default for the local development server. In production, logs can be read in the Cloud Console.