
Entity Matching

Joshua Essex edited this page Mar 8, 2021 · 1 revision

What’s the problem?

When Recidiviz partners with states, we receive the state’s criminal justice data. Typically we first receive a "historical" export, which contains data for all relevant state-side tables stretching as far back as the data goes. From then on we receive regular exports, which usually contain only new updates to those tables. This process is documented in more detail elsewhere in this wiki, but every time we receive state data, we process it, normalize it, and store it in our Postgres DB. This is the "ingest" process.

After we have data that exists within Postgres for a given state, whenever we go to ingest new information for that state, we have to decide if the new information:

  • Represents an update to an existing entity/row in one of our tables.
  • Represents an entirely new entity/row that needs to be created within one of our tables.
  • Represents a new relationship between any two entities/rows in different tables.

Entity matching is the process of identifying whether the new data we’re ingesting "matches" existing entities within our DB, and then either updating those entities or adding new ones to our tables.

How does it work?

Our schema graph can be viewed as a tree with StatePerson as the root node. By the time the ingested information reaches the entity matching stage, all new information has been normalized and converted into a group of entity trees with StatePerson as the root nodes.

Entity matching recursively walks the ingested entity tree, and at each stage it attempts to match the ingested entity with DB entities of the same type in the search space.

Entity matching is performed serially, matching ingested entity trees one at a time. At the outset, therefore, the ingested entity to be matched is always a StatePerson entity, and the search space of potential DB matches contains all StatePerson entities within our Postgres DB for the given state.

At a high level, the recursive algorithm:

  • Checks if the current ingested entity matches any of the DB entities of the same type in the search space.
  • If there is a match:
    • "Merge" the ingested entity onto the DB entity, by overwriting any old information on the DB entity with newer information from the ingested entity.
      • Don't worry: historical snapshotting is going to ensure that we don't lose any of the information we overwrote.
    • Perform entity matching for all children entities of the ingested entity, limiting the search space of potential DB matches to just the children of the matched DB entity.
      • The results of entity matching for all children are added to the updated DB entity. That means children of this entity will include:
        • All new child entities (no matches in the DB)
        • All merged child entities (overwritten versions of entities already in the DB)
        • All DB child entities that weren’t updated.
  • If there is not a match, and the current ingested entity is a placeholder:
    • Perform entity matching for all children entities of the ingested entity, without restricting the search space at all.
      • Any children that match existing DB children (and have therefore been "updated" by entity matching) are moved off of this ingested placeholder entity and onto the corresponding DB entity.
  • If there is not a match, and the current ingested entity is not a placeholder:
    • Perform entity matching for all children entities of the ingested entity, limiting the search space of potential DB matches to just the children of placeholder DB entities.
      • Any children that match existing DB children (and have therefore been "updated" by entity matching) are moved onto the ingested entity and removed from the placeholder DB entity.
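
The three branches above can be sketched roughly as follows. This is a simplified illustration, not the actual entity matcher: the Entity class, the match_tree function, and the rule that an entity without an external_id is a placeholder are all assumptions made for the example. It also assumes placeholders never match anything, and it reads "without restricting the search space" as "the children of every DB entity of the same type at the current level".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entity:
    """Simplified stand-in for a schema entity (hypothetical, not the real classes)."""
    kind: str                          # e.g. "StatePerson"
    external_id: Optional[str] = None  # None marks a placeholder in this sketch
    data: Dict[str, str] = field(default_factory=dict)
    children: List["Entity"] = field(default_factory=list)

    @property
    def is_placeholder(self) -> bool:
        return self.external_id is None


def _contains(entities: List[Entity], target: Entity) -> bool:
    """Identity-based membership test (dataclass == would compare field values)."""
    return any(e is target for e in entities)


def match_tree(ingested: Entity, search_space: List[Entity]) -> Optional[Entity]:
    """Match `ingested` against DB entities at the same tree level.

    Returns the entity the caller should attach to its parent: the updated DB
    entity on a match, the ingested entity when it is new, or None when an
    ingested placeholder had all of its children moved onto DB entities.
    """
    same_kind = [db for db in search_space if db.kind == ingested.kind]

    if ingested.is_placeholder:
        # No match, placeholder: search under every DB entity of this kind.
        space = [c for db in same_kind for c in db.children]
        remaining = []
        for child in ingested.children:
            result = match_tree(child, space)
            # A matched child already lives on its DB parent, so it is dropped
            # from the placeholder; unmatched children stay where they are.
            if result is not None and not _contains(space, result):
                remaining.append(result)
        ingested.children = remaining
        return ingested if remaining else None

    matches = [db for db in same_kind if db.external_id == ingested.external_id]
    if len(matches) > 1:
        raise ValueError(f"{ingested.kind} {ingested.external_id} matches multiple DB entities")

    if matches:
        # Match: merge onto the DB entity, then recurse into its children only.
        db = matches[0]
        db.data.update(ingested.data)
        for child in ingested.children:
            result = match_tree(child, db.children)
            if result is not None and not _contains(db.children, result):
                db.children.append(result)
        return db

    # No match, not a placeholder: only children of DB placeholders are candidates.
    placeholders = [db for db in same_kind if db.is_placeholder]
    adopted = []
    for child in ingested.children:
        space = [c for p in placeholders for c in p.children]
        result = match_tree(child, space)
        if result is None:
            continue
        for p in placeholders:
            # Move a matched child off the DB placeholder and onto the ingested entity.
            p.children = [c for c in p.children if c is not result]
        adopted.append(result)
    ingested.children = adopted
    return ingested
```

Calling match_tree(ingested_person, db_people) returns the updated DB person when a match exists, and the ingested person itself when it is new.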

At the end of entity matching, the brand new ingested StatePerson entity trees and the updated DB StatePerson entity trees go through post-processing (namely merging of multi-parent entities, described below) and are eventually persisted to our DB.

How do we know if entities match?

All entities within our schema have a field called external_id (save for StatePerson, which can have multiple StatePersonExternalIds of different types). An external id is the unique identifier that the state itself uses for that entity.

Occasionally, we’ll generate our own external_ids, which are usually combinations of various columns. This can happen when our definitions of entities differ from those of the states, and so we’re forced to split or combine rows from state-side tables into our own entities. Often, entities such as StatePerson, StateSentence, StateCharge, etc. have 1:1 mappings from the state tables to our own tables, and so we just propagate the state id as our external_id. However, when our entity definitions do differ from those of the state, we’re forced to create our own external_ids (most commonly for StateSupervisionPeriod and StateIncarcerationPeriod).

Regardless of whether we propagate an existing external identifier or deterministically create our own from other fields, we typically entity match on that field. This requires that when we compose our own external id from other fields, that combination of fields must represent a primary/unique key within the raw data file.
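
As a rough illustration of that requirement, the sketch below builds an external id from several raw columns and reports any ids that occur more than once. The function names, the column names, and the "-" separator are hypothetical and not taken from the codebase.

```python
from collections import Counter
from typing import Dict, List, Sequence


def build_external_id(row: Dict[str, str], key_columns: Sequence[str]) -> str:
    """Deterministically join the key columns into one id.

    The "-" separator is an assumption; it is only safe if the column values
    themselves never contain "-".
    """
    return "-".join(row[column] for column in key_columns)


def duplicated_external_ids(rows: List[Dict[str, str]], key_columns: Sequence[str]) -> List[str]:
    """Return every composed id that appears on more than one raw row."""
    counts = Counter(build_external_id(row, key_columns) for row in rows)
    return [external_id for external_id, n in counts.items() if n > 1]


# Hypothetical raw rows for supervision periods.
rows = [
    {"person_id": "123", "period_seq": "1"},
    {"person_id": "123", "period_seq": "2"},
    {"person_id": "456", "period_seq": "1"},
]

# person_id alone is not a unique key for these rows, so it cannot serve as the external id.
print(duplicated_external_ids(rows, ["person_id"]))
# person_id combined with period_seq is unique, so it is a usable external id.
print(duplicated_external_ids(rows, ["person_id", "period_seq"]))
```

An empty result means the chosen columns form a unique key within the file being checked.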

However, when no such primary/unique key is derivable, we can also match entities based on an equivalence check of other fields aside from the external_id. Deciding which entities should match based on an equivalence check of other fields can be configured in the state-specific matching delegate, described later in this document.
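
A minimal sketch of the two matching strategies, under stated assumptions: the Charge class and the is_match function are invented for illustration, and the match_on_fields flag stands in for whatever the state-specific delegate configures.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Charge:
    """Hypothetical, simplified entity used only for this example."""
    external_id: Optional[str] = None
    statute: Optional[str] = None
    description: Optional[str] = None


def is_match(ingested: object, db: object, match_on_fields: bool = False) -> bool:
    """Decide whether an ingested entity and a DB entity are the same entity.

    external_id wins whenever either side has one. The field-equivalence
    fallback is opt-in, mirroring the idea that it is configured per state.
    """
    if type(ingested) is not type(db):
        return False
    if ingested.external_id is not None or db.external_id is not None:
        return ingested.external_id == db.external_id
    if not match_on_fields:
        return False
    return all(
        getattr(ingested, f.name) == getattr(db, f.name)
        for f in fields(ingested)
        if f.name != "external_id"
    )
```

With this sketch, two Charge objects that share an external_id match even if their other fields differ, while two id-less charges match only when match_on_fields is enabled and every other field is equal.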

Complexities that entity matching supports

Merging of multi-parent entities into 1 entity

There are certain entities that can have multiple parents. For example, both StateSupervisionPeriod and StateIncarcerationPeriod can have StateSupervisionSentence and StateIncarcerationSentence as parents.

Due to the way that entity matching restricts the search space as it recursively iterates throughout the tree, it is theoretically possible to end up with two entities on a single StatePerson tree that need to be merged.

In the example of a StateSupervisionPeriod, it’s possible that there are matching StateSupervisionPeriods hanging off of a StateIncarcerationSentence and a StateSupervisionSentence within one StatePerson. Our system will make sure to merge such entities before committing anything to the DB.
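
One way to picture this merge step is sketched below. The Period and Sentence classes and the merge policy (keep the first copy seen and fill in its missing fields from later copies) are assumptions for illustration; the real merge logic may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Period:
    """Hypothetical stand-in for a StateSupervisionPeriod."""
    external_id: str
    start_date: Optional[str] = None
    end_date: Optional[str] = None


@dataclass
class Sentence:
    """Hypothetical stand-in for either sentence type."""
    external_id: str
    periods: List[Period] = field(default_factory=list)


def merge_multi_parent_periods(sentences: List[Sentence]) -> None:
    """Make all parents within one person tree share a single Period per external_id."""
    canonical: Dict[str, Period] = {}
    for sentence in sentences:
        for i, period in enumerate(sentence.periods):
            kept = canonical.setdefault(period.external_id, period)
            if kept is period:
                continue
            # Assumed merge policy: fill gaps on the kept copy from the duplicate.
            for name in ("start_date", "end_date"):
                if getattr(kept, name) is None:
                    setattr(kept, name, getattr(period, name))
            # Point this parent at the shared object instead of its own copy.
            sentence.periods[i] = kept


incarceration_sentence = Sentence("IS1", [Period("SP1", start_date="2020-01-01")])
supervision_sentence = Sentence("SS1", [Period("SP1", end_date="2020-06-01")])
merge_multi_parent_periods([incarceration_sentence, supervision_sentence])
```

After the call, both sentences reference the same Period object, which carries both the start and the end date.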

Matching of entity trees with "placeholder" entities

Although less frequent since SQL-based preprocessing was introduced to the beginning of the ingest process, it is definitely possible that the ingested tree from a single ingest view contains "placeholder" entities, described later in this document.

Entity matching supports the ingestion of placeholder entities in two ways. In the examples below, a trailing ' marks an entity that carries ingested data.

  • Moving children from ingested placeholders onto non-placeholder DB entities, if the children of the ingested placeholder and the non-placeholder DB entity match.

    Ingested: StatePerson' -> StateSentenceGroup (placeholder) -> StateSupervisionSentence'

    DB: StatePerson -> StateSentenceGroup -> StateSupervisionSentence

    Result: StatePerson' -> StateSentenceGroup -> StateSupervisionSentence'

  • Moving children from DB placeholder entities onto ingested non-placeholder entities, if the children of the ingested entity and DB placeholder match.

    Ingested: StatePerson' -> StateSentenceGroup' -> StateSupervisionSentence'

    DB: StatePerson -> StateSentenceGroup (placeholder) -> StateSupervisionSentence

    Result: StatePerson' -> StateSentenceGroup' -> StateSupervisionSentence'

Ingesting new entities with non-person root classes

Note: This use case should effectively disappear with the introduction of SQL-based preprocessing. For efficiency's sake, all processed ingest views should have a reference to the StatePerson.

In states that were not created with SQL-based preprocessing (ND), it is possible to ingest a file that does not have a reference to the StatePerson. In ND, some files only have references to the StateSentenceGroup (one level beneath StatePerson). Entity matching supports ingesting files that start at a non-person level. In these cases, as long as the DB already includes an entity tree that includes a relationship between that, say, StateSentenceGroup and the StatePerson, then entity matching will match to the StateSentenceGroup and in turn to the StatePerson.

Concepts

"Placeholder" entity

These are entities that contain no useful information from the state, and are only in our schema to connect two otherwise unconnected parts of our entity tree.

For example, this could happen if a single ingest view had information about a person and their supervision sentences, but none of the information about their sentence group.

In this case, our ingest tree would look like:

StatePerson →  StateSentenceGroup (placeholder) → StateSupervisionSentence

As we don’t have information about the StateSentenceGroup it’s a "placeholder" in this chain.

Generally, you can tell that an entity is a placeholder if none of its fields are filled out in our schema, save for default enum values.
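
That rule of thumb could be expressed along these lines. The SentenceGroup class, its fields, and the PRESENT_WITHOUT_INFO default are hypothetical; the sketch also assumes that list-valued fields are relationships to children rather than state data.

```python
from dataclasses import dataclass, field, fields
from enum import Enum
from typing import List, Optional


class SentenceStatus(Enum):
    PRESENT_WITHOUT_INFO = "PRESENT_WITHOUT_INFO"  # hypothetical default enum value
    SERVING = "SERVING"


@dataclass
class SentenceGroup:
    """Hypothetical, simplified entity used only for this example."""
    external_id: Optional[str] = None
    status: SentenceStatus = SentenceStatus.PRESENT_WITHOUT_INFO
    county_code: Optional[str] = None
    children: List[object] = field(default_factory=list)


def is_placeholder(entity: object) -> bool:
    """True if no field holds state-provided data."""
    for f in fields(entity):
        value = getattr(entity, f.name)
        if isinstance(value, list):
            continue  # relationship to child entities, not data about this entity
        if value is None:
            continue  # field was never filled out
        if isinstance(value, Enum) and value == f.default:
            continue  # enum still at its default value
        return False
    return True
```

Under these assumptions, SentenceGroup() is a placeholder even when it has children, while SentenceGroup(external_id="SG-1") is not.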

Standalone entity

An entity that conceptually does not belong to a StatePerson, but is still attached to that person tree. An example is a judge (modeled as a StateAgent). On a schema level, these entities do not have person_id foreign keys.

Dangling placeholders

Dangling placeholders are placeholder entities that have no non-placeholder children. While dangling placeholders should never be ingested, they can be created through multiple entity matching runs. This happens when a DB placeholder entity starts out with a non-placeholder child, but due to a subsequent ingest run, that child is moved onto a non-placeholder parent.
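
The sketch below shows how a dangling placeholder arises and how one might detect it. The Node class is hypothetical, a missing external_id stands in for "placeholder", and "children" is interpreted as any descendant, which is an assumption beyond the definition above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; external_id=None marks a placeholder in this sketch."""
    name: str
    external_id: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def is_placeholder(node: Node) -> bool:
    return node.external_id is None


def has_real_descendant(node: Node) -> bool:
    """True if any entity below `node` is not a placeholder."""
    return any(
        not is_placeholder(child) or has_real_descendant(child)
        for child in node.children
    )


def find_dangling_placeholders(root: Node) -> List[Node]:
    """Collect placeholders in the tree that have no non-placeholder descendants."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if is_placeholder(node) and not has_real_descendant(node):
            found.append(node)
        stack.extend(node.children)
    return found


# DB state after a first ingest run: the sentence hangs off a placeholder group.
placeholder_group = Node("StateSentenceGroup", None, [Node("StateSupervisionSentence", "S1")])
real_group = Node("StateSentenceGroup", "G1")
person = Node("StatePerson", "P1", [placeholder_group, real_group])
dangling_before = find_dangling_placeholders(person)

# A later run moves the sentence onto the non-placeholder group.
real_group.children.append(placeholder_group.children.pop())
dangling_after = find_dangling_placeholders(person)
```

Here dangling_before is empty, and dangling_after contains only the now-childless placeholder group.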

State-specific hooks

While most of the entity matching algorithm is state-agnostic, our entity matcher supports the addition of state-specific logic in several places. Check the documentation in the base_state_matching_delegate module for more details. The most commonly configured state-specific additions are:

  • Non-external id matching for specific entity classes
    • For example, if in a state we know that StateCharges never have external_ids but there’s only ever one, unchanging, StateCharge per sentence - we can configure entity matching to always call StateCharges a match if all non-external_id fields match.
  • State-specific pre-processing
    • Code that is executed before entity matching begins. This tends to be done to "cure" some interesting result of previous steps in the ingest process.
    • For example, ingest of incarceration periods in ND can create multiple "incomplete" periods that can be merged together. This is done prior to entity matching to ensure that we have "complete" periods which will be amenable to the matching process.
  • State-specific post-processing
    • Code executed after entity matching finishes, but before results are committed to the DB.
    • A common example is matching, say, StateSupervisionViolations into StateSupervisionPeriods by date where there is no discrete reference from the former to the latter. Violations with violation dates between the supervision period start and termination dates would be matched in this case.
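
The date-based post-processing example can be sketched as follows. The classes and the attach_violations_by_date function are hypothetical; treating the date bounds as inclusive and treating a period with no termination date as still open are both assumptions.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Violation:
    """Hypothetical stand-in for a StateSupervisionViolation."""
    violation_date: date


@dataclass
class SupervisionPeriod:
    """Hypothetical stand-in for a StateSupervisionPeriod."""
    start_date: date
    termination_date: Optional[date] = None
    violations: List[Violation] = field(default_factory=list)


def attach_violations_by_date(
    periods: List[SupervisionPeriod], violations: List[Violation]
) -> List[Violation]:
    """Attach each violation to every period whose date range contains it.

    Returns the violations that fell inside no period at all.
    """
    unmatched = []
    for violation in violations:
        attached = False
        for period in periods:
            end = period.termination_date or date.max  # open period: no upper bound
            if period.start_date <= violation.violation_date <= end:
                period.violations.append(violation)
                attached = True
        if not attached:
            unmatched.append(violation)
    return unmatched
```

Returning the unmatched violations lets the caller decide what to do with them rather than dropping them silently.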

Known limitations

No merging of "standalone" entities

Entity matching works exclusively within a single StatePerson tree. As standalone entities, by definition, apply to multiple people at once, we currently do not merge standalone entities - even if they share the same external_id.

This means that grouping of standalone entities must occur later in either BigQuery or Dataflow. Future schema updates should lift this limitation.

Matching happens serially

Entity matching can be relatively slow. Currently we loop through all ingested StatePersons and attempt to match them to DB entities serially. This could be optimized by parallelizing parts of this process.

No entity deletion

This is not an entity matching specific problem, but we do not have ways to safely delete entities from our schema during ingest. In entity matching, we can occasionally create "dangling placeholders". These are placeholder objects that have no non-placeholder children. In an ideal world, we’d delete these entities, as they serve no purpose; however we cannot do that currently.

There is no technical issue with these dangling placeholders; they are just extra clutter in our tables that isn’t immediately intuitive. We intend to add support for safe deletion of entities during ingest, which would solve this problem of dangling placeholders.

Entity matching errors stop updates midway through the tree

Occasionally there are errors in entity matching. This can happen when one ingested entity matches multiple DB entities, or vice versa. When this happens, entity matching is stopped for the specific StatePerson tree, and an error is counted. At this point the StatePerson tree could be partially updated (i.e. we matched and updated some higher-level entities, but an error was thrown at an entity much farther down the entity tree). When there are enough errors within a single ingest run, ingest halts altogether. However if we are beneath that threshold, we persist the results of the ingest run.

This means that if a small number of entity matching errors are thrown, such that we stay beneath our threshold, we could ingest partially updated information for the problematic StatePerson trees. In practice, we set error thresholds in production to low enough levels that this should rarely happen, but there is a non-zero chance and the effects are difficult to predict.
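
The threshold behavior described above can be sketched like this. The names match_all, EntityMatchingError, and max_error_ratio are hypothetical, and expressing the threshold as a ratio of failed trees is an assumption.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class EntityMatchingError(Exception):
    """Raised when one tree cannot be matched unambiguously (hypothetical name)."""


def match_all(
    trees: Sequence[T],
    match_one: Callable[[T], T],
    max_error_ratio: float = 0.1,
) -> Tuple[List[T], int]:
    """Match trees one at a time, tolerating a bounded share of failures."""
    matched: List[T] = []
    errors = 0
    for tree in trees:
        try:
            matched.append(match_one(tree))
        except EntityMatchingError:
            # Matching stops for this tree only. If match_one updates DB
            # entities in place, some of them may already have been changed
            # before the error was raised.
            errors += 1
    if trees and errors / len(trees) > max_error_ratio:
        raise RuntimeError(f"{errors} of {len(trees)} trees failed entity matching")
    return matched, errors
```

The comment in the except branch marks where the partial-update risk comes from: the error is counted, but nothing rolls back changes already made to that tree.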

Developer tips to make entity matching easier

There are several things to keep in mind when creating new ingest views that should help avoid any entity matching issues.

Link all ingest views to the state-provided person ID

Entity matching is much more efficient when the StatePerson entity is not a placeholder entity (we can immediately find a match and restrict the search space for all children). For this reason we want all of our ingest views to include the external id of a StatePerson.

This should always be possible thanks to SQL-based preprocessing!

Eliminate placeholder entities when possible

If it’s possible to construct your ingest views such that there are external ids for every entity in the chain of ingested entities (i.e. no placeholders), you should do that. Placeholder entities make entity matching less efficient and clutter the DB. Because the DB is exported to the data warehouse, that clutter also reduces readability for folks conducting data exploration and analysis.

Give everything an external id when possible

If a single column, or a combination of columns, on the state side creates a primary/unique key for one of our entities, use it as an external_id. If there are ways to create stable external ids ourselves, then do that. Entity matching is much simpler and more efficient when more entities have external ids.
