Data Normalization

Joshua Essex edited this page Oct 4, 2019 · 1 revision

What is the Data Converter?

The data converter is responsible for converting the parsed IngestInfo proto into a standardized Python object. This process involves steps such as string normalization, date parsing, inferring unset fields, and enum parsing.

Coding

Some incoming raw fields are parsed and coded into Enums. For example, a string parsed as a ChargeDegree must be mappable to one of: FIRST, SECOND, THIRD, FOURTH, or EXTERNAL_UNKNOWN.

Because all strings for an enum field must map to an enum value, each enum defines mappings from strings to enum values. For example, the string '1' and the string 'FIRST' both map to the enum ChargeDegree.FIRST. These mappings are defined in each enum’s class and are accessed by the converter using the enum’s parse() method.

For example: ChargeDegree.parse('1') == ChargeDegree.FIRST
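As an illustration, a parse() method of this kind could be sketched as follows. This is a minimal sketch, not the project's implementation: the module-level table `_CHARGE_DEGREE_MAP` is a hypothetical name, only the '1' and 'FIRST' entries come from the text above, and returning None for an unmapped string is an assumption.

```python
from enum import Enum
from typing import Dict, Optional


class ChargeDegree(Enum):
    FIRST = "FIRST"
    SECOND = "SECOND"
    THIRD = "THIRD"
    FOURTH = "FOURTH"
    EXTERNAL_UNKNOWN = "EXTERNAL_UNKNOWN"

    @classmethod
    def parse(cls, raw_text: str) -> Optional["ChargeDegree"]:
        """Return the enum value mapped to raw_text, or None if unmapped (assumed)."""
        return _CHARGE_DEGREE_MAP.get(raw_text.strip().upper())


# Hypothetical mapping table. It lives outside the class body because any
# plain attribute assigned inside an Enum body would itself become a member.
_CHARGE_DEGREE_MAP: Dict[str, ChargeDegree] = {
    "1": ChargeDegree.FIRST,
    "FIRST": ChargeDegree.FIRST,
    "2": ChargeDegree.SECOND,   # illustrative entry, not from the source
    "SECOND": ChargeDegree.SECOND,  # illustrative entry, not from the source
}
```

With this sketch, ChargeDegree.parse('1') and ChargeDegree.parse('FIRST') both return ChargeDegree.FIRST.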

Note that raw text is never lost: for every enum-typed field in the schema, there is a string field for tracking the raw incoming text, e.g. race and race_raw_text.

Special Values

EXTERNAL_UNKNOWN is a special value that exists for many enums in the schema. This value should only be used if the data source explicitly lists a value as "unknown" or the equivalent; if a value is simply absent, the field should be left without any enum value.

PRESENT_WITHOUT_INFO is another special value, specifically for "status" enums. It should not be directly used when creating new ingest mappings. It is used by the data converter for status enums to denote that no status for an entity was provided by the source, but the entity itself was found in the source.

REMOVED_WITHOUT_INFO is a final special value. It is only used in the situation where both a) an entity is removed from a data source, and b) we cannot infer anything about what removal means (i.e. INFER_DROPPED).

Global vs Jurisdictional

The data converter relies on a global mapping from strings to enums for each distinct enum type. The converter first normalizes the raw string by converting it to upper case, replacing all punctuation with whitespace, and stripping leading and trailing whitespace. The normalized text is then checked against the map to find the appropriate enum code. If there is no available mapping, the enum field is not set.
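The normalization and lookup steps described above can be sketched roughly as follows. The names `normalize`, `code_charge_degree`, and `GLOBAL_CHARGE_DEGREE_MAP` are hypothetical, the trimmed-down enum is a stand-in, and treating "punctuation" as Python's `string.punctuation` is an assumption. The sketch also does not collapse runs of internal whitespace, since the text does not say whether the converter does.

```python
import string
from enum import Enum
from typing import Dict, Optional

# Translation table that maps every ASCII punctuation character to a space.
_PUNCTUATION_TO_SPACE = str.maketrans(
    string.punctuation, " " * len(string.punctuation)
)


class ChargeDegree(Enum):
    """Trimmed-down stand-in for the real enum."""
    FIRST = "FIRST"
    EXTERNAL_UNKNOWN = "EXTERNAL_UNKNOWN"


# Hypothetical global map for one enum type, keyed by normalized text.
GLOBAL_CHARGE_DEGREE_MAP: Dict[str, ChargeDegree] = {
    "1": ChargeDegree.FIRST,
    "FIRST": ChargeDegree.FIRST,
}


def normalize(raw_text: str) -> str:
    """Upper-case, replace punctuation with spaces, strip the ends."""
    return raw_text.upper().translate(_PUNCTUATION_TO_SPACE).strip()


def code_charge_degree(raw_text: str) -> Optional[ChargeDegree]:
    """Look up the normalized text; None means the field is left unset."""
    return GLOBAL_CHARGE_DEGREE_MAP.get(normalize(raw_text))
```

Under these assumptions, an input such as ' (first) ' normalizes to 'FIRST' and is coded as ChargeDegree.FIRST, while an unmapped string yields None.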

Individual jurisdictions can override this default behavior by defining specific enum overrides. This is useful primarily when a jurisdiction uses values whose meaning is not consistent across sources, e.g. a jurisdiction that uses integers to denote specific genders or races. The jurisdiction's enum overrides take precedence over the global mapping for these values. Override maps can also mark values as ignorable, so that when a given raw text value is provided, it is explicitly ignored and no enum value is set.
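The precedence between a jurisdiction's overrides and the global mapping might look like the sketch below. Everything here is illustrative: the map names, the integer codes, the ignored string, and the Gender members other than EXTERNAL_UNKNOWN are assumptions, and checking the ignore set before the override map is a design choice of this sketch rather than documented behavior.

```python
from enum import Enum
from typing import Dict, Optional, Set


class Gender(Enum):
    """Stand-in enum; MALE and FEMALE are assumed member names."""
    MALE = "MALE"
    FEMALE = "FEMALE"
    EXTERNAL_UNKNOWN = "EXTERNAL_UNKNOWN"


# Hypothetical global map shared by every jurisdiction.
GLOBAL_GENDER_MAP: Dict[str, Gender] = {
    "MALE": Gender.MALE,
    "FEMALE": Gender.FEMALE,
}

# Hypothetical jurisdiction that encodes gender as integers.
JURISDICTION_OVERRIDES: Dict[str, Gender] = {
    "1": Gender.MALE,
    "2": Gender.FEMALE,
}

# Hypothetical values this jurisdiction wants dropped entirely.
JURISDICTION_IGNORES: Set[str] = {"NOT RECORDED"}


def code_gender(normalized_text: str) -> Optional[Gender]:
    """Ignored values first, then overrides, then the global map."""
    if normalized_text in JURISDICTION_IGNORES:
        return None  # explicitly ignored: no enum value is set
    if normalized_text in JURISDICTION_OVERRIDES:
        return JURISDICTION_OVERRIDES[normalized_text]
    return GLOBAL_GENDER_MAP.get(normalized_text)
```

Here '1' resolves through the jurisdiction's override, 'FEMALE' falls through to the global map, and 'NOT RECORDED' produces no enum value at all.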

String values can exist in many different enum type maps without conflict, since the maps are used in the context of coding a specific type. For example, "MEDIUM" exists in both the global mapping for assessment levels and for supervision period levels.

How to use Enum Overrides

Enum Overrides can be set for each individual jurisdiction by overriding the get_enum_overrides() method in the scraper (for scraper ingest) or controller (for direct ingest). This method returns an EnumOverrides object that contains additional mappings based on exact string matches and predicates. The object can also specify strings that should be ignored.

# Map all parsed gender strings 'U' to Gender.EXTERNAL_UNKNOWN
overrides_builder.add('U', Gender.EXTERNAL_UNKNOWN)

# Ignore all parsed gender strings that start with the string 'NONE'
overrides_builder.ignore(lambda s: s.startswith('NONE'), Gender)

Additionally, EnumOverrides can define mappings across enum fields within the same entity object. For example, a website may list values that we parse as Race under the Gender field. The overrides object could then define a custom mapping:

# Map all parsed gender strings 'B' to Race.BLACK
overrides_builder.add('B', Race.BLACK, Gender)

Because the BaseScraper includes a set of default enum overrides (e.g. to correctly map ethnicities that are parsed as a race), additional enum overrides should be built on top of the defaults rather than from scratch: start from the builder returned by super(MyNewScraper, self).get_enum_overrides().to_builder() and add the jurisdiction's mappings to it.
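The overall shape of such an override can be sketched with simplified stand-ins. The classes below are not the project's real EnumOverrides, builder, or BaseScraper: they only mimic the add() and to_builder() calls shown on this page so that the example runs on its own. The build() method name, the `mappings` attribute, and the default 'M' mapping are assumptions.

```python
from enum import Enum


class Gender(Enum):
    """Stand-in enum; MALE is an assumed member name."""
    MALE = "MALE"
    EXTERNAL_UNKNOWN = "EXTERNAL_UNKNOWN"


class EnumOverrides:
    """Stand-in: exact-match mappings keyed by (enum class, raw string)."""

    def __init__(self, mappings=None):
        self.mappings = dict(mappings or {})

    def to_builder(self):
        # The builder starts from a copy of the existing mappings.
        return EnumOverridesBuilder(self.mappings)


class EnumOverridesBuilder:
    """Stand-in builder; build() is an assumed method name."""

    def __init__(self, mappings=None):
        self._mappings = dict(mappings or {})

    def add(self, raw_text, enum_value, from_field=None):
        # The field being parsed defaults to the mapped value's own enum type.
        field = type(enum_value) if from_field is None else from_field
        self._mappings[(field, raw_text)] = enum_value

    def build(self):
        return EnumOverrides(self._mappings)


class BaseScraper:
    """Stand-in base class carrying one hypothetical default override."""

    def get_enum_overrides(self):
        builder = EnumOverridesBuilder()
        builder.add("M", Gender.MALE)
        return builder.build()


class MyNewScraper(BaseScraper):
    def get_enum_overrides(self):
        # Start from the inherited defaults, then layer on local mappings.
        overrides_builder = super(MyNewScraper, self).get_enum_overrides().to_builder()
        overrides_builder.add("U", Gender.EXTERNAL_UNKNOWN)
        return overrides_builder.build()
```

In this sketch the object returned by MyNewScraper().get_enum_overrides() contains both the inherited 'M' mapping and the new 'U' mapping, which is the point of building on the parent's overrides.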