Skip to content
RDmitchell edited this page Apr 15, 2015 · 6 revisions

Overview

When a data set (CSV, Spreadsheet) is uploaded to the Web API, it is stored as a BLOB in the postgres DB as a row in the "ImportFile" table. The "ID" of this row is passed as a parameter to a Celery "task" (RPC call), using a Redis notification queue. This task is split up into "subtasks" that process

The backend worker processes

Web API

  • create_dataset
  • save_raw_data
  • get_column_mapping_suggestions

Matching Algorithm

The matching algorithm starts in the match_buildings method, by putting the main matching task out onto the celery queue. In this task, the unmatched buildings for the newly added data file are first checked to see if their building IDs exactly match any buildings in the current set of canonical buildings. During the ID matching, any of the three IDs (pm_property_id, tax_lot_id or custom_id_1) for each new building, is compared in turn with any of the pm_property_id, tax_lot_id or custom_id_1 IDs in each of the canonical buildings. Any agreement in the three ID types constitutes a match.

For buildings that do not match on IDs, a second fuzzy logic based method is employed. For each unmatched building, a single text string is generated that concatenates the 'tax_lot_id', 'pm_property_id', 'custom_id_1' and 'address_line_1', where specified, with single spaces separating the text items. This single string is intended to be a unique ID for the building that includes all four basic types of description data. The same process is applied using all of the buildings in the canonical buildings, producing a second set of text based building IDs. A fuzzy logic (ngram) search is then performed looking though the canonical buildings text IDs for matches with the text IDs of the matched buildings. The ngram search returns a measure of confidence in a match; any matches with a confidence of above 0.3 are considered a matching. The example below gives an example of buildings IDs that would have failed to match on their IDs but successfully match using the fuzzy logic method.

Unmatched building string ID = "1817 2595 spruce st" Canonical string ID= "3000417 1199 2595 spruce street"

After this process, buildings that match are saved as matching pairs; any buildings that are not matched are added to the canonical set as is.

The design intention of the fuzzy logic approach was that it would allow matching on inconsistent address syntax and potentially catch errors in spelling. A weakness in this approach is that by combining the numerical tax_lot, pm property, and custom ids, with the address it allows the possibility of false matches. False matches can occur when the two text based IDs are sufficiently similar such that the ngram returns a matching confidence that exceeds the minimum threshold of 0.3. Below is an example of Ids that give false matches.

Unmatched building string ID = "1817 2595 spruce st" Canonical string ID= "3000417 1421 spruce st"

Address Normalization

4/15/2015

We have deleted the fuzzy matching that happened after the ID matching. We have implemented an address normalization algorithm. After normalizing the address for the building record and the imported data, the program does an exact match on address.

We are using the following Python library for the address normalization: https://pypi.python.org/pypi/street-address/0.2.1