
Releases: openvenues/libpostal

Walla Walla

09 May 18:37
9c97597

This release adds three important groups of functions to libpostal's C API to support the lieu address/venue deduping project. The APIs are somewhat low-level at this point, but should still be useful in a wide range of geo applications, particularly for batch geocoding large data sets. This is the realization of some of the earlier work on address expansion.

Near-duplicate detection

Near-dupe hashing builds on the expand_address functionality to allow hashing a parsed address into strings suitable for direct comparison and automatic clustering. The hash keys are used to group similar records together prior to pairwise deduping so that we don't need to compare every record to every other record (i.e. N² comparisons). Instead, if we have a function that can generate the same hash key for records that are possible dupes (like "100 Main" and "100 E Main St") while also being highly selective, we can ensure that most duplicates will be captured for further comparison downstream, and that dissimilar records can be safely considered non-dupes. In a MapReduce context, near-dupe hashes can be used as keys to ensure that possible dupes will be grouped together on the same shard for pairwise checks, and in a search/database context, they can be used as an index for quick lookups of candidate dupes before running more thorough comparisons with the few records that match the hash. This is the first step in the deduping process: identifying candidate dupes. It can be thought of as the blocking function in record linkage (albeit a highly selective one) or as a form of locality-sensitive hashing from the near-duplicate detection literature. Libpostal's near-dupe hashes use a combination of several new features of the library (a usage sketch follows the list below):

  1. Address root expansions: removes ignorable tokens in street names, such as "Ave", "Pl", "Road", etc., so that something like "West 125th St" can potentially match "W 125". This also allows for exact comparison of apartment numbers, where "Apt 2" and "#2" mean the same thing. Each address component uses certain dictionaries in libpostal to determine what is ignorable or not, and although the method is rule-based and deterministic, it can also identify the correct root tokens in many complex cases like "Avenue Rd", "Avenue E", "E St SE", "E Ctr St", etc. While many of the test cases used so far are for English, libpostal's dictionary structure also allows it to work relatively well around the world, e.g. matching Spanish street names where "Calle" might be included in a government data set but is rarely used colloquially or in self-reported addresses. For street names, we additionally strip whitespace from the root expansion so that "Sea Grape Ln" and "Seagrape Ln" both normalize to "seagrape".

  2. Phonetic matching for names: the near-dupe hashes for venue/place/company names written in Latin script include a modified version of the double metaphone algorithm, which can be useful for comparing misspelled human names, as well as for comparing machine transliterations against human ones in languages where names might be written in multiple scripts in different data sets, e.g. Arabic or Japanese.

  3. Geo qualifiers: for address data sets with lat/lons, geohash tiles (with a precision of 6 characters by default) and their 8 neighbors (to avoid mismatches at tile boundaries) are used to narrow down the comparisons to addresses/places in a similar location. If there's no lat/lon and the data are known to be from a single country, the postal code or the city name can optionally be used as the geo qualifier. Future improvements include disambiguating toponyms and mapping them to IDs in a hierarchy, such that multiple names for a city, etc. can resolve to one or more IDs, and an NYC address that uses a neighborhood name in place of the city (e.g. "Harlem, NY") could match "New York, NY" by traversing the hierarchy and outputting the city's ID instead.

  4. Acronym generation: when generating hash keys for names, we also try to guess acronyms when there are two or more tokens. While the acronym alignments downstream can be more exact, there are cases like "Brooklyn Academy of Music" vs. "BAM" which would not match on a single token using the previous methods, but are the same place nonetheless. Acronym generation attempts to solve this problem for venue names, human initials, etc. It is stopword-aware in various languages, but generates multiple potential forms of the acronym for cases like "Museum of Modern Art" vs. "MoMA", where the stopword is included in the acronym. The method also includes rudimentary support for sub-acronyms, i.e. generating new tokens at each subsequent non-contiguous stopword (so "University of Texas at Austin" will produce "UT"), as well as at punctuation marks like commas or colons (e.g. "University of Texas, Austin"). Acronym generation also leaves intact any known dictionary phrases that resolve to multi-word phrases, so e.g. "Foo High School" and "Foo HS" can share a hash key, and keeps any acronyms the tokenizer identifies from their internal period structure, like "A.B.C." All of this also works across scripts for any non-ideographic languages. Though this method will not cover all valid acronyms directly, we run double metaphone on the output in Latin scripts, so with certain types of more complex acronyms like "sonar" vs. "Sound Navigation and Ranging", where more than the initial character of each word is used, the generated acronym may still be phonetically the same as the real acronym (in this case, both resolve to "SNR"), especially if the other letters used are vowels.

  5. Name quad-grams: in order to handle a variety of spelling differences, as well as languages with longer concatenated words as in German, all of the name hashes mentioned above are converted into unique sequences of 4 characters. Simple numeric tokens consisting of digits, hyphens, etc. are included as-is, without the quadgram or double metaphone encoding. For the words that do use quadgrams, there is no token boundary disambiguation, i.e. the double metaphone for "Nationalgalerie" would be "NXNLKLR", which would then generate the strings ["NXNL", "XNLK", "NLKL", "LKLR"] (as opposed to the behavior of our language classifier's features, which would generate ["NXNL_", "_XNLK_", "_NLKL_", "_LKLR"], using the underscore to indicate whether the 4-gram is at the beginning, middle, or end of the word). Since the fully concatenated name without whitespace is also double-metaphone-encoded and hashed to 4-grams in this fashion, it should also account for variance in the phonetic output resulting from spacing/concatenation differences (vowels at the beginning of a word are usually converted to "A" in double metaphone, whereas they're ignored in the middle of a word). Overall, though this method slightly reduces the precision of near-dupe hashing (it causes us to identify more potential dupes and thus make more pairwise comparisons in the next stage), it also increases the recall of the process (we don't miss as many true dupes).
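
As a rough usage sketch, here's how the near-dupe hashing API in this release can be called from C. The option fields and cleanup call follow this release's libpostal.h; the components and coordinates are made up for illustration:

```c
#include <stdio.h>
#include <libpostal/libpostal.h>

int main(void) {
    // The language classifier is used when no languages are passed explicitly
    if (!libpostal_setup() || !libpostal_setup_language_classifier()) {
        return 1;
    }

    // A parsed/labeled address: per-component labels and values
    char *labels[] = {"house_number", "road", "city", "state"};
    char *values[] = {"100", "E Main St", "Walla Walla", "WA"};

    libpostal_near_dupe_hash_options_t options = libpostal_get_near_dupe_hash_default_options();
    options.with_latlon = true;    // use geohash tiles + their 8 neighbors as the geo qualifier
    options.latitude = 46.0646;
    options.longitude = -118.3430;

    size_t num_hashes = 0;
    char **hashes = libpostal_near_dupe_hashes(4, labels, values, options, &num_hashes);

    // Any two records sharing one of these keys become candidates
    // for the pairwise dupe checks described below
    for (size_t i = 0; i < num_hashes; i++) {
        printf("%s\n", hashes[i]);
    }

    libpostal_expansion_array_destroy(hashes, num_hashes);
    libpostal_teardown();
    libpostal_teardown_language_classifier();
    return 0;
}
```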

Component-wise deduping

Once we have potential candidate dupe pairs, we provide per-component methods for comparing address/name pairs and determining whether they're duplicates. Each relevant address component has its own function, each with its own logic, including which libpostal dictionaries to use and whether a root expansion match counts as an exact duplicate or not. For instance, for a secondary unit, "# 2", "Apt 2", and "Apt # 2" can be considered an exact match in English, whereas we wouldn't want to make that kind of assumption for street names, e.g. "Park Ave" vs. "Park Pl". In the latter case, we can still classify the street names as needing to be reviewed by a human.

The duplicate functions return one of the following values:

  • LIBPOSTAL_NULL_DUPLICATE_STATUS
  • LIBPOSTAL_NON_DUPLICATE
  • LIBPOSTAL_POSSIBLE_DUPLICATE_NEEDS_REVIEW
  • LIBPOSTAL_LIKELY_DUPLICATE
  • LIBPOSTAL_EXACT_DUPLICATE

The likely and exact classifications can be considered duplicates and merged automatically, whereas the needs_review response flags possible duplicates for human review.

Having special functions for each component can also be useful down the line e.g. for deduping with compound house numbers/ranges (though this is not implemented yet).
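
For example, the unit and street checks described above might be called like this (a minimal sketch using the duplicate functions and statuses from this release's libpostal.h; the expected statuses in the comments follow the description above):

```c
#include <stdio.h>
#include <libpostal/libpostal.h>

int main(void) {
    if (!libpostal_setup()) return 1;

    libpostal_duplicate_options_t options = libpostal_get_default_duplicate_options();

    // Root expansions make "Apt 2" and "# 2" an exact match in English
    libpostal_duplicate_status_t unit_status =
        libpostal_is_unit_duplicate("Apt 2", "# 2", options);

    // "Park Ave" vs. "Park Pl" share a root token but not a thoroughfare type,
    // so the street check should only flag the pair for review
    libpostal_duplicate_status_t street_status =
        libpostal_is_street_duplicate("Park Ave", "Park Pl", options);

    if (street_status == LIBPOSTAL_POSSIBLE_DUPLICATE_NEEDS_REVIEW) {
        // route the pair to a human review queue rather than merging automatically
    }

    printf("unit: %d street: %d\n", (int)unit_status, (int)street_status);
    libpostal_teardown();
    return 0;
}
```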

Since identifying the correct language is crucial to effective matching, and individual components like house_number and unit may not provide any useful information about the language, we also provide a function that returns the language(s) for an entire parsed/labeled address using all of its textual components. The returned language codes can then be reused for subsequent calls to the per-component duplicate functions.
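
A sketch of classifying the language once per record and reusing it (function names per this release's libpostal.h; the address and the abbreviated comparison are illustrative):

```c
#include <stdio.h>
#include <libpostal/libpostal.h>

int main(void) {
    if (!libpostal_setup() || !libpostal_setup_language_classifier()) return 1;

    char *labels[] = {"house_number", "road", "city"};
    char *values[] = {"123", "Calle de Alcalá", "Madrid"};

    // One language classification for the whole parsed address
    size_t num_languages = 0;
    char **languages = libpostal_place_languages(3, labels, values, &num_languages);

    // Reuse the detected language(s) for all per-component dupe checks
    libpostal_duplicate_options_t options = libpostal_get_default_duplicate_options();
    if (num_languages > 0) {
        options.num_languages = num_languages;
        options.languages = languages;
    }

    libpostal_duplicate_status_t status =
        libpostal_is_street_duplicate("Calle de Alcalá", "C/ Alcalá", options);
    printf("street: %d\n", (int)status);

    libpostal_teardown();
    libpostal_teardown_language_classifier();
    return 0;
}
```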

Fuzzy deduping for names

For venue/street names, we also want to be able to handle inexact name matches, minor spelling differences, words out of order (often seen with human names, which can sometimes be listed as "Last, First Middle"), and the removal of tokens that may not be ignorable in terms of libpostal's dictionaries but are very common, either overall or in a particular geography.

In this release, we build on the idea of Soft-TFIDF, which blends a local similarity function (usually Jaro-Winkler in the literature, though we use a hybrid method) with global corpus statistics (TFIDF weights or other similar term weights, supplied by the user in our case; see the lieu project for details on constructing indices based on TFIDF or on information gain, which performs better in the venue deduping setting, as well as their per-geo variants).

Here's how it works:

  1. for strings s1 and s2, each token in s1 is aligned with its most similar token in s2 in terms of a user-specified local similarity metric, provided that it meets a specified similar...
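
A sketch of the fuzzy name comparison with user-supplied term weights (function and struct names per this release's libpostal.h; the tokens and weights are illustrative, and in practice the weights come from a TFIDF or information-gain index as in the lieu project):

```c
#include <stdio.h>
#include <libpostal/libpostal.h>

int main(void) {
    if (!libpostal_setup()) return 1;

    // Normalized tokens for two venue names, with caller-supplied weights
    char *tokens1[] = {"barclays", "center"};
    double scores1[] = {0.7, 0.3};
    char *tokens2[] = {"barclay's", "ctr"};
    double scores2[] = {0.7, 0.3};

    libpostal_fuzzy_duplicate_options_t options = libpostal_get_default_fuzzy_duplicate_options();

    libpostal_fuzzy_duplicate_status_t result = libpostal_is_name_duplicate_fuzzy(
        2, tokens1, scores1, 2, tokens2, scores2, options);

    // result.status is one of the duplicate statuses listed earlier;
    // result.similarity is the blended score that determined it
    printf("status: %d similarity: %f\n", (int)result.status, result.similarity);

    libpostal_teardown();
    return 0;
}
```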

Brooklyn 99

07 Apr 21:48

The great parser-data merge is complete. Libpostal 1.0 features a better-than-ever international address parser which achieves 99.45% full-parse accuracy on held-out addresses. The release title is a reference to the TV show (libpostal was also created in Brooklyn and this was the first version of the model to surpass 99% accuracy). Check out the blog post for the details. Here's a sample of what it can do in a GIF:

[GIF: animated demo of the address parser]

Breaking API Changes

  • Every function, struct, constant, etc. defined in the public header (libpostal.h) now uses a "libpostal_" prefix. This affects all bindings that call the C API. The bindings that are part of this GitHub org all have 1.0 branches.
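
For instance, a 0.x call like expand_address(...) now looks roughly like this (a sketch; the option struct and default-options helper likewise carry the prefix):

```c
#include <libpostal/libpostal.h>

int main(void) {
    if (!libpostal_setup() || !libpostal_setup_language_classifier()) return 1;

    size_t num_expansions = 0;
    // was: expand_address(...) in 0.x
    char **expansions = libpostal_expand_address("30 W 26th St Fl #7",
                                                 libpostal_get_default_options(),
                                                 &num_expansions);

    libpostal_expansion_array_destroy(expansions, num_expansions);
    libpostal_teardown();
    libpostal_teardown_language_classifier();
    return 0;
}
```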

New tags

Sub-building tags

  • unit: an apartment, unit, office, lot, or other secondary unit designator
  • level: expressions indicating a floor number e.g. "3rd Floor", "Ground Floor", etc.
  • staircase: numbered/lettered staircase
  • entrance: numbered/lettered entrance
  • po_box: post office box; typically found in non-physical (mail-only) addresses

Category tags

  • category: for category queries like "restaurants", etc.
  • near: phrases like "in", "near", etc. used after a category phrase to help with parsing queries like "restaurants in Brooklyn"

New admin tags

  • island: named islands e.g. "Maui"
  • country_region: informal subdivision of a country without any political status
  • world_region: currently only used for appending “West Indies” after the country name, a pattern frequently used in the English-speaking Caribbean e.g. “Jamaica, West Indies”
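
A quick sketch of parsing an address that exercises the new sub-building tags (the input is illustrative, and the exact labels depend on the model's prediction):

```c
#include <stdio.h>
#include <libpostal/libpostal.h>

int main(void) {
    if (!libpostal_setup() || !libpostal_setup_parser()) return 1;

    libpostal_address_parser_options_t options = libpostal_get_address_parser_default_options();
    libpostal_address_parser_response_t *parsed = libpostal_parse_address(
        "123 Main St 3rd Fl Apt 2 Brooklyn NY 11216", options);

    for (size_t i = 0; i < parsed->num_components; i++) {
        // labels now include unit, level, po_box, island, etc.
        printf("%s: %s\n", parsed->labels[i], parsed->components[i]);
    }

    libpostal_address_parser_response_destroy(parsed);
    libpostal_teardown();
    libpostal_teardown_parser();
    return 0;
}
```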

No more accent-stripping/transliteration of input

There's a new transliterator which only makes simple modifications to the input (HTML entity normalization, NFC unicode normalization). Latin-ASCII transliteration is no longer used at runtime. Instead, addresses are transliterated to multiple forms during training so the parser has to deal with all the variants, rather than normalizing both the training data and the runtime input to a single variant (which previously was not even correct in cases like Finnish, Turkish, etc.).

Trained on > 1 billion examples in every inhabited country on Earth

The training data for libpostal's parser has been greatly expanded to include every country and dependency in OpenStreetMap. We also train on a places-only data set where every city name from OSM gets some representation even if there are no addresses (higher-population cities get examples proportional to their population). A similar training set is constructed for streets, so even places which have very few addresses but do have a road network in OSM can be included.

1.0 also moves beyond OSM, training on most of the data sets in OpenAddresses, and postal codes + associated admins from Yahoo's GeoPlanet, which includes virtually every postcode in the UK, Canada, etc.

Almost 100GB of public training data

All files can be found under s3://libpostal/training_data/YYYY-MM-DD/parser/ as gzip'd tab-separated values (TSV) files formatted like: language\tcountry\taddress.
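
For reference, a row in the tagged files looks roughly like the following (a made-up example with each token carrying its label after a slash; check the files themselves for the exact conventions):

```
en	us	123/house_number main/road st/road brooklyn/city ny/state 11216/postcode
```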

  • formatted_addresses_tagged.random.tsv.gz (ODBL): OSM addresses. Apartments, PO boxes, categories, etc. are added primarily to these examples
  • formatted_places_tagged.random.tsv.gz (ODBL): every toponym in OSM (even cities represented as points, etc.), reverse-geocoded to its parent admins, possibly including postal codes if they're listed on the point/polygon. Every place gets a base level of representation and places with higher populations get proportionally more.
  • formatted_ways_tagged.random.tsv.gz (ODBL): every street in OSM (ways with highway=*, with a few conditions), reverse-geocoded to its admins
  • geoplanet_formatted_addresses_tagged.random.tsv.gz (CC-BY): every postal code in Yahoo GeoPlanet (includes almost every postcode in the UK, Canada, etc.) and their parent admins. The GeoPlanet admins have been cleaned up and mapped to libpostal's tagset
  • openaddresses_formatted_addresses_tagged.random.tsv.gz (various licenses, mostly CC-BY): most of the address data sets from OpenAddresses, which in turn come directly from government sources
  • uk_openaddresses_formatted_addresses_tagged.random.tsv.gz (CC-BY): addresses from OpenAddresses UK

If the parser doesn't perform as well as you'd hoped on a particular type of address, the best recourse is to use grep/awk to look through the training data and try to determine if there's some pattern/style of address that's not being captured.

Better feature extraction

  • n-grams for the "unknown" words (words that occurred fewer than n times in the training set)
  • for unknown words that are hyphenated, each of the individual subwords if frequent enough, and their n-grams otherwise
  • an index of postcodes and their admin contexts built from the training data (the intuition is that something like "10001" could be a postcode or a house number, but if words like "New York", "NY", "United States", etc. are to its right or left, it's more likely to be a postcode).
  • for first words that are unknown (could be part of a venue/business name, could be a rare/misspelled street), a feature which finds the relative position of the next number and the next address phrase if present. Usually if the parser gets the first word in the string correct it will get the entire string correct.

More powerful machine learning model (CRF)

libpostal 1.0 uses a Conditional Random Field (CRF) instead of the greedy averaged perceptron. This more powerful machine learning method scores sequences rather than individual decisions, and can revise its previous decision if that would help a subsequent token score higher (Viterbi inference).

It improves upon the CRFsuite implementation in terms of:

  1. performance: Viterbi inference sped up by 2x
  2. scalability: training set doesn't need to fit in memory
  3. model expressiveness: libpostal's CRF adds state-transition features which can make use of both the state of the current token and the previous tag. These act just like normal features except their weights are L×L matrices (tags we could have transitioned from × tags we could transition to) instead of L vectors.
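
Schematically (our notation, not from the codebase): with per-token features f(x, t), per-tag weights w, and a transition weight matrix A, the model scores a whole tag sequence, and Viterbi finds the best-scoring sequence via the usual recurrence:

```latex
\mathrm{score}(y \mid x) = \sum_{t} \left( w_{y_t} \cdot f(x, t) + A_{y_{t-1},\, y_t} \right),
\qquad
\delta_t(j) = w_j \cdot f(x, t) + \max_{i} \left[ \delta_{t-1}(i) + A_{i,j} \right]
```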

FTRL-Proximal optimization for the language classifier

The language classifier now uses a multinomial version of Google's FTRL-Proximal method, which uses a combination of L1 and L2 regularization, inducing sparsity while maintaining high accuracy. This results in a model that is more accurate than the previous classifier while being 1/10th the size. The runtime classifier is now able to load either sparse or dense weights depending on the file header.
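
For reference, the per-coordinate FTRL-Proximal update from McMahan et al. that produces this sparsity looks like the following (our notation: z accumulates adjusted gradients, n accumulates squared gradients, α and β are learning-rate parameters):

```latex
w_{t+1,\, i} =
\begin{cases}
0 & \text{if } |z_i| \le \lambda_1 \\[4pt]
-\dfrac{z_i - \operatorname{sign}(z_i)\,\lambda_1}{\frac{\beta + \sqrt{n_i}}{\alpha} + \lambda_2} & \text{otherwise}
\end{cases}
```

The L1 threshold λ₁ zeroes out any coordinate whose accumulated gradient stays small, which is what shrinks the stored model while the L2 term keeps the remaining weights well-behaved.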

Quito

09 Feb 20:04

Fixes small memory leak mentioned in #160

Ulaanbaatar

09 Jan 22:28

This release adds functions for configuring the datadir at runtime.

  • libpostal_setup_datadir(char *dir)
  • libpostal_setup_language_classifier_datadir(char *dir)
  • libpostal_setup_parser_datadir(char *dir)
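
Usage sketch (the path is illustrative; like the other setup functions, these return a success flag):

```c
#include <libpostal/libpostal.h>

int main(void) {
    char *datadir = "/opt/libpostal_data";  // illustrative path

    // Point each of the three setup steps at the same custom data directory
    if (!libpostal_setup_datadir(datadir) ||
        !libpostal_setup_parser_datadir(datadir) ||
        !libpostal_setup_language_classifier_datadir(datadir)) {
        return 1;
    }

    /* ... use the expand/parse APIs ... */

    libpostal_teardown();
    libpostal_teardown_parser();
    libpostal_teardown_language_classifier();
    return 0;
}
```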

Nagoya

20 Dec 19:04

Merged a few of the commits from the parser-data branch of libpostal into master, fixing address parser training from the master branch.

Coincides with the release of some of the parser training data generated in the parser-data branch:

  1. OSM training addresses (27GB, ODBL)
    This is (a much-improved version of) the original data set used to train libpostal.
  2. OSM formatted place names/admins (4GB, ODBL)
    Helpful for making sure all the place names (cities, suburbs, etc.) in a country are part of the training set for libpostal, even if there are no addresses for that place.
  3. GeoPlanet postal codes with admins (11GB, CC-BY)
    Contains many postal codes from around the world, including the 1M+ postcodes in the UK, and their associated admins. If training on master, this may or may not help, because master still relies fairly heavily on GeoNames for postcodes.
  4. OpenAddresses training addresses (30GB, various licenses)
    By far the largest data set. It's not every source from OpenAddresses, just the ones that are suitable for ingestion into libpostal. It's heavy on North America but also contains many of the EU countries. Most of the sources only require attribution, some have share-alike clauses. See openaddresses.io for more details.

Users are encouraged to QA the data for problematic patterns, etc. Note: while it's now possible to train per-country/language parsers using slices of the data, no support will be offered for custom parsers.

Release is named after the largest train station in the world: https://en.wikipedia.org/wiki/Nagoya_Station.

Ho Chi Minh

29 Aug 19:31
  • Data download script separates S3 files into 64MB chunks (as used in awscli) and uses a process pool
  • Various dictionary updates submitted by users

Oslo

26 May 17:04

Initial release of libpostal as described in the introductory blog post.