incorrect parsing of Irish addresses with Eircodes #656

Open
freyfogle opened this issue Feb 14, 2024 · 2 comments
@freyfogle

Hi!

I was checking out libpostal, and saw something that could be improved.


My country is

Ireland


Here's how I'm using libpostal

Parsing addresses


Here's what I did

Tried to parse Irish addresses including Eircodes (relatively new Irish postcode format)

Example: Riverside House, Doneraile, P51 KT93, Ireland


Here's what I got

{
   "city" : "kt93",
   "country" : "ireland",
   "house" : "riverside house doneraile p51"
}

Here's what I was expecting

{
   "city" : "doneraile",
   "country" : "ireland",
   "house" : "riverside house",
   "postcode" : "p51 kt93"
}

For parsing issues, please answer "yes" or "no" to all that apply.

  • Does the input address exist in OpenStreetMap?
    no
  • Do all the toponyms exist in OSM (city, state, region names, etc.)?
    yes
  • If the address uses a rare/uncommon format, does changing the order of the fields yield the correct result?
    NA
  • If the address does not contain city, region, etc., does adding those fields to the input improve the result?
    removing the postcode leads to correct parsing
  • If the address contains apartment/floor/sub-building information or uncommon formatting, does removing that help? Is there any minimum form of the address that gets the right parse?
    NA

Here's what I think could be improved

Eircodes are relatively new and only now coming into common use, especially for deliveries.
They are not yet widely found in OpenStreetMap.
Still, the format is easy to identify and the parser should be able to recognize them.

@albarrentine
Contributor

Eircodes were just starting to roll out when the parser was initially trained, and very few examples were available because most people were still using the old system. For a future version I've thought about adding UK/Irish/Canadian and other similar postcodes directly to the tokenizer, since they follow regular patterns that are unambiguous with other token types. The model could then treat each one as a single token and handle it with a handful of type features instead of one feature for every normalized postcode-word. That saves space as well: those postcodes don't require geographic context, so they could be removed from the postcode index, which is stored efficiently as a trie but still clocks in at about 500MB. It would muck with the weights and require retraining the parser, though, which is not planned for the very near future, although there's some rearchitecting going on in the background.

This style of postcode only partially benefits from the classic NLP features that are used such as word shapes/digit masks because those would normalize to something like ["pDD" "ktDD"]. With enough training data that can work even without observing every possible postcode, but the data would need to capture every pattern sans digits (for the UK/Canada there were also training examples built off of a somewhat exhaustive list that then gets normalized to word/digit shapes).
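
To illustrate the digit-mask point, here is a minimal sketch of that kind of normalization. The helper names `digit_mask` and `shapes` are made up for this example and are not libpostal's actual feature code; the sketch only shows why two Eircodes with different letter layouts end up as different shapes that the training data would each need to cover.

```python
import re


def digit_mask(token):
    """Lowercase the token and replace every digit with 'D' (hypothetical helper)."""
    return re.sub(r"\d", "D", token.lower())


def shapes(postcode):
    """Apply the digit mask to each whitespace-separated token of a postcode."""
    return [digit_mask(tok) for tok in postcode.split()]


print(shapes("P51 KT93"))  # ['pDD', 'ktDD']
print(shapes("A65 F4E2"))  # ['aDD', 'fDeD']
```

The letters survive the mask, so every distinct letter/digit layout is a separate pattern as far as the model is concerned.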

One workaround is just to extract/remove with regex before parsing since they do follow regular patterns.
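
A rough sketch of that workaround in Python follows. The pattern is an assumption based on the commonly cited Eircode layout (a routing key of one letter plus two digits, or the special key `D6W`, followed by four characters from a restricted alphabet); check it against the official Eircode specification before relying on it. `split_eircode` is a hypothetical helper, not part of libpostal.

```python
import re

# Assumed Eircode pattern: routing key + 4-character unique identifier,
# optionally separated by a space or hyphen. Verify against the official spec.
EIRCODE_RE = re.compile(
    r"\b(?:[AC-FHKNPRTV-Y][0-9]{2}|D6W)[ -]?[0-9AC-FHKNPRTV-Y]{4}\b",
    re.IGNORECASE,
)


def split_eircode(address):
    """Return (address_without_eircode, eircode_or_None)."""
    match = EIRCODE_RE.search(address)
    if match is None:
        return address, None
    # Normalize to the "XXX XXXX" form in upper case.
    compact = re.sub(r"[ -]", "", match.group(0)).upper()
    eircode = compact[:3] + " " + compact[3:]
    # Cut the Eircode out and tidy up the separators left behind.
    remainder = address[:match.start()] + address[match.end():]
    remainder = re.sub(r"\s*,\s*(?:,\s*)+", ", ", remainder)
    remainder = re.sub(r"\s{2,}", " ", remainder).strip(" ,")
    return remainder, eircode


rest, code = split_eircode("Riverside House, Doneraile, P51 KT93, Ireland")
print(rest)  # Riverside House, Doneraile, Ireland
print(code)  # P51 KT93
```

The remainder can then be handed to the parser (for example `parse_address` from the Python bindings' `postal.parser` module), and the extracted Eircode attached to the result as the `postcode` component afterwards.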

@freyfogle
Author

Yes, we arrived at exactly the workaround you describe. I just wanted to make sure you are aware that libpostal does not deal well with Eircodes.

Feel free to close the issue if you like
