"L10" at start of address not recognised as "Level" - level_types_numbered.txt ineffective? #615

Open
karanj opened this issue Feb 9, 2023 · 1 comment

karanj commented Feb 9, 2023

Hi!

I was checking out libpostal, and saw something that could be improved.


My country is

Australia


Here's how I'm using libpostal

Parsing unstructured address data into structured data


Here's what I did

Parse L10, 10 Martin place sydney


Here's what I got

{
"road": "l10",
"house_number": "10",
"road": "martin place",
"city": "sydney"
}


Here's what I was expecting

{
"level": "level 10",
"house_number": "10",
"road": "martin place",
"city": "sydney"
}


For parsing issues, please answer "yes" or "no" to all that apply.

  • Does the input address exist in OpenStreetMap?
    No
  • Do all the toponyms exist in OSM (city, state, region names, etc.)?
    yes
  • If the address uses a rare/uncommon format, does changing the order of the fields yield the correct result?
    No
  • If the address does not contain city, region, etc., does adding those fields to the input improve the result?
    no
  • If the address contains apartment/floor/sub-building information or uncommon formatting, does removing that help? Is there any minimum form of the address that gets the right parse?

'lvl' seems to be the only form recognised as an abbreviation for Level

lvl 10 10 martin place sydney

Result:

{
"level": "lvl 10",
"house_number": "10",
"road": "martin place",
"city": "sydney"
}

Here's what I think could be improved

Given that "L" by itself is in the level_types_numbered.txt dictionary for language EN, I would have expected "L10" to be detected as a Level.

albarrentine (Contributor) commented Feb 15, 2024

"L" is included in the dictionary but is highly ambiguous. Libpostal is not rule-based: the dictionaries inform a machine learning model that tries to optimize the global sequence of all predictions, using features of the entire sequence as well as the sequence of states the model itself has predicted. In the address_parser CLI you can inspect what the model is doing for each word in the input, and where it might get stuck, by typing .print_features and then trying the test case. This prints the set of features used for each input word.

"L" followed by a number can mean many different things in our training data and is often a road. The road examples occur naturally, whereas unit information is randomly generated because apartment addresses are typically not included in OpenStreetMap/OpenAddresses, so there are probably enough counterexamples to outweigh the "level" reading.

Since it's a simple pattern, you can always preprocess the input, e.g. by replacing matches of the case-insensitive regex "\bl\.?\s*(?=\d)" with "lvl " or "level " to produce "lvl 10", "level 10", etc., and then parse that input instead. Alternatively, parse the input as given and, only if the result looks logically inconsistent (e.g. two spans marked as "road"), apply the regex replacement and parse again.
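
A rough Python sketch of the parse-then-retry idea described above, in case it helps. The function names are made up for illustration and are not part of libpostal. The regex is this sketch's own assumption: it only treats a word-initial "l" as "Level" when a digit follows, so words like "lane" are left untouched. The parser is passed in as a plain callable that returns (value, label) pairs, so nothing here depends on a particular binding.

```python
import re
from collections import Counter
from typing import Callable, List, Tuple

# One parsed address as (value, label) pairs, e.g. ("martin place", "road").
Parsed = List[Tuple[str, str]]

# Assumption of this sketch: "L" (optionally followed by "." and/or spaces)
# is only read as "Level" when a digit comes next.
LEVEL_PREFIX = re.compile(r"\bl\.?\s*(?=\d)", re.IGNORECASE)


def expand_level_prefix(address: str) -> str:
    """Rewrite "L10", "L.10" or "L 10" as "level 10"."""
    return LEVEL_PREFIX.sub("level ", address)


def has_duplicate_label(parsed: Parsed, label: str = "road") -> bool:
    """Return True if more than one span carries the given label."""
    return Counter(lbl for _, lbl in parsed)[label] > 1


def parse_with_level_fallback(
    address: str, parse: Callable[[str], Parsed]
) -> Parsed:
    """Parse as given; retry with the prefix expanded only if the
    first result looks inconsistent (two spans labelled "road")."""
    first = parse(address)
    if not has_duplicate_label(first):
        return first
    rewritten = expand_level_prefix(address)
    if rewritten == address:
        # Nothing to rewrite, so a second parse would give the same result.
        return first
    return parse(rewritten)
```

With the Python bindings, `parse` would presumably be `postal.parser.parse_address`, which as far as I know returns a list of (value, label) tuples; treat that as an assumption and check it against your installed version. Retrying only on an inconsistent parse keeps inputs that already parse cleanly from being rewritten.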
