
c/o (care of) in addresses are identified as road #607

Open
futurewebpn opened this issue Nov 10, 2022 · 2 comments

@futurewebpn

Hi!

I was checking out libpostal, and saw something that could be improved.


My country is

Austria

Here's how I'm using libpostal

REST-API

Here's what I did

Futureweb GmbH, c/o Patrick Neuner, Innsbruckerstraße 7, 6380 St. Johann in Tirol, Österreich


Here's what I got

[{"label":"house","value":"futureweb gmbh"},{"label":"road","value":"c/o patrick neuner innsbruckerstraße"},{"label":"house_number","value":"7"},{"label":"postcode","value":"6380"},{"label":"city","value":"st. johann in tirol"},{"label":"country","value":"österreich"}]


Here's what I was expecting

c/o Patrick Neuner should be part of house and not part of road (or go in a dedicated care-of field).


For parsing issues, please answer "yes" or "no" to all that apply.

yes, but without c/o.

  • Do all the toponyms exist in OSM (city, state, region names, etc.)?

yes

  • If the address uses a rare/uncommon format, does changing the order of the fields yield the correct result?

no, https://de.wikipedia.org/wiki/Zustellanweisung

  • If the address does not contain city, region, etc., does adding those fields to the input improve the result?

no, we tried removing and adding them; as soon as c/o is used, it is labeled as road.

  • If the address contains apartment/floor/sub-building information or uncommon formatting, does removing that help? Is there any minimum form of the address that gets the right parse?

yes, removing c/o completely makes it work.

Here's what I think could be improved

Adding c/o detection.

@futurewebpn
Author

It might also be that certain initials cause this problem; I also saw it happen with:
Futureweb GmbH, C.A. Patrick, Innsbruckerstraße 7, 6380 St. Johann in Tirol, Österreich
Setting the language doesn't change anything.

@albarrentine
Contributor

Though many people seem to use libpostal on company/mailing addresses, it's not really trained on recipient information (maybe venue/business/POI names, but not so much mailing-specific details like individual recipients, divisions/departments, directions, etc.). In particular, the training addresses we have come from OpenStreetMap, where they are usually not attached to individual people, just the address and maybe the venue/business name. I considered generating "c/o" information for the training set, but that would mean either using a data set that attaches people to addresses (lots of privacy concerns with that) or generating names, which is a pretty major task. Most libraries that generate names, e.g. for testing, tend to be heavily biased toward American names, so I would have to find some sort of wide-coverage Census data to sample names from.

If the input is mostly well-structured/comma-separated and from the same country, splitting out the "c/o" component with a simple regex could work. Another, more generic way to do this without a regex would be to split by comma and move backward through the string: parse the last phrase first, then from the second-to-last phrase to the end, then from the previous one to the end, and so on, tracking the labels and phrases until something changes. Then throw out the phrase that created the inconsistency and keep moving.
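A minimal sketch of the regex idea, assuming the input is comma-separated and the care-of phrase starts with a literal "c/o". The function name `split_care_of` and the pattern are illustrative assumptions, not part of libpostal; the remainder would then be passed to the parser as usual.

```python
import re

# Assumption: a care-of phrase is a comma-separated phrase that begins
# with "c/o" followed by whitespace (case-insensitive).
CARE_OF = re.compile(r"c/o\s+\S", re.IGNORECASE)


def split_care_of(address):
    """Return (address without care-of phrases, list of care-of phrases)."""
    phrases = [p.strip() for p in address.split(",") if p.strip()]
    care_of = [p for p in phrases if CARE_OF.match(p)]
    rest = [p for p in phrases if not CARE_OF.match(p)]
    return ", ".join(rest), care_of
```

This only catches the explicit "c/o" marker; it would not catch a case like "C.A. Patrick", which is where the more generic approach comes in.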

For instance, parsing progressively longer suffixes of the second example:

> Österreich

Result:

{
  "country": "österreich"
}

> 6380 St. Johann in Tirol, Österreich

Result:

{
  "postcode": "6380",
  "city": "st. johann in tirol",
  "country": "österreich"
}

> Innsbruckerstraße 7, 6380 St. Johann in Tirol, Österreich

Result:

{
  "road": "innsbruckerstraße",
  "house_number": "7",
  "postcode": "6380",
  "city": "st. johann in tirol",
  "country": "österreich"
}

> C.A. Patrick, Innsbruckerstraße 7, 6380 St. Johann in Tirol, Österreich

Result:

{
  "road": "c.a. patrick innsbruckerstraße",
  "house_number": "7",
  "postcode": "6380",
  "city": "st. johann in tirol",
  "country": "österreich"
}

> FutureWeb GmbH, Innsbruckerstraße 7, 6380 St. Johann in Tirol, Österreich

Result:

{
  "house": "futureweb gmbh",
  "road": "innsbruckerstraße",
  "house_number": "7",
  "postcode": "6380",
  "city": "st. johann in tirol",
  "country": "österreich"
}

Here, once you add "C.A. Patrick", the parse stops being consistent with what it returned previously. That could be because it's actually part of the road name, but if you're sure that each comma-separated phrase should be a distinct component (or maybe commas are fine within "house" but not other places), that might be a place to throw it out and continue through the rest of the phrases.
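The backward walk described above could be sketched roughly like this. It is only an illustration under a few assumptions: the helper name `drop_inconsistent_phrases` is made up, the `parse` argument is any callable returning `(value, label)` pairs (which is the shape I'd expect from `postal.parser.parse_address` in the Python bindings; a REST response would need converting first), and "inconsistent" is taken to mean that a label seen in the previous parse changed its value or disappeared.

```python
def drop_inconsistent_phrases(address, parse):
    """Walk comma-separated phrases from last to first, dropping any phrase
    whose addition changes labels that earlier (shorter) parses agreed on.

    `parse` is an assumed callable: str -> list of (value, label) pairs.
    Returns (cleaned address, list of dropped phrases).
    """
    phrases = [p.strip() for p in address.split(",") if p.strip()]
    kept = []      # accepted phrases, in original order
    dropped = []   # rejected phrases, in original order
    labels = {}    # label -> value from the last consistent parse

    for phrase in reversed(phrases):
        candidate = ", ".join([phrase] + kept)
        # Simplification: repeated labels in one parse collapse to the last.
        result = {label: value for value, label in parse(candidate)}
        consistent = all(result.get(k) == v for k, v in labels.items())
        if consistent:
            kept.insert(0, phrase)
            labels = result
        else:
            dropped.insert(0, phrase)

    return ", ".join(kept), dropped
```

With the parses shown above, "C.A. Patrick" would be dropped because it changes the `road` value, while "FutureWeb GmbH" would be kept because it only adds a new `house` label. Note this costs one parser call per phrase, and it will also throw out a phrase that legitimately belongs to the road name, so it only makes sense if each comma-separated phrase is expected to be its own component.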
