CZECH REPUBLIC address parsing - Abbreviation in House Name is parsed wrongly as "City" #653

Open
souravsinhacse opened this issue Jan 31, 2024 · 1 comment

@souravsinhacse

I am using the Docker image https://hub.docker.com/r/clicksend/libpostal-rest

When parsing a Czech Republic address, the company abbreviation "A.S." at the end of the house name is wrongly labeled as "city":

curl -X POST -d '{"query": "Namesti Republicky 2090/3a PRAHA HVB BANK CZECH REPUBLIC A.S. CZECH REPUBLIC"}' localhost:8080/parser

[{"label":"road","value":"namesti republicky"},
{"label":"house_number","value":"2090/3a"},
{"label":"city","value":"praha"},
{"label":"house","value":"hvb bank czech republic"},
{"label":"city","value":"a.s."},
{"label":"country","value":"czech republic"}]

@albarrentine
Contributor

The structure is probably not in the training data for the Czech Republic. This ordering works and is closer to the addresses the model trains on:

"HVB BANK CZECH REPUBLIC A.S. Namesti Republicky 2090/3a PRAHA CZECH REPUBLIC"

You can programmatically check whether you get, for instance, two different city tokens or a company suffix labeled as a city, and then try rearranging the input and reparsing. If the pattern happens often, you can simply concatenate "a.s." onto the previous "house" string whenever it is encountered as a "city". Model parses are never going to be perfect in every case, so it is often good to encode some guardrails, or to forward the few parses that seem to have logical inconsistencies (like two different city tokens) for manual review.
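As a rough illustration of the guardrail idea, here is a minimal Python sketch that post-processes the JSON list returned by the `/parser` endpoint. The function names and the `COMPANY_SUFFIXES` set are hypothetical and not part of libpostal; the suffix list in particular is an assumption you would need to extend for your own data. It only merges a suffix when the component immediately before it is labeled "house".

```python
# Hypothetical guardrail for libpostal parser output (not part of libpostal).
# Assumption: COMPANY_SUFFIXES is an illustrative, incomplete list.
COMPANY_SUFFIXES = {"a.s.", "s.r.o."}


def merge_suffix_cities(components):
    """Append a company suffix labeled "city" onto the preceding "house" value."""
    merged = []
    for comp in components:
        is_suffix = (
            comp["label"] == "city"
            and comp["value"].lower() in COMPANY_SUFFIXES
        )
        if is_suffix and merged and merged[-1]["label"] == "house":
            # Fold the suffix into the house name instead of keeping it as a city.
            merged[-1] = {
                "label": "house",
                "value": merged[-1]["value"] + " " + comp["value"],
            }
        else:
            merged.append(dict(comp))
    return merged


def needs_review(components):
    """Flag parses that contain more than one distinct city value."""
    cities = {c["value"] for c in components if c["label"] == "city"}
    return len(cities) > 1


# Parser output copied from the report above.
parsed = [
    {"label": "road", "value": "namesti republicky"},
    {"label": "house_number", "value": "2090/3a"},
    {"label": "city", "value": "praha"},
    {"label": "house", "value": "hvb bank czech republic"},
    {"label": "city", "value": "a.s."},
    {"label": "country", "value": "czech republic"},
]

fixed = merge_suffix_cities(parsed)
print(fixed)
print("needs manual review:", needs_review(fixed))
```

Parses that still trip `needs_review` after the merge could be rearranged and reparsed, or queued for manual review.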
