Postcode recognition in France #638

Open
lamquiem opened this issue Aug 17, 2023 · 2 comments


@lamquiem

Hi!

I was checking out libpostal, and saw something that could be improved.


My country is France


Here's how I'm using libpostal

We use libpostal to parse addresses before searching with elasticsearch.

Here's what I did

from postal.parser import parse_address
parse_address('1 rue saint roch 2B238 poggio-di-venaco', language='fr', country='fr')


Here's what I got

[('1', 'house_number'),
('rue saint roch 2b238', 'road'),
('poggio-di-venaco', 'city')]


Here's what I was expecting

[('1', 'house_number'),
('rue saint roch', 'road'),
('2b238','postcode'),
('poggio-di-venaco', 'city')]


For parsing issues, please answer "yes" or "no" to all that apply.

  • Does the input address exist in OpenStreetMap?
    no
  • Do all the toponyms exist in OSM (city, state, region names, etc.)?
    yes
  • If the address uses a rare/uncommon format, does changing the order of the fields yield the correct result?
    no

Here's what I think could be improved

Is it possible to specify that French postcodes are of the form (\d[0-9aAbB]\d{3}) when parsing?
The codes '2A' and '2B' correspond to the two Corsican departments in France. OpenStreetMap treats them as '20', but this is not the reality.
Is it possible to set libpostal to recognise this form of regex?

@prigaux

prigaux commented Jan 23, 2024

https://fr.wikipedia.org/wiki/Poggio-di-Venaco says the postcode is 20250. 2B238 seems to be the INSEE code?

@albarrentine
Contributor

albarrentine commented Feb 14, 2024

Yes, my guess is that this postcode format doesn't exist in the training data. (You can type .print_features in the address_parser CLI and then try an address to see what the model is doing and where it might get stuck.) Libpostal is not based on regex, other than to split strings into words. Using 20250 works, for instance, because it is a common postcode format, and we also have some geographic context dictionaries that help identify postal codes from known geographic contexts (which probably include the 20250 version as well).

1 rue saint roch 20250 poggio-di-venaco

{
  "house_number": "1",
  "road": "rue saint roch",
  "postcode": "20250",
  "city": "poggio-di-venaco"
}

You can use a regex to extract/remove postcodes following that pattern and reparse the remainder; something like the following will usually also work. If you're sending the result to Elasticsearch, you can add the extracted postcode back in if needed (a postcode may be more selective than a city, etc.).

1 rue saint roch poggio-di-venaco

{
  "house_number": "1",
  "road": "rue saint roch",
  "city": "poggio-di-venaco"
}
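
For reference, the extract-and-reparse workaround described above could be sketched in Python roughly as follows. This is only an illustration, not part of libpostal: the helper names `split_code` and `parse_with_code` are hypothetical, and the pattern is an assumption that narrows the regex proposed in the issue to the `2A`/`2B` prefixes, so that ordinary five-digit postcodes are still left for libpostal to handle. The parser is passed in as a callable, so `postal.parser.parse_address` from pypostal can be supplied where it is installed.

```python
import re

# Assumed pattern (narrowed from the one proposed in the issue): "2", then
# A or B in either case, then three digits, e.g. 2B238. The word boundaries
# stop it from matching inside a longer token.
CODE_RE = re.compile(r"\b2[AaBb]\d{3}\b")


def split_code(address):
    """Return (code, remainder); code is None when the pattern is absent."""
    match = CODE_RE.search(address)
    if match is None:
        return None, address
    remainder = address[:match.start()] + address[match.end():]
    # Collapse the extra whitespace left where the code was removed.
    return match.group(0), " ".join(remainder.split())


def parse_with_code(address, parse):
    """Parse the address without the code, then append the code as a postcode.

    `parse` is any callable returning (value, label) pairs, for example
    postal.parser.parse_address.
    """
    code, remainder = split_code(address)
    components = list(parse(remainder))
    if code is not None:
        components.append((code, "postcode"))
    return components
```

With pypostal available, a call such as `parse_with_code('1 rue saint roch 2B238 poggio-di-venaco', parse_address)` would parse `1 rue saint roch poggio-di-venaco` and then add `('2B238', 'postcode')` to the result, which can be sent on to Elasticsearch alongside the other components.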
