Simplify on repeated highergeography #49

Open
tucotuco opened this issue Jul 29, 2022 · 0 comments

Comments

@tucotuco
Member

As with locality and verbatimLocality, the content of higherGeography often overlaps with the administrative geography fields. The former is often constructed from the latter using one pattern or another. Where higherGeography carries no information beyond what those fields already contain, ignoring it could improve matching.

There are 1,395,258 distinct highergeography values in gazetteer.locations_distinct_with_scores. Only 46,776 distinct highergeography records have nothing in the other geography fields.

It would be useful to leave highergeography out of the matching strings whenever everything in it is already present in the other geography fields.
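
A minimal sketch of that check, in Python, might look like the following. It is only an illustration under stated assumptions: the field list is the set of Darwin Core administrative geography terms, the delimiters (`|`, `;`, `,`) are a guess at the patterns used to build higherGeography, and the function names (`highergeography_is_redundant`, `fields_for_matching`) are hypothetical and not part of this project's code.

```python
import re

# Assumption: these Darwin Core terms are the "geography fields" meant above.
GEOGRAPHY_FIELDS = (
    "continent", "waterBody", "islandGroup", "island",
    "country", "stateProvince", "county", "municipality",
)


def _parts(value):
    """Split a value on assumed delimiters; normalise whitespace and case."""
    pieces = re.split(r"[|;,]", value or "")
    return {" ".join(p.split()).casefold() for p in pieces if p.strip()}


def highergeography_is_redundant(record):
    """True if every part of higherGeography appears in another geography field."""
    higher = _parts(record.get("higherGeography"))
    if not higher:
        return True  # empty higherGeography: nothing is lost by ignoring it
    others = set()
    for field in GEOGRAPHY_FIELDS:
        others |= _parts(record.get(field))
    return higher <= others


def fields_for_matching(record):
    """Return the non-empty fields, dropping higherGeography when redundant."""
    out = {k: v for k, v in record.items() if v}
    if highergeography_is_redundant(record):
        out.pop("higherGeography", None)
    return out
```

With this sketch, a record whose higherGeography is `South America | Argentina | Río Negro` and whose continent, country, and stateProvince hold the same three values would be matched without higherGeography, while a record whose higherGeography also mentions something absent from the other fields would keep it. Exact part-by-part comparison is a deliberately conservative choice: it only drops higherGeography when nothing at all would be lost.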
