Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically calculate country codes per quadkey & remove country_iso flag #29

Open
cholmes opened this issue Oct 9, 2023 · 2 comments
Labels
get_buildings Issues related to the get_buildings operations help wanted Extra attention is needed interesting Issues that are interesting problems that could be fun to take on
Milestone

Comments

@cholmes
Copy link
Collaborator

cholmes commented Oct 9, 2023

It seems like it should be possible to automatically include a country_iso to substantially speed up the query. The current method of having the user supply it is potentially error prone, and annoying.

The idea would be to calculate the list of country_iso values for every single quadkey. This would have to be a list, since quadkeys can cross countries, and big ones could have a hundred or more countries in them. But most should be one or a handful of countries, which will most always speed up the query.

There are 16 million quadkeys at level 12, but many are likely in the ocean. We likely could use a quadkey at level 10 or even 8, as having a couple more hive partitions to help wouldn't still make it worth it.

So I think the main thing would be to make a script that generates a list of country iso codes for every quadkey. Then store that as a parquet file, and if it's not too big we could likely just include it in the open_buildings package.

If we had this then we could remove the country_iso flag, as we'd be able to always use a hive partition.

@cholmes cholmes added get_buildings Issues related to the get_buildings operations interesting Issues that are interesting problems that could be fun to take on labels Oct 9, 2023
@cholmes cholmes added this to the 0.10.0 milestone Oct 9, 2023
@cholmes cholmes added the help wanted Extra attention is needed label Oct 9, 2023
@mtravis
Copy link
Contributor

mtravis commented Oct 9, 2023

@cholmes is it worth thinking about using the dataset-links.csv here https://github.com/microsoft/GlobalMLBuildingFootprints#what-does-the-data-include

We could cross ref that with another list that has the iso-2 codes perhaps?

@cholmes
Copy link
Collaborator Author

cholmes commented Oct 9, 2023

Interesting - yeah, that dataset will be perfect for querying the microsoft buildings as in #27 as it doesn't use iso-2 code, it uses the region name. So for that we need the lookup from quadkey to region name. I'm curious what that one does if the quadkey overlaps two regions - my guess is it just uses the centroid as they use region for the hive column so you wouldn't really want too. It's at zoom 9 I think, but good to know that it's only 6.3mb.

I think it may make sense to start mostly from scratch, as I think we want all possible countries (or regions) that are in a quadkey. Starting at zoom 9 probably makes good sense. I think the ideal is columns for iso-2, region and quadkey. I don't think it should be too hard to generate this with a script - just iterate through every quadkey and if it overlaps with one or more countries then add a row with the quadkey and the country codes. It should correspond exactly to whatever country file was used to create the data. That reminds me I should upload the one I've been using to source.coop - I just made it from overture using their recommended way in the docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
get_buildings Issues related to the get_buildings operations help wanted Extra attention is needed interesting Issues that are interesting problems that could be fun to take on
Projects
None yet
Development

No branches or pull requests

2 participants