Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve quadtree splitting rules #148

Open
ResidentMario opened this issue Jul 5, 2019 · 0 comments
Open

Improve quadtree splitting rules #148

ResidentMario opened this issue Jul 5, 2019 · 0 comments

Comments

@ResidentMario
Copy link
Owner

The current QuadTree implementation (the core of gplt.quadtree) only performs a split if every subpartition satisfies the splitting rules. This was done because this algorithm is easy to implement this way.

However, it has serious limitations for visualization purposes. For interpretability, the best splitting rule to use would be nmin which would guarantee that every sub-partition contains at least n observations. This combined with the current splitting rules is fine if the data is relatively homogeneously distributed. However, when there are gaps in the data due to the underlying topography, this causes the plot to miss splits, which ruins the whole "more splits means more data" thing gplt.quadtree has going on:

import geopandas as gpd
import geoplot as gplt
import geoplot.crs as gcrs
import matplotlib.pyplot as plt

nyc_boroughs = gpd.read_file(gplt.datasets.get_path('nyc_boroughs'))
collisions = gpd.read_file(gplt.datasets.get_path('nyc_collision_factors'))

ax = gplt.quadtree(
    collisions, nmin=1, projection=gcrs.AlbersEqualArea(),
    clip=nyc_boroughs,
    edgecolor='black', facecolor='None', zorder=2,
)
gplt.pointplot(collisions, s=1, ax=ax)

plt.title("New York Ciy Traffic Collisions, 2016")

foo

The documentation currently gets around this by recommending you use nmax instead, but nmax is just not that good a splitting rule:

  • It doesn't tell you anything about the number of observations in the partition, so it's less interpretable.
  • It's much less detailed overall, with the same volume of data.

Here's the nmax equivalent:

import geopandas as gpd
import geoplot as gplt
import geoplot.crs as gcrs
import matplotlib.pyplot as plt

nyc_boroughs = gpd.read_file(gplt.datasets.get_path('nyc_boroughs'))
collisions = gpd.read_file(gplt.datasets.get_path('nyc_collision_factors'))

ax = gplt.quadtree(
    collisions, nmax=1, projection=gcrs.AlbersEqualArea(),
    clip=nyc_boroughs,
    edgecolor='black', facecolor='None', zorder=2,
)
gplt.pointplot(collisions, s=1, ax=ax)

plt.title("New York Ciy Traffic Collisions, 2016")

foo

The solution is to change the partitioning algorithm. Instead of splitting when all four sub-partition meets the splitting criteria, consolidate contiguous sub-partitions that don't meet the criteria until you have n >= 2 that do. E.g. generate rectangular and/or L-shaped partitions.

There are a couple of reasons why this is an involved change:

  • The current code passes around (xmin, xmax, ymin, ymax) values, we now have to design an abstraction that handles irregularly shaped partitions instead.
  • This is compute-intensive code. The result has to be relatively well-optimized.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant