Improve quadtree splitting rules #148

ResidentMario · 2019-07-05T16:24:18Z

The current QuadTree implementation (the core of gplt.quadtree) only performs a split if every subpartition satisfies the splitting rules. This was done because this algorithm is easy to implement this way.

However, it has serious limitations for visualization purposes. For interpretability, the best splitting rule to use would be nmin which would guarantee that every sub-partition contains at least n observations. This combined with the current splitting rules is fine if the data is relatively homogeneously distributed. However, when there are gaps in the data due to the underlying topography, this causes the plot to miss splits, which ruins the whole "more splits means more data" thing gplt.quadtree has going on:

import geopandas as gpd
import geoplot as gplt
import geoplot.crs as gcrs
import matplotlib.pyplot as plt

nyc_boroughs = gpd.read_file(gplt.datasets.get_path('nyc_boroughs'))
collisions = gpd.read_file(gplt.datasets.get_path('nyc_collision_factors'))

ax = gplt.quadtree(
    collisions, nmin=1, projection=gcrs.AlbersEqualArea(),
    clip=nyc_boroughs,
    edgecolor='black', facecolor='None', zorder=2,
)
gplt.pointplot(collisions, s=1, ax=ax)

plt.title("New York Ciy Traffic Collisions, 2016")

The documentation currently gets around this by recommending you use nmax instead, but nmax is just not that good a splitting rule:

It doesn't tell you anything about the number of observations in the partition, so it's less interpretable.
It's much less detailed overall, with the same volume of data.

Here's the nmax equivalent:

import geopandas as gpd
import geoplot as gplt
import geoplot.crs as gcrs
import matplotlib.pyplot as plt

nyc_boroughs = gpd.read_file(gplt.datasets.get_path('nyc_boroughs'))
collisions = gpd.read_file(gplt.datasets.get_path('nyc_collision_factors'))

ax = gplt.quadtree(
    collisions, nmax=1, projection=gcrs.AlbersEqualArea(),
    clip=nyc_boroughs,
    edgecolor='black', facecolor='None', zorder=2,
)
gplt.pointplot(collisions, s=1, ax=ax)

plt.title("New York Ciy Traffic Collisions, 2016")

The solution is to change the partitioning algorithm. Instead of splitting when all four sub-partition meets the splitting criteria, consolidate contiguous sub-partitions that don't meet the criteria until you have n >= 2 that do. E.g. generate rectangular and/or L-shaped partitions.

There are a couple of reasons why this is an involved change:

The current code passes around (xmin, xmax, ymin, ymax) values, we now have to design an abstraction that handles irregularly shaped partitions instead.
This is compute-intensive code. The result has to be relatively well-optimized.

The text was updated successfully, but these errors were encountered:

ResidentMario added feature enhancement and removed feature labels Jul 5, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve quadtree splitting rules #148

Improve quadtree splitting rules #148

ResidentMario commented Jul 5, 2019

Improve quadtree splitting rules #148

Improve quadtree splitting rules #148

Comments

ResidentMario commented Jul 5, 2019