
[WIP] initial draft of network weights #356

Open · wants to merge 9 commits into main
Conversation

knaaptime (Member)

This starts a new weights module that creates a neighbor relationship based on the shortest path through a pandana network.

Currently the weights are raw network distances, so we still need to include some inverse-distance weighting functions (a sketch of the idea is below), but this is a start for generating feedback.
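For reference, a minimal sketch of the kind of inverse-distance weighting meant here, assuming `adj` is the travel-cost adjacency list built by the module with a `cost` column (the kernels and constants are illustrative, not part of the PR):

```python
import numpy as np

# convert raw shortest-path distances into decaying weights
adj["weight"] = 1.0 / (adj["cost"] + 1.0)     # inverse-distance (power) kernel
adj["weight"] = np.exp(-0.001 * adj["cost"])  # or a negative-exponential kernel; beta assumed
```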

stuartlynn

Some thoughts about handling situations where the origin or destination dataframes are polygons rather than points.

Currently, I think the approach is to use the geometry centroid as the point geometry and snap that to the closest node of the network.

One can imagine other strategies that might be more appropriate in other cases. For example

  1. Using a representative point on the polygon instead of a centroid
  2. If the origin polygons have population-like data and we can assume that data is uniform across the polygon, assigning some weighted fraction of the origin polygon data to each network node that the polygon intersects.

It would be interesting to include a mechanism to specify different strategies for assigning polygon data to the network. Perhaps a function that can be passed into the from_dataframe function, with a signature similar to:

```python
def assign_polygon_data_to_network(polygon, intersecting_nodes):
    """
    Parameters
    ----------
    polygon : DataFrame
        The origin or destination polygon plus its properties.
    intersecting_nodes : list
        The network nodes that intersect with the polygon.

    Returns
    -------
    DataFrame
        The nodes to be used, with an associated weight value for each.
    """
    return nodes, weights
```

An implementation of this function for centroid assignment would simply find the closest intersecting node to the polygon centroid and return it with a weight of 1, while an implementation assigning equal weights would return the full list of intersecting_nodes with 1/len(intersecting_nodes) as the weight for each. Both are sketched below.
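A minimal sketch of those two strategies, assuming `polygon` is a shapely geometry and `intersecting_nodes` is a GeoSeries of point geometries indexed by node id (these names and types are illustrative, not part of the PR):

```python
def assign_by_centroid(polygon, intersecting_nodes):
    # all weight goes to the single node nearest the polygon centroid
    nearest = intersecting_nodes.distance(polygon.centroid).idxmin()
    return [nearest], [1.0]

def assign_equal_weights(polygon, intersecting_nodes):
    # spread the weight uniformly across every intersecting node
    n = len(intersecting_nodes)
    return list(intersecting_nodes.index), [1.0 / n] * n
```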

This probably isn't quite the right way to do this, but hopefully it gets the idea across and can spark some discussion.

codecov bot commented Oct 25, 2020

Codecov Report

Merging #356 into master will decrease coverage by 0.91%.
The diff coverage is 12.22%.


```
@@            Coverage Diff             @@
##           master     #356      +/-   ##
==========================================
- Coverage   81.40%   80.48%   -0.92%
==========================================
  Files         117      118       +1
  Lines       11920    12075     +155
==========================================
+ Hits         9703     9719      +16
- Misses       2217     2356     +139
```
| Impacted Files | Coverage Δ |
|---|---|
| libpysal/weights/network.py | 11.23% <11.23%> (ø) |
| libpysal/weights/__init__.py | 100.00% <100.00%> (ø) |
| libpysal/cg/shapely_ext.py | 19.01% <0.00%> (-4.92%) ⬇️ |
| libpysal/cg/locators.py | 60.26% <0.00%> (-0.29%) ⬇️ |
| libpysal/io/util/tests/test_weight_converter.py | 16.12% <0.00%> (-0.18%) ⬇️ |
| libpysal/cg/tests/test_rtree.py | 95.65% <0.00%> (-0.10%) ⬇️ |
| libpysal/cg/tests/test_locators.py | 96.22% <0.00%> (-0.07%) ⬇️ |
| libpysal/common.py | 82.69% <0.00%> (ø) |
| libpysal/cg/rtree.py | 91.64% <0.00%> (ø) |
| libpysal/cg/kdtree.py | 67.85% <0.00%> (ø) |

... and 97 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update f24ebba...e1269a6.

knaaptime (Member, Author)

As @stuartlynn raises above, centroids are sketchy, so it will be cool to consider other options for attaching polygons to the network.

jGaboardi (Member) commented Oct 25, 2020

> As @stuartlynn raises above, centroids are sketchy, so it will be cool to consider other options for attaching polygons to the network.

This is what I did my dissertation on (see pp2n). I am planning to make it part of spaghetti eventually, and definitely here, too, if desired.

stuartlynn

@jGaboardi that looks great!

knaaptime (Member, Author)

The error looks to be specific to Windows/fiona/py3.6, but otherwise I think this is basically passing.

jGaboardi (Member)

> The error looks to be specific to Windows/fiona/py3.6, but otherwise I think this is basically passing.

Yes, and I think specific to the feedstock. I opened an issue over there last week.

conda-forge/fiona-feedstock#171

martinfleis (Member) left a comment

I feel that the most important part here will be the integration of NetworkW with the rest of the ecosystem, i.e. a simple way of creating a pandana.Network from a LineString GeoDataFrame, an OSMnx-like networkx.Graph, a momepy-like networkx.Graph, a spaghetti.Network, and likely some other options (a sketch of one such constructor follows below).
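For example, a hedged sketch of building a pandana.Network from node/edge GeoDataFrames such as those returned by osmnx.graph_to_gdfs; the column names here are assumptions about the input, not an agreed API:

```python
import pandana

def network_from_gdfs(nodes, edges):
    # nodes: indexed by node id, with "x"/"y" coordinate columns
    # edges: with "u"/"v" node-id columns and a "length" impedance column
    return pandana.Network(
        nodes["x"], nodes["y"],
        edges["u"], edges["v"],
        edges[["length"]],
    )
```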

Btw, do we know what the performance of pandana's shortest paths is vs networkx? I guess it is much faster since it is vectorised C code, but I haven't found a benchmark.

Comment on lines 111 to 116

```python
origins["osm_ids"] = network.get_node_ids(
    origins.centroid.x, origins.centroid.y
).astype(int)
destinations["osm_ids"] = network.get_node_ids(
    destinations.centroid.x, destinations.centroid.y
).astype(int)
```
Member

We could check if origins == destinations and copy osm_ids in that case. I guess this can be a costly operation, so it is worth skipping if we can.

knaaptime (Member, Author)

> Do we know what the performance of pandana's shortest paths is vs networkx? I guess it is much faster since it is vectorised C code, but I haven't found a benchmark.

I think I can find it somewhere. It's not only that pandana is vectorized C; it also uses contraction hierarchies instead of, e.g., Dijkstra.

knaaptime (Member, Author)

> We could check if origins == destinations and copy osm_ids in that case. I guess this can be a costly operation, so it is worth skipping if we can.

That's a really good point. I originally wrote this for use with access, where origins and destinations aren't always the same, but I think in the context of a W they are, so we can skip the second call to get_node_ids.

Suggested change

```diff
 origins["osm_ids"] = network.get_node_ids(
     origins.centroid.x, origins.centroid.y
 ).astype(int)
 destinations["osm_ids"] = network.get_node_ids(
     destinations.centroid.x, destinations.centroid.y
 ).astype(int)
+
+destinations['idx'] = destinations.index.values
+index_dest = 'idx'
```
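A minimal sketch of the skip described above; the identity check and names are illustrative, not code from the PR:

```python
# snap the origins once, then reuse the node ids when both frames are the same object
same = origins is destinations
origins["osm_ids"] = network.get_node_ids(
    origins.centroid.x, origins.centroid.y
).astype(int)
if not same:  # only snap destinations when they actually differ from origins
    destinations["osm_ids"] = network.get_node_ids(
        destinations.centroid.x, destinations.centroid.y
    ).astype(int)
```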

```python
# I don't think there's a way to do this in parallel, so we can at least show a progress bar
```
Member

You could have origins["osm_ids"] as a dask.Series and apply over it in parallel, but that may be overkill. Isn't network.shortest_path_lengths parallelised in pandana?

knaaptime (Member, Author)

I've been wondering whether this is dask-able for massive datasets. I'm still very curious to see if it works and how performant it could be, so I may try out some explorations.

Member

Partially. You will always need to fit the whole Network object in memory, but you can use dask to parallelize that for loop. I don't think it is possible to actually distribute this computation to a cluster, but on a single machine the parallel loop may help. One way to do that is shown here (and sketched below): https://urbangrammarai.github.io/spatial_signatures/measuring/morphometrics.html#generate-spatial-weights-w
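A minimal sketch of that idea, assuming the origin and destination node ids are already snapped; the chunk count and helper names are illustrative:

```python
import dask
import numpy as np

def chunk_lengths(network, origin_ids, dest_ids):
    # pairwise shortest-path lengths for one chunk of origins
    out = []
    for o in origin_ids:
        out.append(network.shortest_path_lengths([o] * len(dest_ids), dest_ids))
    return np.concatenate(out)

dest_ids = destinations["osm_ids"].to_numpy()
# split the origins into chunks and evaluate them as parallel dask tasks;
# the default threaded scheduler shares the single in-memory Network object
tasks = [
    dask.delayed(chunk_lengths)(network, chunk, dest_ids)
    for chunk in np.array_split(origins["osm_ids"].to_numpy(), 8)
]
costs = np.concatenate(dask.compute(*tasks))
```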

```python
return feeds


class NetworkW(W):
```
Member

I know this PR is a WIP, but would it be possible to clean up the docstrings to conform with pysal/pysal#1174 once it is nearing non-WIP status?

knaaptime (Member, Author)

Definitely.

Member

I can also work on this once things are more concrete.

jGaboardi added the enhancement and WIP (Work in progress, do not merge. Discussion only.) labels on Oct 26, 2020
Comment on lines 79 to 81

```python
adj = compute_travel_cost_adjlist(df, df, network, index_orig=ids, index_dest=ids)
if max_dist:
    adj = adj[adj['cost'] <= max_dist]
```
Member

One thing I just realised about this: adj is a dense matrix. If you set max_dist, you still have to create that dense matrix first, which means you can quickly run out of memory. Would it make sense to do this check within the loop in compute_travel_cost_adjlist? It may be less efficient performance-wise, but it would not blow up on memory (a sketch of the in-loop filter follows below).

I was just thinking about my data with hundreds of thousands of buildings, for which I'd like to get a network W, realising this would probably kill the machine.

On a related note, it may be worth checking the option to generate W on demand (for all constructors) to avoid these memory issues altogether.
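A hedged sketch of filtering inside the loop rather than afterwards; the function name follows the snippet above, but the loop body and column names are illustrative:

```python
import pandas as pd

def compute_travel_cost_adjlist(origins, destinations, network, max_dist=None):
    # hypothetical variant: drop pairs beyond max_dist as they are produced,
    # so the full dense adjacency list never has to exist in memory
    dest_ids = destinations["osm_ids"].to_numpy()
    rows = []
    for o in origins["osm_ids"]:
        costs = network.shortest_path_lengths([o] * len(dest_ids), dest_ids)
        for d, cost in zip(dest_ids, costs):
            if max_dist is None or cost <= max_dist:
                rows.append((o, d, cost))
    return pd.DataFrame(rows, columns=["origin", "destination", "cost"])
```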

Member

Yes, definitely. This strategy works for sparse graphs, but any dense graph will have approximately n^2 entries. No fun...

To make these things built on the fly for a NetworkW, we'd need to make many attributes cached properties. It's definitely doable in the same pattern as the rasterW GSOC project in #343.
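A minimal sketch of that cached-property pattern; the class and attribute shown are illustrative, not the PR's NetworkW:

```python
from functools import cached_property

class LazyNetworkW:
    def __init__(self, network, df, max_dist=None):
        self._network = network
        self._df = df
        self._max_dist = max_dist

    @cached_property
    def adjlist(self):
        # computed on first access, then cached on the instance
        return compute_travel_cost_adjlist(
            self._df, self._df, self._network, max_dist=self._max_dist
        )
```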

knaaptime (Member, Author)

Ah, thanks, good point. I think moving that check up into the loop is probably the best way to go, but it's worth noting that I sketched out a different way of doing this that used pandana's nearest_pois functionality, before they added the shortest_path method.

The difference between the two approaches has to do with how and when the data get filtered. When using nearest_pois, pandana searches out to the specified distance and returns all the nodes, and their distances, that fall within it. That means we have to set a parameter for the maximum total neighbors (which isn't terribly easy to estimate beforehand), and it calculates pairwise distances for every single node within the distance threshold.

With shortest_paths, we're first filtering out nodes we don't care about, only calculating pairwise distances between the objects we've snapped to the OSM network (so, e.g., going up to census tracts changes the resolution dramatically), but then we have to filter by distance afterward. So in the first case we're including irrelevant nodes inside the distance threshold; in the second, we're including irrelevant nodes outside the threshold (the nearest_pois route is sketched below). I'm also not sure how precomputation may factor into the difference.
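For concreteness, a hedged sketch of the nearest_pois route described above; the category name and the max_dist and max_neighbors values are assumptions:

```python
# register the destinations as POIs, capped at max_neighbors matches per node
network.set_pois(
    "destinations", max_dist, max_neighbors,
    destinations.centroid.x, destinations.centroid.y,
)
# one row per network node; columns 1..max_neighbors hold the distances to the
# nearest destinations within max_dist (plus their ids with include_poi_ids=True)
nearest = network.nearest_pois(
    max_dist, "destinations", num_pois=max_neighbors, include_poi_ids=True
)
```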

The raison d'être for pandana is, more or less, to create spatial lag variables that can be used in urbansim's hedonic and location choice models, and it's designed explicitly to work at the scales you describe, @martinfleis. Only now we have to figure out how to translate that same efficiency into generating the full W object instead of just the lag. I wonder if @fscottfoti or @smmaurer have any thoughts about how best to do this? I think it could be useful for pandana users generally, because once you have the W object generated from the pandana.Network, it lets you calculate different accessibility metrics or change the decay function/constant (e.g. using the generated travel matrix as input to access, which would address issues like this), and those could be incorporated into the UDST workflow as well.


knaaptime (Member, Author)

Thanks to some fantastic help from the UDST folks, the next version of pandana will expose the vectorized range query, so this should reduce to a few lines of really fast code :). I'll take another pass once pandana 0.7 is out!
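For a sense of what that reduction might look like, a hedged sketch using the range query exposed in pandana 0.7 (the nodes_in_range method per the pandana changelog; the exact output columns are an assumption):

```python
# every pair of snapped nodes within max_dist, already filtered by distance
adj = network.nodes_in_range(origins["osm_ids"].unique(), max_dist)
```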

martinfleis changed the base branch from master to main on February 27, 2023.
Labels: enhancement, weights, WIP (Work in progress, do not merge. Discussion only.)

5 participants