[WIP] initial draft of network weights #356
base: main
Conversation
Some thoughts about handling situations where the origin or destination dataframes contain polygons rather than points. Currently, I think the approach is to use the geometry centroid as the point geometry and snap that to the closest node of the network, but one can imagine other strategies that might be more appropriate in other cases.

It would be interesting to include a mechanism for specifying different strategies for assigning polygon data to the network. Perhaps a function that can be passed into the from_dataframe function, with a signature along the lines of:

    def assign_polygon_data_to_network(polygon, intersecting_nodes):
        """
        Parameters
        ----------
        polygon : the origin or destination polygon and its properties, as a DataFrame row
        intersecting_nodes : a list of network nodes that intersect the polygon

        Returns
        -------
        A DataFrame of the nodes to be used, each with an associated weight value
        """
        return nodes, weights

An implementation of this function for centroid assignment would simply find the closest intersecting node to the polygon centroid and return it with a weight of 1, while an implementation assigning equal weights to each of the intersecting nodes would return the list of intersecting_nodes with equal weights. This probably isn't quite the right way to do this, but hopefully it gets the idea across and can spark some discussion.
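As a minimal pure-Python sketch of the two strategies described in this comment (the function names are hypothetical, and plain tuples stand in for GeoDataFrame rows and network nodes):

```python
import math

def centroid_assignment(polygon_centroid, intersecting_nodes):
    """Assign all weight to the node nearest the polygon centroid.

    polygon_centroid: (x, y) tuple for the polygon's centroid.
    intersecting_nodes: list of (node_id, x, y) tuples.
    Returns a list of (node_id, weight) pairs.
    """
    nearest = min(
        intersecting_nodes,
        key=lambda n: math.hypot(n[1] - polygon_centroid[0],
                                 n[2] - polygon_centroid[1]),
    )
    return [(nearest[0], 1.0)]

def equal_assignment(polygon_centroid, intersecting_nodes):
    """Spread the polygon's weight equally over every intersecting node."""
    w = 1.0 / len(intersecting_nodes)
    return [(n[0], w) for n in intersecting_nodes]
```

Either function could then be passed into from_dataframe as the polygon-assignment strategy; other weighting schemes (e.g. area-proportional) would follow the same signature.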
Codecov Report
@@ Coverage Diff @@
## master #356 +/- ##
==========================================
- Coverage 81.40% 80.48% -0.92%
==========================================
Files 117 118 +1
Lines 11920 12075 +155
==========================================
+ Hits 9703 9719 +16
- Misses 2217 2356 +139
Continue to review full report at Codecov.
As @stuartlynn raises above, centroids are sketchy, so it will be cool to consider other options for attaching polygons to the network.

This is what I did my dissertation on (see …)

@jGaboardi that looks great!

The error looks to be specific to windows/fiona/py3.6, but otherwise I think this is basically passing.

Yes, and I think it is specific to the feedstock. I opened an issue over there last week.
I feel that the most important part here will be the integration of NetworkW with the rest of the ecosystem, i.e. a simple way of creating a pandana.Network from a LineString GeoDataFrame, an OSMnx-like networkx.Graph, a momepy-like networkx.Graph, a spaghetti.Network, and likely some other options.

Btw, do we know how the performance of pandana's shortest paths compares with networkx? I would guess it is much faster since it is vectorised C code, but I haven't found any numbers.
libpysal/weights/network.py
Outdated
origins["osm_ids"] = network.get_node_ids(
    origins.centroid.x, origins.centroid.y
).astype(int)
destinations["osm_ids"] = network.get_node_ids(
    destinations.centroid.x, destinations.centroid.y
).astype(int)
We could check if origins == destinations and copy osm_ids in that case. I guess this can be a costly operation, so it is worth skipping if we can.
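A minimal sketch of that guard (snap_to_nodes is a hypothetical helper; get_node_ids here is any callable mapping x/y columns to node ids, standing in for pandana's network.get_node_ids on centroids):

```python
import pandas as pd

def snap_to_nodes(origins, destinations, get_node_ids):
    """Snap origin/destination points to network nodes, skipping the
    second (costly) lookup when both frames are the same."""
    same = destinations is origins or origins.equals(destinations)
    origins = origins.copy()
    origins["osm_ids"] = get_node_ids(origins["x"], origins["y"])
    if same:
        # origins == destinations: reuse the snapped ids
        destinations = origins.copy()
    else:
        destinations = destinations.copy()
        destinations["osm_ids"] = get_node_ids(destinations["x"], destinations["y"])
    return origins, destinations
```

The `is` check is free; `equals` costs a full comparison, but that is still far cheaper than a second spatial-index lookup on a large frame.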
> do we know what is the performance of pandana's shortest paths vs networkx? I guess it is much faster since it is vectorised C code, but I haven't found it.

I think I can find it somewhere. It's not only that pandana is vectorized C, but it also uses contraction hierarchies instead of e.g. Dijkstra.
> We could check if origins == destinations and copy osm_ids in that case. I guess that this can be a costly operation, so it is worth skipping if we can.

That's a really good point. I originally wrote this for use with access, where origins and destinations aren't always the same, but I think in the context of a W they are, so we can skip the second call to get_node_ids.
origins["osm_ids"] = network.get_node_ids(
    origins.centroid.x, origins.centroid.y
).astype(int)
destinations["osm_ids"] = network.get_node_ids(
    destinations.centroid.x, destinations.centroid.y
).astype(int)
destinations['idx'] = destinations.index.values
index_dest = 'idx'

# I don't think there's a way to do this in parallel, so we can at least show a progress bar
You could have origins["osm_ids"] as a dask.Series and apply over it in parallel, but that may be overkill. Isn't network.shortest_path_lengths parallelised in pandana?
I've been wondering whether this is dask-able for massive datasets. Still very curious to see if it works and how performant it could be, so I may try out some explorations.
Partially. You will always need to fit the whole Network object in memory, but you can use dask to parallelize that for loop. I don't think it is possible to actually distribute this computation to a cluster, but on a single machine the parallel loop may help. One way to do that is shown here: https://urbangrammarai.github.io/spatial_signatures/measuring/morphometrics.html#generate-spatial-weights-w
return feeds


class NetworkW(W):
I know this PR is a WIP, but would it be possible to clean up the docstrings to conform with pysal/pysal#1174 once it is nearing non-WIP status?
definitely
I can also work on this once things are more concrete.
libpysal/weights/network.py
Outdated
adj = compute_travel_cost_adjlist(df, df, network, index_orig=ids, index_dest=ids)
if max_dist:
    adj = adj[adj['cost'] <= max_dist]
One thing I just realised about this: adj is a dense matrix. If you set max_dist, you still have to create that dense matrix first, which means that you can quickly run out of memory. Would it make sense to do this check within the loop in compute_travel_cost_adjlist? It may be less efficient performance-wise, but it would not blow up on memory.

I was just thinking about my data with hundreds of thousands of buildings for which I'd like to get a network W, realising this would probably kill the machine.

On a related note, it may be worth checking the option to generate the W on demand (for all constructors) to avoid these memory issues altogether.
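A sketch of what the in-loop filtering could look like (sparse_cost_adjlist and cost are hypothetical stand-ins for the PR's compute_travel_cost_adjlist and a pandana shortest-path call):

```python
def sparse_cost_adjlist(origin_ids, dest_ids, cost, max_dist=None):
    """Build a travel-cost adjacency list, dropping rows over max_dist
    inside the loop so the dense n*m matrix is never materialised."""
    rows = []
    for o in origin_ids:
        for d in dest_ids:
            c = cost(o, d)
            if max_dist is None or c <= max_dist:
                # only within-threshold pairs are ever stored
                rows.append((o, d, c))
    return rows
```

Peak memory is then proportional to the number of surviving pairs rather than n*m, at the cost of filtering row by row instead of in one vectorised pass.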
Yes, definitely. This strategy works for sparse graphs, but any dense graph will have approximately n^2 entries. No fun....

To make these things built on the fly for a NetworkW, we'd need to make many attributes cached properties. It's definitely doable in the same pattern as the rasterW GSOC project in #343
Ah, thanks, good point. I think moving that check up into the loop is probably the best way to go, but it's worth noting that I sketched out a different way of doing this that used pandana's nearest_pois functionality before they added the shortest_path method.

The difference between the two approaches has to do with how/when the data get filtered. When using nearest_pois, pandana searches out to the specified distance and returns all the nodes, and their distances, that fall within it. That means we have to set a parameter for the maximum total neighbors (which isn't terribly easy to estimate beforehand), and it's calculating pairwise distances for every single node within the distance threshold.

With shortest_paths, we're first filtering out nodes we don't care about, only calculating pairwise distances between the objects we've snapped to the OSM network (so, e.g., going up to census tracts changes the resolution dramatically), but then we have to filter by distance afterward. So in the first case we're including irrelevant nodes inside the distance threshold, and in the second case we're including irrelevant nodes outside the threshold. I'm also not sure how precomputation may factor into the difference.
The raison d'être for pandana is, more or less, to create spatial lag variables that can be used in urbansim's hedonic and location choice models, and it's designed explicitly to work at the types of scales you describe @martinfleis -- only now we have to figure out how to translate that same efficiency to generate the full W object instead of just the lag. I wonder if @fscottfoti or @smmaurer have any thoughts about how best to do this? I think it could be useful for pandana users generally, because once you have the W object generated from the pandana.Network, it lets you calculate different accessibility metrics or change the decay function/constant (e.g. using the generated travel matrix as input to access, which would address issues like this), and those could be incorporated into the UDST workflow as well.
libpysal/weights/network.py
Outdated
adj = compute_travel_cost_adjlist(df, df, network, index_orig=ids, index_dest=ids)
if max_dist:
    adj = adj[adj['cost'] <= max_dist]
> Yes, definitely. This strategy works for sparse graphs, but any dense graph will have approximately n^2 entries. No fun....
> To make these things built on the fly for a NetworkW, we'd need to make many attributes be cached properties. it's definitely doable in the same pattern as the rasterW GSOC project in #343

Thanks to some fantastic help from the UDST folks, the next version of pandana will expose the vectorized range query, so this should reduce to a few lines of really fast code :). I'll take another pass once pandana 0.7 is out!
This starts a new weights module that creates a neighbor relationship based on the shortest path through a pandana network.

Currently the weights are set to raw distance, so we need to include some inverse distance weighting functions, but this is a start for generating feedback.
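A sketch of what such an inverse distance weighting step could look like, working from a (origin, destination, cost) adjacency list (inverse_distance_weights is a hypothetical helper, not code from this PR):

```python
def inverse_distance_weights(adjlist, alpha=1.0, eps=1e-12):
    """Convert a (origin, destination, cost) adjacency list into
    row-standardised inverse-distance weights:
    w_ij = cost_ij**-alpha / sum_j cost_ij**-alpha.

    eps guards against division by zero for zero-cost pairs.
    """
    raw = {}
    for o, d, c in adjlist:
        raw.setdefault(o, []).append((d, 1.0 / (c + eps) ** alpha))
    weights = {}
    for o, pairs in raw.items():
        total = sum(w for _, w in pairs)
        # row-standardise so each origin's weights sum to 1
        weights[o] = [(d, w / total) for d, w in pairs]
    return weights
```

Different decay behaviour then only means swapping the 1/(c + eps)**alpha kernel, e.g. for an exponential decay, without touching the travel-cost computation.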