Minimal PR to add cache dict to graphs and clear them #7344
Conversation
```python
if attr is not None:
    G.__networkx_cache__.clear()
```
I don't think changing attribute values should cause the cache to be cleared.

And if we decide that it should, then we should change the functions that compute a property and add it as an attribute to instead compute the property and return it as a dict keyed by node/edge to a value. That dict can still be "easily" added to the graph, if desired, with `nx.set_node_attributes`. But the idea that computing `barycenter` or flows (or other properties) makes all the cached values go away doesn't seem right to me.

Thoughts?
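The dict-returning pattern described above might look like the following sketch. The helper `node_degrees` is a hypothetical stand-in for a real property computation such as barycenter; only `nx.set_node_attributes` is actual networkx API:

```python
import networkx as nx

def node_degrees(G):
    # Hypothetical stand-in for a property computation: instead of writing
    # values onto the graph, return them as a dict keyed by node.
    return {n: d for n, d in G.degree()}

G = nx.path_graph(4)
values = node_degrees(G)

# The caller, not the algorithm, decides whether to attach the values:
nx.set_node_attributes(G, values, name="degree")
```

With this split, the algorithm itself never mutates the graph, so it gives the caching layer no reason to invalidate anything.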
To me, node data is still part of the graph, and changing node data means changing the graph (which means clearing the cache if we want to be conservative and safe). Maybe caching can become smarter someday and selectively be cleared like you have suggested (for example, adding node data wouldn't change the number of connected components).
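A minimal sketch of what such selective clearing could look like. This is entirely hypothetical; the dependency tags and cache layout are invented for illustration:

```python
# Each cache entry is tagged with the kind of graph change it depends on:
# "structure" (nodes/edges added or removed) or "data" (attribute values).
cache = {
    ("number_connected_components", "structure"): 3,
    ("total_edge_weight", "data"): 7.5,
}

def invalidate(cache, changed):
    """Remove only the entries whose tag matches the kind of change."""
    for key in [k for k in cache if k[1] == changed]:
        del cache[key]

# Changing node/edge data would leave purely structural results intact:
invalidate(cache, "data")
```

The conservative policy in this PR corresponds to ignoring the tags and clearing everything.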
`barycenter` is a special case, and I can't find another algorithm like it. Is there some reason why it's beneficial to add data to nodes within the algorithm? Are these values typically used elsewhere? Why don't any of the centralities (pagerank, betweenness, etc.) use the same pattern? `edmonds_karp_core` is the closest (and only other) function that's similar, in that it updates the edges in the residual graph.

Even though I'm not very familiar with `barycenter`, it does "smell" weird to me. I think it would be much more natural to return a dict mapping nodes to their barycentricity.

To find where input-graph mutation occurs, you can search for `mutates_input=`, which is part of `_dispatchable`.
I think we should probably change `barycenter` to return a dict. But it also looks like the other two functions you treat specially in this PR store attributes on the graph that would not affect anything for cached values not directly related to those functions.

The main problem I see is that even if we track down all the code in NetworkX that changes the attributes in a graph, most likely people are changing data attributes outside of the graph functions/methods. We can't track those changes, so that limits caching quite a bit. We have to warn users (and rely on them) not to change aspects of the graph that the cached values depend on. Or -- we need a way to check whether the values of the data have changed (like a hash). How do arrays do this? Are there array libraries that cache values based on the array?
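One rough sketch of the hash idea: digest a canonical serialization of the attribute data and compare digests between calls. Assumptions: attribute data is JSON-serializable, and this is not an existing networkx mechanism:

```python
import hashlib
import json

def data_digest(attrs):
    """Digest of a node/edge attribute mapping. Comparing digests between
    calls detects out-of-band changes to the data (at serialization cost)."""
    payload = json.dumps(sorted(attrs.items()), sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

before = data_digest({1: {"color": "red"}, 2: {"color": "blue"}})
after = data_digest({1: {"color": "red"}, 2: {"color": "green"}})
# a changed attribute value yields a different digest
```

The obvious trade-off is that hashing all the data can cost as much as recomputing cheap properties, so it only pays off for expensive ones.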
Have you thought at all about caching the results on the backend graph object? Or on a middle-ground object between the backend graph and the networkx graph. For example, it could store the networkx graph as an attribute, and the cached objects in a dict `_networkx_cache_` (don't dunders indicate Python private attributes? So shouldn't we be using single underscores? :)). We could have the networkx classes `clear` the networkx cache, and the backend cache could look at that to see whether the network structure has changed. And it could use any other methods needed to check that the data values are the same.

I think making functions outside the base classes clear the cache is fooling ourselves into a false sense of confidence that caching will work. We need users to feel that it is a sneaky, slightly dangerous move that might save lots of time. So they should probably have to turn it on for a function by using a keyword arg, and we'll need the docstrings for that keyword arg to warn them not to change the data values between calls. It should probably also point to a networkx doc page that describes caching and the potential dangers therein. That way we are still in "the wild west", where anything goes, but users can still take advantage of caching if they are careful. And ideally, users can easily figure out which changes are dangerous for caching.
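A hypothetical sketch of the opt-in keyword-argument idea. The `Graph` class and `edge_count` function here are invented stand-ins, not networkx API; only the `__networkx_cache__` attribute name comes from this PR:

```python
class Graph:
    """Tiny stand-in for a graph carrying the cache dict proposed here."""
    def __init__(self, edges):
        self.edges = list(edges)
        self.__networkx_cache__ = {}

def edge_count(G, *, cache=False):
    # Caching is off by default; callers opt in with cache=True and accept
    # that stale results are possible if they mutate data between calls.
    if cache and "edge_count" in G.__networkx_cache__:
        return G.__networkx_cache__["edge_count"]
    result = len(G.edges)  # stand-in for an expensive computation
    if cache:
        G.__networkx_cache__["edge_count"] = result
    return result

G = Graph([(0, 1), (1, 2)])
edge_count(G)              # computes; cache untouched
edge_count(G, cache=True)  # computes and stores the result
edge_count(G, cache=True)  # served from the cache
```

Since the default is `cache=False`, existing callers see no behavior change, and the docstring for the keyword is the natural place for the "don't mutate between calls" warning.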
```diff
@@ -50,4 +50,5 @@ def stochastic_graph(G, copy=True, weight="weight"):
             d[weight] = 0
         else:
             d[weight] = d.get(weight, 1) / degree[u]
+    G.__networkx_cache__.clear()
```
For `stochastic_graph`, I think the fix is to change from `copy` making an entirely new graph to allowing it to create a new attribute that holds the stochastic weight. There is no reason to copy the whole graph just to save the original weight values.
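A sketch of that alternative. The function and the `stochastic_weight` attribute name are hypothetical, not the actual `stochastic_graph` API; the normalization mirrors the diff above:

```python
import networkx as nx

def add_stochastic_weight(G, weight="weight", new_weight="stochastic_weight"):
    """Store the normalized weight under a new edge attribute instead of
    copying the whole graph; the original weight values are left untouched."""
    degree = dict(G.out_degree(weight=weight))
    for u, v, d in G.edges(data=True):
        d[new_weight] = d.get(weight, 1) / degree[u]
    return G

G = nx.DiGraph([(0, 1), (0, 2)])
add_stochastic_weight(G)
# each out-edge of node 0 is normalized by its weighted out-degree,
# and no "weight" key is ever added or modified
```

Because nothing is overwritten, callers can keep using the original weights and the normalized ones side by side.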
continued in #7345
An even simpler start to caching than #7283, this PR creates a `__networkx_cache__` dict on graph objects and clears it when appropriate. It takes a conservative approach, clearing the cache in functions and methods whenever any graph data changes.

I'm pretty confident that all graph methods and dispatched functions clear the cache appropriately. There may be some non-dispatched functions that mutate the graph that aren't captured here.

This PR is minimal in that it doesn't do anything with the cache. By creating the cache, though, backends, users, and specific functions may begin experimenting with using it.
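As a rough illustration of how such an experiment might look. Whether the attribute exists, and exactly when it is cleared, depends on the networkx version installed, so this sketch falls back to a plain dict:

```python
import networkx as nx

G = nx.Graph([(0, 1)])

# Fall back to a throwaway dict on versions that predate this PR:
cache = getattr(G, "__networkx_cache__", {})

# Anything (a backend, a user, a specific function) may stash results here:
cache["expensive_result"] = 42

# Under this PR's conservative policy, any mutation through the Graph API
# is expected to empty the cache again:
G.add_edge(1, 2)
```

Since keys are unreserved, experimenters would need to agree on a naming convention to avoid collisions between backends and user code.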