Minimal PR to add cache dict to graphs and clear them #7344
Conversation
```python
if attr is not None:
    G.__networkx_cache__.clear()
```
I don't think changing attribute values should cause the cache to be cleared.

And if we decide that it should, then we should change the functions that compute a property and add it as an attribute to instead compute the property and return it as a dict keyed by node/edge to a value. That dict can still be "easily" added to the graph, if desired, with `nx.set_node_attributes`. But the idea that computing `barycenter` or flows (or other properties) makes all the cached values go away doesn't seem right to me.

Thoughts?
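The dict-returning pattern described above might look like the following sketch. The helper `node_degrees` is a hypothetical stand-in for a real property computation such as barycenter; only `nx.set_node_attributes` is actual networkx API:

```python
import networkx as nx

def node_degrees(G):
    # Hypothetical stand-in for a property computation: instead of writing
    # values onto the graph, return them as a dict keyed by node.
    return {n: d for n, d in G.degree()}

G = nx.path_graph(4)
values = node_degrees(G)

# The caller, not the algorithm, decides whether to attach the values:
nx.set_node_attributes(G, values, name="degree")
```

With this split, the algorithm itself never mutates the graph, so it gives the caching layer no reason to invalidate anything.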
To me, node data is still part of the graph, and changing node data means changing the graph (which means clearing the cache if we want to be conservative and safe). Maybe caching can become smarter someday and selectively be cleared like you have suggested (for example, adding node data wouldn't change the number of connected components).
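A minimal sketch of what such selective clearing could look like. This is entirely hypothetical; the dependency tags and cache layout are invented for illustration:

```python
# Each cache entry is tagged with the kind of graph change it depends on:
# "structure" (nodes/edges added or removed) or "data" (attribute values).
cache = {
    ("number_connected_components", "structure"): 3,
    ("total_edge_weight", "data"): 7.5,
}

def invalidate(cache, changed):
    """Remove only the entries whose tag matches the kind of change."""
    for key in [k for k in cache if k[1] == changed]:
        del cache[key]

# Changing node/edge data would leave purely structural results intact:
invalidate(cache, "data")
```

The conservative policy in this PR corresponds to ignoring the tags and clearing everything.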
`barycenter` is a special case, and I can't find another algorithm like it. Is there some reason why it's beneficial to add data to nodes within the algorithm? Are these values typically used elsewhere? Why don't any of the centralities (pagerank, betweenness, etc.) use the same pattern? `edmonds_karp_core` is the closest (and only other) function that's similar, in that it updates the edges in the residual graph.

Even though I'm not very familiar with `barycenter`, it does "smell" weird to me. I think it would be much more natural to return a dict mapping nodes to their barycentricity.

To find where input-graph mutation occurs, you can search for `mutates_input=`, which is part of `_dispatchable`.
I think we should probably change `barycenter` to return a dict. But it also looks like the other two functions you treat specially in this PR store attributes on the graph that would not affect anything for cached values not directly related to those functions.

The main problem I see is that even if we track down all the code in NetworkX that changes the attributes in a graph, most likely people are changing data attributes outside of the graph functions/methods. We can't track those changes, so that limits caching quite a bit. We have to warn users (and rely on them) not to change aspects of the graph that the cached values depend on. Or -- we need a way to check whether the values of the data have changed (like a hash). How do arrays do this? Are there array libraries that cache values based on the array?
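One rough sketch of the hash idea: digest a canonical serialization of the attribute data and compare digests between calls. Assumptions: attribute data is JSON-serializable, and this is not an existing networkx mechanism:

```python
import hashlib
import json

def data_digest(attrs):
    """Digest of a node/edge attribute mapping. Comparing digests between
    calls detects out-of-band changes to the data (at serialization cost)."""
    payload = json.dumps(sorted(attrs.items()), sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()

before = data_digest({1: {"color": "red"}, 2: {"color": "blue"}})
after = data_digest({1: {"color": "red"}, 2: {"color": "green"}})
# a changed attribute value yields a different digest
```

The obvious trade-off is that hashing all the data can cost as much as recomputing cheap properties, so it only pays off for expensive ones.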
Have you thought at all about caching the results on the backend graph object? Or on a middle-ground object between the backend graph and the networkx graph. For example, it could store the networkx graph as an attribute, and the cached objects in a dict `_networkx_cache_` (don't dunders indicate Python private attributes? So shouldn't we be using single underscores? :)). We could have the networkx classes `clear` the networkx cache, and the backend cache could look at that to see whether the network structure has changed. And it could use any other methods needed to check that the data values are the same.

I think making functions outside the base classes clear the cache is fooling ourselves into a false sense of confidence that caching will work. We need users to feel that it is a sneaky, slightly dangerous move that might save lots of time. So they should probably have to turn it on for a function by using a keyword arg, and we'll need the docstrings for that keyword arg to warn them not to change the data values between calls. It should probably also point to a networkx doc page that describes caching and the potential dangers therein. That way we are still in "the wild west", where anything goes, but users can still take advantage of caching if they are careful. And ideally, users can easily figure out which changes are dangerous for caching.
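A hypothetical sketch of the opt-in keyword-argument idea. The `Graph` class and `edge_count` function here are invented stand-ins, not networkx API; only the `__networkx_cache__` attribute name comes from this PR:

```python
class Graph:
    """Tiny stand-in for a graph carrying the cache dict proposed here."""
    def __init__(self, edges):
        self.edges = list(edges)
        self.__networkx_cache__ = {}

def edge_count(G, *, cache=False):
    # Caching is off by default; callers opt in with cache=True and accept
    # that stale results are possible if they mutate data between calls.
    if cache and "edge_count" in G.__networkx_cache__:
        return G.__networkx_cache__["edge_count"]
    result = len(G.edges)  # stand-in for an expensive computation
    if cache:
        G.__networkx_cache__["edge_count"] = result
    return result

G = Graph([(0, 1), (1, 2)])
edge_count(G)              # computes; cache untouched
edge_count(G, cache=True)  # computes and stores the result
edge_count(G, cache=True)  # served from the cache
```

Since the default is `cache=False`, existing callers see no behavior change, and the docstring for the keyword is the natural place for the "don't mutate between calls" warning.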
```diff
@@ -50,4 +50,5 @@ def stochastic_graph(G, copy=True, weight="weight"):
             d[weight] = 0
         else:
             d[weight] = d.get(weight, 1) / degree[u]
+    G.__networkx_cache__.clear()
```
For `stochastic_graph`, I think the fix is to change from `copy` making an entirely new graph to allowing it to create a new attribute that holds the stochastic weight. There is no reason to copy the whole graph just to save the original weight values.
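A sketch of that alternative. The function and the `stochastic_weight` attribute name are hypothetical, not the actual `stochastic_graph` API; the normalization mirrors the diff above:

```python
import networkx as nx

def add_stochastic_weight(G, weight="weight", new_weight="stochastic_weight"):
    """Store the normalized weight under a new edge attribute instead of
    copying the whole graph; the original weight values are left untouched."""
    degree = dict(G.out_degree(weight=weight))
    for u, v, d in G.edges(data=True):
        d[new_weight] = d.get(weight, 1) / degree[u]
    return G

G = nx.DiGraph([(0, 1), (0, 2)])
add_stochastic_weight(G)
# each out-edge of node 0 is normalized by its weighted out-degree,
# and no "weight" key is ever added or modified
```

Because nothing is overwritten, callers can keep using the original weights and the normalized ones side by side.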
continued in #7345
An even simpler start to caching than #7283, this PR creates a `__networkx_cache__` dict on graph objects and clears it when appropriate. It takes a conservative approach, clearing the cache in functions and methods whenever any graph data changes.

I'm pretty confident that all graph methods and dispatched functions clear the cache appropriately. There may be some non-dispatched functions that mutate the graph that aren't captured here.

This PR is minimal in that it doesn't do anything with the cache. By creating the cache, though, backends, users, and specific functions may begin experimenting with using it.
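As a rough illustration of how such an experiment might look. Whether the attribute exists, and exactly when it is cleared, depends on the networkx version installed, so this sketch falls back to a plain dict:

```python
import networkx as nx

G = nx.Graph([(0, 1)])

# Fall back to a throwaway dict on versions that predate this PR:
cache = getattr(G, "__networkx_cache__", {})

# Anything (a backend, a user, a specific function) may stash results here:
cache["expensive_result"] = 42

# Under this PR's conservative policy, any mutation through the Graph API
# is expected to empty the cache again:
G.add_edge(1, 2)
```

Since keys are unreserved, experimenters would need to agree on a naming convention to avoid collisions between backends and user code.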