
Minimal PR to add cache dict to graphs and clear them #7344

Closed
wants to merge 1 commit

Conversation

@eriknw (Contributor) commented Mar 12, 2024

An even simpler start to caching than #7283, this PR creates a __networkx_cache__ dict on graph objects and clears it when appropriate.

This takes a conservative approach, clearing the cache in functions and methods whenever any graph data changes.

I'm pretty confident that all graph methods and dispatched functions clear the cache appropriately. There may be some non-dispatched functions that mutate the graph that aren't captured here.

This PR is minimal in that it doesn't do anything with the cache. By creating the cache, though, backends, users, and specific functions may begin experimenting with using the cache.
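To make the pattern concrete, here is a minimal sketch of what the PR describes: a plain dict hanging off the graph that mutating methods clear, and that consumers may use for memoization. CachedGraph and its methods are illustrative stand-ins, not the actual networkx implementation.

```python
# Hypothetical sketch of the pattern this PR introduces: a dict cache on the
# graph object, conservatively cleared whenever the graph mutates.

class CachedGraph:
    def __init__(self):
        self._adj = {}
        self.__networkx_cache__ = {}  # mirrors the attribute the PR adds

    def add_edge(self, u, v):
        self._adj.setdefault(u, set()).add(v)
        self._adj.setdefault(v, set()).add(u)
        # Conservative policy: any structural change invalidates everything.
        self.__networkx_cache__.clear()

    def number_of_edges(self):
        # Backends, users, or specific functions may memoize results here.
        cache = self.__networkx_cache__
        if "m" not in cache:
            cache["m"] = sum(len(nbrs) for nbrs in self._adj.values()) // 2
        return cache["m"]

G = CachedGraph()
G.add_edge(1, 2)
G.number_of_edges()       # computed once, stored under "m"
G.add_edge(2, 3)          # cache is now empty again
```

Note that names with leading and trailing double underscores are not subject to Python's private-name mangling, which is why the dunder spelling works as a public, cross-library attribute.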

Comment on lines +632 to +633

    if attr is not None:
        G.__networkx_cache__.clear()
Member:

I don't think changing attribute values should cause the cache to be cleared.

And if we decide that it does cause the cache to be cleared, then we should change the functions that compute a property and add it as an attribute to instead compute the property and return it as a dict keyed by node/edge to a value. That dict information can be "easily" added to the graph if desired with nx.set_node_attributes. But the idea that computing barycenter or flows (or other properties) makes all the cached values go away doesn't seem right to me.

Thoughts?
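A rough sketch of the API shape suggested above: compute a per-node property, return it as a dict keyed by node, and let callers opt in to storing it as node data. The degree_like function and the plain-dict adjacency are hypothetical stand-ins, not networkx code.

```python
# Hedged sketch: return a dict keyed by node instead of writing node
# attributes inside the algorithm.

def degree_like(adj):
    # Stands in for a barycenter-style computation; here, just node degree.
    return {node: len(nbrs) for node, nbrs in adj.items()}

adj = {1: {2}, 2: {1, 3}, 3: {2}}
prop = degree_like(adj)

# In networkx, a caller who wants node attributes could then write:
#     nx.set_node_attributes(G, prop, "degree_like")
# so only users who opt in pay any cache-invalidation cost.
```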

@eriknw (Contributor, Author) commented Mar 12, 2024

To me, node data is still part of the graph, and changing node data means changing the graph (which means clearing the cache if we want to be conservative and safe). Maybe caching can become smarter someday and selectively be cleared like you have suggested (for example, adding node data wouldn't change the number of connected components).

barycenter is a special case, and I can't find another algorithm like it. Is there some reason why it's beneficial to add data to nodes within the algorithm? Are these values typically used elsewhere? Why don't any of the centralities (pagerank, betweenness, etc) use the same pattern? edmonds_karp_core is the closest (and only other) function that's similar in that it updates the edges in the residual graph.

Even though I'm not very familiar with barycenter, it does "smell" weird to me. I think it would be much more natural to return a dict mapping nodes to their barycentricity.

To find where input graph mutation occurs, you can search for mutates_input= which is part of _dispatchable.

Member:

I think we should probably change barycenter to return a dict. But it also looks like the other two functions you treat specially in this PR store attributes on the graph that would not affect anything for cached values not directly related to those functions.

The main problem I see is that even if we track down all the code in NetworkX that changes the attributes in a graph, people are most likely also changing data attributes outside of the graph functions/methods. We can't track those changes, so that limits caching quite a bit. We have to warn users (and rely on them) not to change aspects of the graph that the cached values depend on. Or we need a way to check whether the values of the data have changed (like a hash). How do arrays do this? Are there array libraries that cache values based on the array?
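One rough shape the "hash the data" idea could take, as a sketch only: derive a token from the edge data and compare it before trusting a cached value. The data_token name is hypothetical, and this only works when attribute values are hashable.

```python
# Hypothetical sketch: a token over edge data that changes whenever any
# attribute value changes, so stale cached values can be detected.

def data_token(adj):
    # adj is a plain dict-of-dict-of-dicts standing in for graph adjacency.
    return hash(frozenset(
        (u, v, tuple(sorted(d.items())))
        for u, nbrs in adj.items()
        for v, d in nbrs.items()
    ))

adj = {"a": {"b": {"weight": 1}}, "b": {}}
before = data_token(adj)
adj["a"]["b"]["weight"] = 2   # out-of-band mutation networkx cannot see
after = data_token(adj)       # token differs, so the cache entry is stale
```

A cache entry could then store (token, value) pairs and recompute whenever the stored token no longer matches.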

Have you thought at all about caching the results on the backend graph object? Or on a middle-ground object between the backend graph and the networkx graph. For example, it could store the networkx graph as an attribute, and the cached objects in a dict _networkx_cache_ (don't dunders indicate Python private attributes? so shouldn't we be using unders? :)). We could have the networkx classes clear the networkx cache, and the backend cache could look at that to see whether the network structure has changed. And it could use any other methods needed to check that the data values are the same.

I think making functions outside the base classes clear the cache is fooling ourselves into a false sense of confidence that caching will work. We need users to feel that it is a sneaky, slightly dangerous move that might save lots of time. So they should probably have to turn it on for a function by using a keyword arg, and we'll need the docstrings for that keyword arg to warn them not to change the data values between calls. It should probably also point to a networkx doc page that describes caching and the potential dangers therein. That way we are still in "the wild west", where anything goes. But users can still take advantage of caching if they are careful. And ideally, users can easily figure out which changes are dangerous for caching.

@@ -50,4 +50,5 @@ def stochastic_graph(G, copy=True, weight="weight"):
             d[weight] = 0
         else:
             d[weight] = d.get(weight, 1) / degree[u]
+    G.__networkx_cache__.clear()
Member:

For stochastic_graph, I think the fix is to change copy from making an entirely new graph to creating a new attribute that holds the stochastic weight. There is no reason to copy the whole graph just to save the original weight values.
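A sketch of that suggestion, using a plain dict-of-dicts in place of a DiGraph adjacency. The names add_stochastic_weight and stochastic_weight are hypothetical, not the actual networkx API.

```python
# Hedged sketch: write the normalized weight under a new key so the
# original weights survive and no graph copy is needed.

def add_stochastic_weight(adj, weight="weight", new_key="stochastic_weight"):
    # adj: {u: {v: data_dict}} standing in for directed adjacency.
    for u, nbrs in adj.items():
        degree = sum(d.get(weight, 1) for d in nbrs.values())
        for v, d in nbrs.items():
            w = d.get(weight, 1)
            # Mirrors the zero-degree branch in stochastic_graph.
            d[new_key] = 0 if degree == 0 else w / degree

adj = {"a": {"b": {"weight": 1}, "c": {"weight": 3}}, "b": {}, "c": {}}
add_stochastic_weight(adj)
# adj["a"]["b"] now holds both the original weight and the stochastic one.
```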

@MridulS (Member) commented Mar 31, 2024

continued in #7345

@MridulS closed this Mar 31, 2024
3 participants