Automatically cache results from simple functions #7283
Conversation
networkx/classes/graph.py
self.graph = self.graph_attr_dict_factory()  # dictionary for graph attributes
self._node = self.node_dict_factory()  # empty node attribute dict
self._adj = self.adjlist_outer_dict_factory()  # empty adjacency dict
self.__networkx_cache__ = {}
Should we add e.g. a cache_dict_factory class attribute and use it here, just like how all the other dict attributes use factories?
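For illustration, a hypothetical cache_dict_factory would mirror the existing factory pattern (the factory attribute and the LoggingDict subclass below are sketches for this discussion, not part of the PR):

```python
class Graph:
    # Existing factory-style class attributes (simplified from networkx):
    graph_attr_dict_factory = dict
    node_dict_factory = dict
    adjlist_outer_dict_factory = dict
    # Hypothetical addition being debated -- not in the PR as written:
    cache_dict_factory = dict

    def __init__(self):
        self.graph = self.graph_attr_dict_factory()
        self._node = self.node_dict_factory()
        self._adj = self.adjlist_outer_dict_factory()
        self.__networkx_cache__ = self.cache_dict_factory()


class LoggingDict(dict):
    """A 'user-grown cache object': logs every cache write."""

    def __setitem__(self, key, value):
        print(f"cache set: {key!r}")
        super().__setitem__(key, value)


class LoggedGraph(Graph):
    cache_dict_factory = LoggingDict


G = LoggedGraph()
G.__networkx_cache__["diameter"] = 3  # prints "cache set: 'diameter'"
```

This shows the kind of customization a factory would enable -- whether anyone actually needs it is the open question.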
My immediate reaction is: No, we should not have a cache_dict_factory -- unless we believe that someone would want to replace that dict with a user-grown cache object. I guess we can always imagine that. OK... so why is my reaction like that?
The dict_factories have been a pain to maintain and haven't resulted in the backend kind of behavior we envisioned could occur. Indeed, one could argue that with _dispatchable we can now get rid of the dict_factories in the classes. They do serve different purposes, so maybe that argument wouldn't have a clear outcome. But much of the justification for adding them is becoming moot.
I see the cache dict as an easy/quick/flexible data structure, not a fancy data structure. But maybe I'm wrong. Do people see wanting a logging feature on the cache? Or something similar? If so, would the attribute be created more than once? That is, would we need a factory?

One reason for the dict_factories is that each node and edge added creates a new dict -- and that is spread out in the code base. Changing the object from a dict to a special object would need to be done carefully with exhaustive searches, because there are many ways to create new dicts and many places in nx where we e.g. create edges. I foresee us clearing the cache dict in many places, but not creating a new one. There is only a cache for the Graph, not for each edge of the graph.
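The "clear, don't recreate" point can be sketched in a few lines (the minimal Graph class here is a stand-in, not the PR's code): mutation empties the existing dict in place, so any outside reference to the cache stays valid.

```python
class Graph:
    """Minimal stand-in for a graph carrying a single per-graph cache."""

    def __init__(self):
        self._adj = {}
        self.__networkx_cache__ = {}  # one cache per graph, not per edge

    def add_edge(self, u, v):
        self._adj.setdefault(u, {})[v] = {}
        self._adj.setdefault(v, {})[u] = {}
        # Mutation invalidates cached results: clear in place rather than
        # rebinding the attribute, so external references (e.g. held by a
        # backend) keep seeing the same -- now empty -- object.
        self.__networkx_cache__.clear()


G = Graph()
G.__networkx_cache__["diameter"] = 3
cache_ref = G.__networkx_cache__  # reference held elsewhere
G.add_edge(1, 2)
assert cache_ref is G.__networkx_cache__  # same object, now empty
```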
But I might be missing something. Ideas welcome!
Whoops, I messed up the merge conflict resolution.
np, I'll try to redo the merge.
Some questions which may or may not make much sense:
- Why are we clearing the cache in distance_measures.py:barycenter? That function does update the graph, but it only updates a node attribute to hold the barycenter.
- Why/when do we clear all the contents of the cache, as opposed to only clearing the items that depend on what we have changed?
- This approach of caching all and only function results that are scalars seems strange to me. The user can "cache" scalars in their own structures. What computations are we avoiding with this cache?
- This also feels like a wide net which could lead to confusion (though would be contained by the restriction to it being only for backend results). What are you envisioning this helping with? How does it improve GraphBLAS or cuGraph via NX? I think I've lost the big picture goal.
I was thinking of caching as being an option for a select few functions where intermediate values are computed and cached so that other functions could skip those computations. Is that what this PR is working toward too? I don't see these as intermediate results -- but I suspect my imagination hasn't seen something over the walls it is stuck between. :)
More random thoughts still not fully formed about caching:
I think my main concern is "when do we clear the dict"? I don't want to avoid computing barycenter because it is going to clear my cache. I don't even want to think about losing cache results when I am calling a function to compute new values.

With that in mind, clearing the dict for NetworkX caching is different from clearing it for backends. I'm sure they interact with each other -- for example, if the NetworkX graph gets converted to a backend graph multiple times, the second conversion would not include the cached backend computations. But let's make caching of the converted graph work so that isn't an issue. Then silo-ing the caches to the libraries that create the cache avoids collisions.

Here's a probably-too-intricate proposal. What do people think?
Caching the backend converted graphs: (We definitely want this!)
Are there things I am missing here? Am I including things that we shouldn't be worrying about? Thoughts?
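The silo-ing idea could look like namespacing cache keys by the library that wrote them. A rough sketch of such a layout (the key names here are made up for illustration):

```python
cache = {}  # plays the role of G.__networkx_cache__

# Each library writes under its own namespace key, avoiding collisions.
cache.setdefault("networkx", {})["diameter"] = 3
cache.setdefault("backends", {}).setdefault("cugraph", {})["converted"] = "<gpu graph>"

# NetworkX-level invalidation can then clear its own silo without
# discarding the cached backend conversions (and vice versa).
cache["networkx"].clear()
assert "converted" in cache["backends"]["cugraph"]
```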
I guess the most difficult issue to resolve is that if an edge attribute gets changed, we won’t know to clear the cache.
How about if we create a hash from the graph, including the edge attributes used by that function -- something like this (include any attributes used by that function):

But backends might care about dtype. So they would need to find a way to make a cache_key that depends on the dtype and format. Maybe add the dtype string and the format string to the tuple being hashed. I suppose we could do that in the networkx functions too.
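One way to read the hashing idea (the function name and signature below are hypothetical, not from the PR): build the cache key from the edges, only the attribute(s) the function actually uses, and -- for backends -- the dtype and format strings.

```python
def make_cache_key(edges, weight="weight", dtype=None, fmt=None):
    """Hypothetical sketch of a graph-content cache key.

    edges: iterable of (u, v, data_dict) triples.
    Only the attribute named by `weight` is hashed, so changes to
    unrelated edge attributes don't invalidate the key.
    """
    edge_part = tuple(sorted((u, v, data.get(weight)) for u, v, data in edges))
    # Backends may care about dtype and format; fold the strings into
    # the tuple being hashed so their cache entries stay distinct.
    return hash((edge_part, weight, dtype, fmt))


edges = [(0, 1, {"weight": 2.5}), (1, 2, {"weight": 1.0})]
k1 = make_cache_key(edges)
k2 = make_cache_key(edges, dtype="float32", fmt="csr")
```

The open cost question remains: computing this key touches every edge, so it only pays off when the cached computation is much more expensive than a pass over the graph.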
@eriknw and I discussed this offline the other day, and I have some observations and thoughts thanks to @dschult's thoughtful comments above. These only touch on a few of Dan's comments, and I'll have to think more about the others:
This PR introduces caching by using G.__networkx_cache__ (which any backend graph can have). It only caches the simplest functions -- a single (graph) argument that returns a scalar. One can imagine adding caching for other functions (we'd need to calculate a key from the input arguments). Also on the horizon (but probably not for this PR): automatically caching conversion to backend graphs.

I think this PR needs #7191 to know which functions mutate inputs (or maybe we can look at mutates_input=True in #7191 and clear the cache manually when necessary). Config (#7225) may be useful as well to enable/disable caching.
Also related: #7233
CC @rlratzel