Automatically cache results from simple functions #7283

Draft
eriknw wants to merge 12 commits into main

Conversation

@eriknw (Contributor) commented Feb 10, 2024

This PR introduces caching by using G.__networkx_cache__ (which any backend graph can have). It only caches the simplest functions -- those that take a single (graph) argument and return a scalar. One can imagine adding caching for other functions (we'd need to compute a key from the input arguments). Also on the horizon (but probably not for this PR): automatically caching conversions to backend graphs.
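
For illustration, a minimal sketch of the idea (the decorator and function below are hypothetical; this is not the PR's actual code, which lives inside the dispatch machinery rather than a standalone decorator):

    # Sketch only: cache the scalar result of a single-graph-argument function
    # on the graph itself, using G.__networkx_cache__ when it exists.
    import functools
    import networkx as nx

    def cached_scalar(func):  # hypothetical helper, not part of the PR
        @functools.wraps(func)
        def wrapper(G):
            cache = getattr(G, "__networkx_cache__", None)
            if cache is None:          # graph does not participate in caching
                return func(G)
            key = func.__name__
            if key not in cache:       # compute once, then reuse
                cache[key] = func(G)
            return cache[key]
        return wrapper

    @cached_scalar
    def number_of_isolates(G):
        return sum(1 for _ in nx.isolates(G))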

I think this PR needs #7191 to know which functions mutate inputs (or maybe we can look at mutates_input=True in #7191 and clear the cache manually when necessary).

Config (#7225) may be useful as well to enable/disable caching.

Also related: #7233

CC @rlratzel

@dschult added the "type: Enhancements" and "Dispatching" (Related to dispatching and backend support) labels on Feb 18, 2024
Comment on lines 365 to 368
self.graph = self.graph_attr_dict_factory() # dictionary for graph attributes
self._node = self.node_dict_factory() # empty node attribute dict
self._adj = self.adjlist_outer_dict_factory() # empty adjacency dict
self.__networkx_cache__ = {}
@eriknw (Contributor, Author):

Should we add e.g. a cache_dict_factory class attribute and use it here, just like all the other dict attributes use factories?
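
(For context, a factory-based version of that line might look like the hypothetical sketch below; the PR itself just assigns a plain dict and no cache_dict_factory exists.)

    # Hypothetical sketch of the factory pattern being asked about.
    class Graph:
        node_dict_factory = dict             # existing pattern: subclasses may override
        adjlist_outer_dict_factory = dict
        cache_dict_factory = dict            # the proposed (hypothetical) addition

        def __init__(self):
            self._node = self.node_dict_factory()
            self._adj = self.adjlist_outer_dict_factory()
            self.__networkx_cache__ = self.cache_dict_factory()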

@dschult (Member) commented Mar 8, 2024:

My immediate reaction is: No, we should not have a cache_dict_factory -- unless we believe that someone would want to replace that dict with a user-grown cache object. I guess we can always imagine that. OK... so why is my reaction like that?

The dict_factories have been a pain to maintain and haven't resulted in the kind of backend behavior we envisioned. Indeed, one could argue that with _dispatchable we could now get rid of the dict_factories in the classes. They do serve different purposes, so maybe that argument wouldn't have a clear outcome. But much of the justification for adding them is becoming moot.

I see the cache dict as an easy/quick/flexible data structure, not a fancy data structure. But maybe I'm wrong. Do people see wanting a logging feature on the cache? Or something similar? If so, would the attribute be created more than once? That is, would we need a factory? One reason for the dict_factories is that each node and edge added creates a new dict -- and that is spread out in the code base. Changing the object from a dict to a special object would need to be done carefully with exhaustive searches because there are many ways to create new dicts and many places in nx where we e.g. create edges. I foresee us clearing the cache dict in many places, but not creating a new one. There is only a cache for the Graph, not for each edge of the graph.

But I might be missing something. Ideas welcome!

@MridulS (Member) commented Feb 28, 2024

Whoops, I messed up the merge conflict resolution.

@eriknw (Contributor, Author) commented Feb 28, 2024

np, I'll try to redo the merge.

@dschult (Member) left a review:

Some questions which may or may not make much sense:

  • Why are we clearing the cache in distance_measures.py:barycenter? It does update the graph, but only by setting a node attribute to hold the barycenter.
  • Why/when do we clear the entire cache, as opposed to only clearing the items that depend on what changed?
  • This approach of caching all and only function results that are scalars seems strange to me. The user can "cache" scalars in their own structures. What computations are we avoiding with this cache?
  • This also feels like a wide net that could lead to confusion (though it would be contained by the restriction to backend results only). What are you envisioning this helping with? How does it improve GraphBLAS or cuGraph via NX? I think I've lost sight of the big-picture goal.

I was thinking of caching as being an option for a select few functions where intermediate values are computed and cached so that other functions could skip those computations. Is that what this PR is working toward too? I don't see these as intermediate results -- but I suspect my imagination hasn't seen something over the walls it is stuck between. :)

@dschult (Member) commented Mar 8, 2024

More random thoughts still not fully formed about caching:

  • In terms of timing -- this caching feature could be provided by the backend graph object. There's no need to have this in NetworkX to get the speed-ups this would provide for a backend. So I think applying a PR similar to this one to the backend graph object could relieve any timing pressure on this PR.
  • An advantage of backend caching is that it silos the cache. When the barycenter function clears the cache, it only clears the values stored for that backend -- not all the cached values for all possible backends.

I think my main concern is "when do we clear the dict"? I don't want to avoid computing barycenter because it is going to clear my cache. I don't even want to think about losing cache results when I am calling a function to compute new values.

With that in mind, clearing the dict for NetworkX caching is different from clearing it for backends. I'm sure they interact with each other -- for example if the NetworkX graph gets converted to a backend graph multiple times the second conversion would not include the cached backend computations. But let's make caching of the converted graph work so that isn't an issue. Then silo-ing the caches to the libraries that create the cache avoids collisions.

Here's a probably-too-intricate proposal. What do people think?

  • Store cache values based on what kind of changes will cause it to be cleared:
    • clear_for_node_or_edge_changes, clear_for_node_changes_only. I don't think we want an "edge-only" because edge computations are so intertwined with changing nodes.
    • add_node would clear these. add_edge would clear only the first one.
    • One could imagine adding categories for changes to attributes: clear_for_node_attr_changes, clear_for_edge_attr_changes. But I don't think the base classes should clear anything based on changes to the attribute values. The base classes have no way to track the attribute values. Attribute dicts are designed to be independent of the graph structure, and should be tracked in other ways. So I would prefer not to have these categories.
    • One could imagine a keep_even_for_structure_changes group would be for values that functions want to store across function calls and they shouldn't be cleared between calls. I'm not sure what the motive would be, but it ensures the value stays around.
  • The cached values could be stored as separate Graph attributes, or as sub-dicts within a cache dict attribute.
  • Individual functions (nx, backend, or user functions) would decide which dict their cached value gets stored in, which determines which base-class changes clear the value. Generally, functions should not clear other functions' cached values, and the cache key should be chosen to avoid collisions. Groups of functions could clear cached values of other functions in the group where appropriate, but this should happen rarely and only when functions are clearly part of the same group/module/subpackage. (A rough sketch of this layout follows at the end of this comment.)

Caching the backend converted graphs: (We definitely want this!)

  • We may be able to silo the caching via the NetworkX cache dict by a convention where backends put their own cache dict inside the NetworkX cache dict, and clear only the dict they construct for their backend -- never the NetworkX cache dict itself.
  • But I think wherever possible, it would be better to cache the values on the backend's graph object.
    • there is clear control when we store cached values on the graph object that is being calculated against.
    • One backend won't clear the other backend's values.
    • There is no confusion when a backend changes its graph object (control is separated).
    • This removes a hefty weight from the networkx graph cache when multiple backends are used.
  • Caching the backend-converted graph on the NetworkX graph is encouraged. I envision a function like nx_parallel_convert_from_nx storing a single key/value pair in the networkx cache under clear_for_node_or_edge_changes.
    • The key/value pair could be something like: nx_parallel_converted_graph: G_parallel.
    • The backend would be in charge of clearing that specific link if and when a node or edge is changed on the backend graph object.
    • NetworkX would clear the link between the two when the nx.Graph gets changed.

Are there things I am missing here? Am I including things that we shouldn't be worrying about? Thoughts?
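
A rough sketch of the category layout proposed above (all names here are hypothetical; nothing like this exists in the PR):

    # Hypothetical: cached values grouped by what kind of structural change clears them.
    class CategorizedCache:
        CATEGORIES = ("clear_for_node_or_edge_changes", "clear_for_node_changes_only")

        def __init__(self):
            self._caches = {category: {} for category in self.CATEGORIES}

        def store(self, category, key, value):
            self._caches[category][key] = value

        def get(self, category, key, default=None):
            return self._caches[category].get(key, default)

        def on_add_node(self):
            # Changing nodes invalidates both groups ("add_node would clear these").
            for cache in self._caches.values():
                cache.clear()

        def on_add_edge(self):
            # Changing edges invalidates only the first group.
            self._caches["clear_for_node_or_edge_changes"].clear()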

@dschult (Member) commented Mar 8, 2024

I guess the most difficult issue to resolve is that if an edge attribute gets changed, we won’t know to clear the cache.

@dschult (Member) commented Mar 10, 2024

How about we create a hash from the graph, including the edge attributes used by that function? The idea would be not to clear the cache at the function level, but to use a cached value only when the hash matches.

Something like (include any attributes used by that function):
scipy_sparse_cache_key = hash(tuple(G.nodes) + tuple(G.edges(data="weight", default=1)))
This doesn't keep track of things like dtype and format. But I don't think we'll need to. None of our algorithms depend on the dtype.

But backends might care about dtype. So they would need to find a way to make a cache key that depends on the dtype and format. Maybe add the dtype string and the format string to the tuple being hashed. I suppose we could do that in the networkx functions too.
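
As a rough illustration of how a function might use such a key (the function name and key layout are hypothetical, not part of the PR; it assumes hashable node ids and that scipy is installed):

    import networkx as nx

    def to_scipy_sparse_cached(G, weight="weight"):
        # Key depends on the node set and the edge data used by the conversion.
        key = ("scipy_sparse", hash(tuple(G.nodes) + tuple(G.edges(data=weight, default=1))))
        cache = getattr(G, "__networkx_cache__", None)
        if cache is None:                     # graph has no cache dict
            return nx.to_scipy_sparse_array(G, weight=weight)
        if key not in cache:                  # recompute only when the hash changes
            cache[key] = nx.to_scipy_sparse_array(G, weight=weight)
        return cache[key]                     # note: stale keys for old hashes linger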

@rlratzel commented Mar 13, 2024

@eriknw and I discussed this offline the other day, and I have some observations and thoughts thanks to @dschult's thoughtful comments above. These only touch on a few of Dan's comments, and I'll have to think more about the others:

  • I think the initial approach is to be very conservative with caching and err on the side of re-computing. This should hopefully give us more assurance of correctness, and at worst we'll have the same performance. I'm assuming we'll start optimizing by doing more intelligent cache clearing when this initial implementation has more mileage. I think that's why Erik is clearing on a call to barycenter since it changes the Graph.
  • I think the initial approach of caching scalars is to also help us ease into caching in NX, and prove out the new infrastructure. I'm assuming caching scalars still has value for things like naïve/unoptimized user code that calls something like number_connected_components repeatedly?
  • I don't think this PR covers things like caching intermediate results; instead it allows caching just the return values of user-facing functions that are safe to cache. Although I'm guessing that code which accesses G.__networkx_cache__ directly can use it to cache intermediate results, so maybe this (and Mridul's POC PR) does indeed introduce the mechanism for that too (see the sketch after this list)?
  • I agree that users shouldn't start making design decisions ("should I call barycenter or not?") based on the cache. I know that's ideal-world thinking, but I wonder if this is more about proposing a change to barycenter so that it doesn't modify the Graph input?
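
Hypothetically, such direct use of the cache for intermediate results might look like the sketch below; none of this is in the PR, and the key name is made up.

    # Assumes an undirected graph and a hashable-free-form cache dict on G.
    import networkx as nx

    def components_with_cache(G):
        cache = getattr(G, "__networkx_cache__", None)
        if cache is not None and "connected_components" in cache:
            return cache["connected_components"]
        components = [frozenset(c) for c in nx.connected_components(G)]
        if cache is not None:
            cache["connected_components"] = components
        return components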

@eriknw eriknw marked this pull request as draft April 1, 2024 19:18
Labels: Dispatching (Related to dispatching and backend support), type: Enhancements