ENH: Cache graphs objects when converting to a backend #7345
Conversation
Thanks! This is great. I'll work on an example to demonstrate the speedup when multiple algo calls are made using the same Graph. It should be significant based on benchmark comparisons with and without conversion cost.
How do y'all like this warning message?
In the dispatching meeting today, we proposed...
I've made a pass through. I hope my comments are clear. Please ask for clarification or push back against suggestions as desired. :}
```diff
@@ -146,7 +146,7 @@ def test_modularity_communities_directed_weighted():

     # A large weight of the edge (2, 6) causes 6 to change group, even if it shares
     # only one connection with the new group and 3 with the old one.
-    G[2][6]["weight"] = 20
+    G.add_edge(2, 6, weight=20)
```
This indicates that caching would not work for this test. Did you anticipate that, or was it the result of trying the test in a cached environment that led to this test change?
I'm reluctant to switch all of our testing to use code that won't break caching.
Perhaps we could do it both ways -- a test that caching works (using `add_edge`) and a test that changing directly gives the wrong answer when using a backend with caching turned on. What do you think?
What breaks if we leave this test as is?
Nothing breaks if I leave this as is, because we disable caching when testing backends. But I may sometimes experiment with enabling caching while testing, and if we can get it to work, it may be a nice option to add for testing (but probably not, because of the maintenance burden).
Up to you. I can revert these changes to the tests. I don't think there would be that many more changes to be cache-friendly though, so I thought "might as well".
How about, for now, revert and add a line like `G.__networkx_cache__.clear()`?
That signals that this kind of change would break caching. And when we get to testing caching in the tests, we can replace it with a test that caching would break and add the `add_edge` version that doesn't break.
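A minimal sketch of the two test styles under discussion (not the actual test code; it assumes a NetworkX version whose graphs carry the `__networkx_cache__` dict added in this PR, and guards the attribute lookup so it also runs on older versions):

```python
# Hedged sketch: direct item assignment bypasses Graph methods, so a
# caching backend would not be invalidated; clearing `__networkx_cache__`
# by hand signals that the mutation would otherwise break caching.
import networkx as nx

G = nx.complete_graph(8)  # complete graph guarantees edge (2, 6) exists

# Style 1: mutate the edge data directly, then clear the cache manually.
G[2][6]["weight"] = 20
cache = getattr(G, "__networkx_cache__", None)  # absent on old versions
if cache is not None:
    cache.clear()

# Style 2 (cache-friendly): add_edge updates the attribute through the
# Graph API, which is a mutation the caching machinery can see.
G.add_edge(2, 6, weight=20)
assert G[2][6]["weight"] == 20
```

Both styles leave the graph in the same state; they differ only in whether a backend's cached conversion can go stale.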
```python
if attr is not None:
    G.__networkx_cache__.clear()
```
I think we should leave out this barycenter cache clear. But I'm willing to leave it in for this PR.
I'd like to change barycenter to avoid the problem. When we make that change we can take this line out again.
OK, sounds good. I'd like to keep it in for now for maximum safety, but yeah, I'd prefer to change the API so we can remove this. We can look toward more optimizations in the future.
```diff
@@ -94,6 +94,7 @@ def bidirectional_bfs():
             path.append(u)
             flow_value += augment(path)

+    R.__networkx_cache__.clear()
```
I'm thinking that this function (`edmonds_karp_core`) should not be dispatchable. It is a helper function for the dispatchable function `edmonds_karp`. If it was written today it would be a private function. I'm fine with making it a private function -- but it hasn't been (inertia and very unlikely backwards compatibility).
The other function in this module is `edmonds_karp_impl`, which is also a helper function. Why is one helper function decorated as dispatchable and the other not? Can we pull the dispatchable from `edmonds_karp_core`? Would this mess you up in some way? Would it be better for me to make it a private function then?
(I guess I'm saying that this function should not be part of the NetworkX API in the sense of having backends implement it.)
Sure, we can make it not dispatchable. When I initially added the dispatch decorator, I cast a pretty wide net, so if it had a graph argument and its name didn't start with an underscore, I probably decorated it. We should probably remove it from more of these "supposed to be private" functions.
I think it probably would be good to remove the dispatchable decorator from the helper functions that should be private. But it's tricky to determine which are helper functions. There are unfortunately quite a few of those in this library, so it makes sense to cast a wide net and go back to filter them out.
networkx/utils/backends.py
```python
    use_cache
    and (cache := getattr(graph, "__networkx_cache__", None)) is not None
):
    cache = cache.setdefault("backends", {}).setdefault(backend_name, {})
```
This line doesn't do what I think you intend here (`cache` is an empty dict after this line). When I proposed improving the `setdefault` function 15 years ago to CPython, the reviewer responded that `setdefault` probably shouldn't be used anymore -- it doesn't serve any function not provided elsewhere, and it's slow, confusing, and causes surprises. They couldn't remove it for backward compat reasons. So... I was surprised it didn't disappear in Python 3. But in any case, this is more evidence that we should just ban `setdefault` from the library. :)
```python
if "backends" not in cache:
    cache["backends"] = {}
if backend_name not in cache:
    cache[backend_name] = {}
```
Or, if you prefer:
```python
cache["backends"] = cache.get("backends", {})
cache[backend_name] = cache.get(backend_name, {})
```
Fun story, thanks for sharing :)
I'll say the same thing back to you -- I don't think that code does what you intend. It's not what I intend.
My code is as I intend, and it works (I would have given a demo on Thursday if we had time). Consider this example:
```python
>>> d = {}
>>> y = d.setdefault('x', {}).setdefault('y', {})
>>> d
{'x': {'y': {}}}
>>> y
{}
>>> d.setdefault('x', {}).setdefault('y', {}) is y
True
```
I don't use `setdefault` very often, but sometimes it's the right pattern to use, and I think this is one of those times. Heh, this is pretty much the only pattern it's good for. It's also gotten love since 15 years ago -- it's faster, and it's now atomic (this is important to me).
I should probably add a comment that this creates nested dicts, such as:
```python
# Using `setdefault` creates nested dicts if they don't already exist.
# After calling setdefault, `cache` could be gotten like this:
# `cache = graph.__networkx_cache__["backends"][backend_name]`
```
Wow -- the resulting `cache` dict is the inner dict -- not the outer dict. Not what I thought this code was doing. And it all makes very good sense. And I'll take back my comment about never using `setdefault`.
I think the code would be easier (at least for me) to read with different names for the two `cache` objects.
I also had to pause to re-read the if-clause here -- the "walrus" makes the code compact but harder for me to read (it does two things, so I read it twice). I guess it saves a `hasattr` call.
What do you think about something like this code:
```python
if use_cache and hasattr(graph, "__networkx_cache__"):
    nx_cache = getattr(graph, "__networkx_cache__")
    # make nested {"backends": {backend_name: cache}} structure if not there.
    cache = nx_cache.setdefault("backends", {}).setdefault(backend_name, {})
```
Introducing a name like `nx_cache` is the part of this I think is really helpful. I think two names and the comment with a nested dict literal are sufficient for me to be able to read this two years from now.
Thanks for taking a close look @dschult. I really appreciate suggestions that help improve clarity. I've made the changes locally. I'm also paying off some technical debt, so it may be a day or two before I can update the PR. For example, I'm changing...
I guess we should add a sentence to the warning message about how to handle cases where they do want to mutate the graph directly. All they need to do is clear the cache. So perhaps add a final sentence like:
Thanks @eriknw ! These changes look good to me.
I do have a question about how the backends are envisioned for adjacency matrix info.
Q: When there are more than one edge attribute, there is more than one adjacency matrix to consider. Do we expect the backend graph structure to cache all the edge attribute matrices at once? I was thinking that each edge attribute would lead to a different backend representation (my grounding example being a sparse matrix representation). But reading through the latest commit I am beginning to think we are designing this to clear all edge-attribute matrices when an unrelated edge attribute is changed.
How do nx-cugraph and nx-graphblas represent the graph? Does it include all edge attributes or only one at a time?
nx-cugraph stores all edge attributes. It has a dict of arrays of values, and a dict of arrays of masks if necessary. The GraphBLAS backend uses a GraphBLAS Matrix to represent the graph. This can only handle a single attribute today, which can lead to multiple cached graphs. How about I add a new function such as...
I like the idea of a function to clear the cache so if we change how we want to handle it, the changes in code will be less widely spread. Should that be a method on the Graph? That'd keep the function close to the data it is manipulating. But I'm open to a function if there's an advantage to that over a method.
Good question. I prefer to use a function for now to see if we can keep it generic enough to operate on any backend graph or object that has... Similarly, a reason I don't assume graph objects have...
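As a rough illustration of the function-over-method argument (the helper name here is hypothetical; per the later commits, `nx._clear_cache` is what the PR ultimately uses), a plain function can operate on any object that happens to carry a `__networkx_cache__` dict, including backend graph classes that don't subclass `nx.Graph`, and can no-op gracefully on objects without one:

```python
def clear_networkx_cache(G):
    """Clear G's dispatch cache if present; silently no-op otherwise.

    Hypothetical helper: a None cache is treated as "caching disabled".
    """
    cache = getattr(G, "__networkx_cache__", None)
    if cache is not None:
        cache.clear()

class FakeBackendGraph:
    """Stand-in for a backend graph type unrelated to nx.Graph."""
    def __init__(self):
        self.__networkx_cache__ = {"backends": {"some_backend": {"k": "stale"}}}

bg = FakeBackendGraph()
clear_networkx_cache(bg)
assert bg.__networkx_cache__ == {}

clear_networkx_cache(object())  # no cache attribute: harmless no-op
```

A Graph method could only do this for types that inherit it; a free function works for any duck-typed cache carrier.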
I approve this PR. I'm sure there will be more changes as we move forward. But this is a good step!
FYI, here is a tree view of all dispatched functions. Do any jump out at you that obviously shouldn't be dispatched?
Dispatched NetworkX functions (dispatch name in parentheses)
Thanks! I can't wait to see this feature get merged. I didn't notice anything wrong but I did have a few questions.
networkx/utils/backends.py
```python
"You may also use `G.__networkx_cache__.clear()` to "
"manually clear the cache, or set `G.__networkx_cache__` "
"to None to disable caching for G. Enable or disable "
```
Do we want to advertise direct access to the dunder? I wonder if it would be better to support manually clearing the cache later with a new API, or even just `nx._clear_cache()`.
Dan suggested this here: #7345 (comment)
Would it be better to suggest `nx._clear_cache(G)` instead? Should we make it not private?
I do like that better, but I don't feel strongly either way. I'm also thinking we can keep it named with the sunder since it's easier to go private->public later.
```python
    graph_name=graph_name,
)
if use_cache and nx_cache is not None:
    # Remove old cached items that are no longer necessary since they
```
I think I lost track of how this works; can there be multiple converted graphs cached for each backend? I was under the impression the cache contains only the last converted graph for each backend, and the logic above was to support returning a cached graph that's potentially a superset (attribute-wise) of the graph needed. If that's not the case, I'm just wondering if this could use an unexpectedly large amount of memory (thinking of nx-cugraph instances and GPU memory in particular).
I think we are running into a problem of expectations. Each of us (Rick, Erik, Dan and Mridul) has different ideas of what the cache is supposed to store. We're conflating storing graph objects, storing function results, and storing partial results that could be used for future computations. While it might be possible to provide all of these with a single caching mechanism, it is also possible that it'd be better to provide separate mechanisms for the different types of caching.

As for caching converted graphs, the difficulty as I understand it is that some backends -- like cugraph -- convert all the node and edge attributes into the new structure, while other backends -- like graphblas -- only convert one edge attribute. So some will naturally only want one version of the graph cached, while others will want an option for more than one cached representation. Yet others will likely want to cache partially computed results.

I'd like our caching mechanism to be cross-backend friendly. That is, one backend should not interfere with another's cache. Then each backend is responsible for updating its cached objects.

I guess I'm envisioning a system where NetworkX provides a way for the backends to mark each cached object as needing to be cleared when certain NetworkX graph mutations are made. Then the backends don't directly clear the whole cache: they clear "their" stored objects as needed, and NetworkX clears cached objects from any backend when graph mutations are made. Which cached objects? The ones that have been tagged as being cleared for that mutation. For example: we could have a dict of cache keys to be cleared when nodes are added/removed, another when edges are added/removed. I'm not sure how to handle the cases when node/edge attributes are changed.

I think the current setup tries to clear the whole cache when any mutation is made. And that might be the right implementation. But we will be losing the opportunity to cache e.g. a graphblas matrix representation for attribute "edge_length" when we update the attribute "edge_capacity". Is this a big deal? I don't know. :) But I guess the first check of our design could be "does it work well for cugraph, graphblas and an imagined scipy_sparse backend"?
Quick reactions:
I agree with this goal. This has been a principle driving the design of dispatching. And I believe this PR does.
Currently, yes, for two important reasons:
I like the idea of selectively clearing the cache based on what kind of mutation is made, and this PR doesn't prevent us from doing that in the future. If we tried to add selective cache clearing now, I worry it would either delay 3.3 or not get in 3.3. I would rather get this PR in 3.3, then give us plenty of time to discuss improving caching.
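The tagged-invalidation idea discussed above could be sketched roughly like this (a minimal sketch only; the function names, mutation names, and cache layout are all hypothetical, not the PR's actual design):

```python
def register(cache, key, value, *, invalidate_on):
    """Store a backend object and tag it with the mutations that stale it."""
    cache.setdefault("objects", {})[key] = value
    tags = cache.setdefault("invalidate", {})
    for mutation in invalidate_on:
        tags.setdefault(mutation, set()).add(key)

def notify(cache, mutation):
    """Called after a graph mutation: drop only the entries tagged for it."""
    for key in cache.get("invalidate", {}).get(mutation, set()):
        cache.get("objects", {}).pop(key, None)

cache = {}
# A graphblas-style backend caches one matrix per edge attribute:
register(cache, ("adjacency", "edge_length"), "matrix-A",
         invalidate_on=("edges_changed",))
register(cache, ("adjacency", "edge_capacity"), "matrix-B",
         invalidate_on=("edges_changed", "edge_data_changed"))

# Changing edge data drops only the capacity matrix; the length matrix
# survives, unlike a clear-everything-on-any-mutation policy.
notify(cache, "edge_data_changed")
assert ("adjacency", "edge_length") in cache["objects"]
assert ("adjacency", "edge_capacity") not in cache["objects"]
```

The point is only that per-mutation tags let unrelated cached representations survive, at the cost of backends having to declare their invalidation rules.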
I agree with all those comments @eriknw. I have been (hopefully) more productive too. I went through your list of dispatchable functions (~4 comments up). Here are the ones I think are really just helper functions. Two groups of such are:
To be clear, I'm not saying these need to have dispatching removed. I'm saying that these are functions I flagged as not needing dispatching because they either duplicate another function (e.g. ...). My list of functions (including the module info on the previous line):
I have approved this PR, and my recent postings only continue the conversation. They don't add anything needed for this PR. It sounds like @rlratzel looked through this fairly closely too.
Latest changes LGTM, thanks! The remaining questions from my prior review were mainly for my own understanding and need not hold up my approval.
@rossbar can you take a look at this PR? Some attendees of GTC have requested more info about the caching data shown in Rick and Mridul's talk there, so it'd be good to get this in. @jarrodmillman once this is approved and merged, can we make another rc (or a full release -- though we probably need to get the numpy issues working for the full release; if that starts taking a while, more than 2 more weeks or so, maybe we can push those into 3.3.1 or something).
Nothing blocking here for me! I do want to revisit the backends.py file later, we should probably explain the arch somewhere.
- Enable caching to be disabled by setting `__networkx_cache__` to None (or delete it).
- Improve warning to say how to manually clear cache.
- Update some `_dispatchable` decorators that were discovered to mutate data.
- Make the residual Graph an implementation detail and not dispatchable.
- Improve dispatch tests that mutate input (but more still needs done).
Force-pushed from e6c7fe5 to 7210b01.
Hmm, rebasing the PR also didn't fix the benchmark unknown commit error. I'll have a look at the benchmark bits later. Merging this in for now! Thanks everyone!!
* Minimal PR to add cache dict to graphs and clear them
* Cache backend graph conversions
* Don't use cache when testing backends
* Better caching
* fix typo
* Don't be silly; be optimistic. Also, update comments
* Add warning when using a cached value
* Make caching more robust
  - Enable caching to be disabled by setting `__networkx_cache__` to None (or delete it).
  - Improve warning to say how to manually clear cache.
  - Update some `_dispatchable` decorators that were discovered to mutate data.
  - Make the residual Graph an implementation detail and not dispatchable.
  - Improve dispatch tests that mutate input (but more still needs done).
* Add config to control caching
* Use `nx._clear_cache` to clear the cache
* Add note about config being global
* Disable cache for `maximum_branching` internal graph
* DRY
This builds off of #7344.
Caching backend graph conversions can provide a huge performance (and usability) benefit to users. It comes at the cost of increased memory, so it would be nice to be able to control caching via config (#7225).
CC @rlratzel