Finding the largest common monomorphic subgraph #34

Open
bheijden opened this issue Feb 23, 2023 · 4 comments

Comments

@bheijden

Hi,

I am looking for an algorithm that can find the largest common monomorphic subgraph between two graphs. Specifically, I have two graphs G1 and G2, and I would like to identify the largest subgraph (may be non-unique) of G2 that is monomorphic to a subgraph of G1.

I am wondering whether find_motifs_iter could be used in an iterative method to find an approximation of the largest subgraph of G2 that is monomorphic to a subgraph of G1. I suppose that somewhere in this method, I may be able to extract the largest considered candidate subgraph that failed to be extended to a full graph monomorphism.

Any ideas on where to start with this?

Thanks!

@j6k4m8 (Member) commented Feb 24, 2023

@bheijden this is a super neat use of grandiso, and I imagine it might be very helpful indeed, though it may require some modifications.

The first thing that comes to mind is that find_motifs_iter — as you note — is designed to find perfect matches. For that reason, it's optimized to fail as quickly as possible if it encounters a non-perfect match. You will find that the choice of the start of the candidate motif will dramatically change how much of the graph matches.

For example, if you have a graph A⟷B⟷C⟷A and want to match it to graph A⟷B⟷C, then consider the following cases (a quick sketch of these two graphs follows the list):

  • The match starts with node A and jumps to node B. Your largest-common subgraph will be all of ABC.
  • The match starts with node A and jumps to node C. Your largest-common subgraph will fail early and just be A.
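A minimal sketch of these two graphs, assuming undirected networkx graphs for the ⟷ edges:

import networkx as nx
from grandiso import find_motifs

motif = nx.Graph([("A", "B"), ("B", "C"), ("C", "A")])  # A⟷B⟷C⟷A
host = nx.Graph([("A", "B"), ("B", "C")])               # A⟷B⟷C

# An exact search finds nothing (the triangle's C⟷A edge has nowhere
# to go in the path); the bullets above are about how far a partial
# candidate gets before that failure, depending on where the walk starts.
print(len(find_motifs(motif, host)))  # 0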

One way around this is to "seed" your search with the hint argument to find_motifs_iter, which will selectively start the match at specific nodes or with specific subgraphs already matched. But you will probably also have to cleverly deploy the interestingness argument to "steer" the matching!
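For instance, a hedged sketch of the API shape (here I flip the example, searching the path motif inside the triangle host so that matches exist; I'm assuming hints takes a list of partial motif→host mappings and interestingness a per-motif-node score, so double-check the installed signature):

import networkx as nx
from grandiso import find_motifs_iter

motif = nx.Graph([("A", "B"), ("B", "C")])             # path
host = nx.Graph([("A", "B"), ("B", "C"), ("C", "A")])  # triangle

for match in find_motifs_iter(
    motif,
    host,
    hints=[{"A": "A", "B": "B"}],  # seed: motif A,B already mapped to host A,B
    interestingness={"A": 1.0, "B": 0.9, "C": 0.1},  # steer: expand A, then B, first
):
    print(match)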

Would be happy to discuss more if you think this looks doable!

@bheijden (Author) commented Feb 25, 2023

Thanks for your detailed answer!

First off, I wanted to double-check that the steps described in Algorithm 1 of the DotMotif paper match the steps taken in grandiso.

Also, what would you say are the main differences from VF2?

Below, I describe my current approach. I am interested to hear your thoughts (and perhaps optimizations I could take).

I created a new function called find_largest_motifs that keeps track of the largest evaluated candidate (possibly multiple with the same number of nodes). Then, I added a max_evals argument so that the largest evaluated candidate is returned once max_evals candidates have been tested. In other words, it will return in three cases (a rough sketch follows the list):

  • Complete monomorphism found.
  • max_evals reached.
  • All candidates checked.
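Roughly, the control flow is (a hypothetical sketch; _iter_candidates stands in for the internal candidate loop and is not a grandiso function):

def find_largest_motifs(motif, host, max_evals=None):
    largest, largest_size, evals = [], 0, 0
    for candidate in _iter_candidates(motif, host):  # hypothetical candidate iterator
        evals += 1
        if len(candidate) == len(motif):
            return [candidate]                       # case 1: complete monomorphism
        if len(candidate) > largest_size:
            largest, largest_size = [candidate], len(candidate)
        elif len(candidate) == largest_size:
            largest.append(candidate)
        if max_evals is not None and evals >= max_evals:
            break                                    # case 2: max_evals reached
    return largest                                   # case 3: all candidates checked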

To address your starting-point problem for a graph A⟷B⟷C⟷A and motif A⟷B⟷C, I now add all possible starting points to the candidates queue in the very first evaluation (in get_next_backbone_candidates). I believe the original implementation only queues the most interesting one that structurally matches, right? Do you think that this approach of multi-starting from various starting positions will result in every candidate largest subgraph monomorphism being considered? Or is your original algorithm still cutting certain search paths short because they may not result in a full monomorphism, even though they may result in the largest subgraph monomorphism?

With this approach, I noticed that the queuing order no longer results in a depth-first search, because all starting nodes are added as candidates to the queue, and new candidates are enqueued after them. Ideally, I want to do a depth-first search, so that I find the largest candidate as fast as possible (as I may bump into max_evals before having evaluated all possibilities). I therefore switched to Deque(policy=QueuePolicy.BREADTHFIRST), which mitigates this to some extent, but I feel the search is still slower than it could be (it's now neither depth- nor breadth-first). Ideally, I need to enqueue all possible starting points at the back of the queue, and all new candidates that result from evaluating a candidate at the front. It seems you cannot switch the queuing policy right now, but that may be something I can add as well? What do you think?
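A minimal sketch of that hybrid policy (put/get/empty match the queue interface; put_start is a hypothetical extra hook for enqueuing starting points):

from collections import deque

class HybridQueue:
    def __init__(self):
        self._dq = deque()

    def put_start(self, candidate):
        self._dq.append(candidate)      # starting points go to the back

    def put(self, candidate):
        self._dq.appendleft(candidate)  # expanded candidates go to the front

    def get(self):
        return self._dq.popleft()

    def empty(self):
        return len(self._dq) == 0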

Another inefficiency of the implementation above is that two different matches may start from different nodes but quickly lead to proposing the same candidates. I could build in a routine that checks whether a candidate has already been checked before, but this may, in itself, be costly. Is this something that can be implemented efficiently with lru_cache? I am not familiar with this Python object.

I also found that find_motifs_iter fails when dealing with disconnected motifs (or hosts, I don't remember...), because _nodes_with_greatest_backbone_count was empty, so next_node = max(_nodes_with_greatest_backbone_count) raised ValueError: max() arg is an empty sequence. I now check whether _nodes_with_greatest_backbone_count is empty, and if so, return a set of new candidates, where each candidate is the backbone plus one node that is not yet in the backbone (ordered by interestingness). I would really like your take on this one, because I fear it completely ruins performance... Perhaps I can take a different approach, by first identifying all disconnected subgraphs in the motif?
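In code, the fallback looks roughly like this (a sketch with a hypothetical helper name; backbone is the partial motif→host mapping):

def _candidates_for_disconnected_motif(backbone, motif, host, interestingness):
    # No unmapped motif node touches the backbone (disconnected motif),
    # so every unmapped node is a valid next node, ordered by interestingness.
    remaining = sorted(
        (n for n in motif.nodes if n not in backbone),
        key=lambda n: interestingness.get(n, 0.0),
        reverse=True,
    )
    return [
        {**backbone, n: host_node}
        for n in remaining
        for host_node in host.nodes
        if host_node not in backbone.values()
    ]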

@j6k4m8 (Member) commented Feb 27, 2023

Tried to address all your comments inline —

First off, I wanted to double check that the steps described in algorithm 1 of the DotMotif paper match the steps taken in grandiso.

That's right, although the grandiso codebase has diverged slightly in favor of performance optimizations that are not reflected in the DotMotif paper. The main examples of this are different queue strategies (which it looks like you're familiar with now!) and some memoization, which makes a huge difference when it can be used.

I created a new function called find_largest_motifs that keeps track of the largest evaluated candidate (possibly multiple with the same number of nodes). Then, I added a max_evals argument so that the largest evaluated candidate is returned once max_evals candidates have been tested. In other words, it will return in three cases...

That sounds exactly right, as long as "largest motif match" means "largest exact subgraph match". My strategy would be—

I now add all possible starting points to the candidates queue in the very first evaluation (in get_next_backbone_candidates)

—ah yes; exactly that!

Do you think that this approach of multi-starting from various starting positions will result in every candidate largest subgraph monomorphism being considered? Or is your original algorithm still cutting certain search paths short because they may not result in a full monomorphism, even though they may result in the largest subgraph monomorphism?

Frankly, I'm not 100% sure, but I am relatively convinced that this should work as you expect, with one exception: I "short-circuit" the search and quit early when I see a structural mismatch (similar to the structural matching done by VF and others). A structural mismatch here just means that the degree of the host vertex is smaller than the degree of the motif vertex (though my recollection is that a VF-derived implementation adds some clever checks on top of this. Oh well — to wit, I only implement the degree check for now!).

This is done by this function:

def _is_node_structural_match(
    motif_node_id: str, host_node_id: str, motif: nx.Graph, host: nx.Graph
) -> bool: ...

...which you can override in the top-level call to find_motifs here. I.e., you could override this with a function that always returns True, which will avoid the short-circuit problem. My suspicion is that this gets you closer to your desired behavior!
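Concretely, something like this sketch (kwarg name per the signature above; toy graphs for illustration):

import networkx as nx
from grandiso import find_motifs

motif = nx.Graph([("A", "B"), ("B", "C"), ("C", "A")])
host = nx.Graph([("A", "B"), ("B", "C")])

# Never prune a branch for structural reasons; every candidate gets
# explored (and can therefore be recorded as a partial match).
results = find_motifs(
    motif,
    host,
    is_node_structural_match=lambda motif_node, host_node, motif_g, host_g: True,
)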

It seems you cannot switch the queuing policy right now, but that may be something I can add as well? What do you think?

I would love this!! The queue module of this library is a bit neglected, and I have a feeling that clever queuing would really improve the performance profile, even in the exact-match case. (I think that the right enqueue/dequeue strategy would lower the memory footprint considerably...) In the short term, you could try specifying your own queue implementation here in the call to find_motifs, and then writing an implementation that meets the interface in queues — specifically, the put, get, and empty methods. The put and get methods can be as clever as you want, even including not being a queue at all but a prioritized LUT or something. Building a smarter queue feels to me like a very powerful aspect of grandiso that I've totally ignored :) Would love to chat more about this if you're feeling inspired to write.

I've written the library so that you don't need to edit the grandiso core to bring your own queue; you should be able to do something like

from grandiso import find_motifs_iter

class MyCleverQueue(...):
    def put(...): ...
    def get(...): ...
    def empty(...): ...

find_motifs_iter(
    ...,
    queue=MyCleverQueue(),
)

To be clear: I'd love to merge in some smarter queue logic into the library; and ALSO, you might be able to get away with not having to dig into grandiso guts to make it work!

I am also totally happy to redo some of the core code to accommodate a more advanced replacement for the current queues if you think it's necessary, though I'm sensitive to keeping the exact-match performance as close to (or better than) its current perf as possible when making the change. And the stdlib queue implementation is VERY fast; so even my attempts to parallelize execution failed because they led to slower wallclock time. (So far, three engineers have tried to parallelize grandiso, each with the end result of making it slightly slower; so I am very motivated to figure this out, but I am also a cautious realist about it :))

Another inefficiency of the implementation above is that two different matches may start from different nodes but quickly lead to proposing the same candidates. I could build in a routine that checks whether a candidate has already been checked before, but this may, in itself, be costly. Is this something that can be implemented efficiently with lru_cache? I am not familiar with this Python object.

I would do exactly this — lru_cache is a super powerful tool that replaces the function call with a lookup of the results from the last time the func was called with the exact same args; I think that if you are checking the same candidates more than twice, this would be a huge performance benefit. But it's worth trying with and without! The only addition is the @lru_cache decorator (you can also use @cache from the same functools module if you don't want to cap the cache size).
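A minimal illustration (check_candidate is a hypothetical stand-in for whatever expensive check you memoize):

from functools import lru_cache

@lru_cache(maxsize=None)
def check_candidate(candidate_key):
    # Imagine an expensive structural check here; lru_cache memoizes
    # the result per (hashable) argument.
    return len(candidate_key) > 0

key = frozenset({("motif_a", "host_1"), ("motif_b", "host_2")})
check_candidate(key)  # computed on the first call
check_candidate(key)  # served from the cache on the second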

I also found that find_motifs_iter fails when dealing with disconnected motifs (or hosts, I don't remember...) ... Perhaps I can take a different approach, by first identifying all disconnected subgraphs in the motif?

Ah yes — this is another one of my long-standing gotchas!

The main reason this doesn't work out of the box (which is kinda deliberate, but lazy) is that there are some pretty clear performance benefits to searching many times on smaller motifs than once on a big (disconnected) motif; AND there's a HUGE result-count explosion if you allow disconnected motifs (because you count ALL mappings of the second submotif for every valid mapping of the first submotif). Three disconnected motif pieces yield O(count³) results, and so on. Huge numbers! And tons of wasted compute.

So my proposal would be to either count each component individually, or figure out a way to "conditionally" connect them so that grandiso can walk across the complete motif for the purposes of the search... OR maybe there's a smarter third option that I'm not thinking of?
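The first option might look something like this sketch (assuming an undirected motif, and ignoring host-node reuse across components):

import networkx as nx
from grandiso import find_motifs

def count_per_component(motif: nx.Graph, host: nx.Graph) -> int:
    # Search each connected motif component separately and multiply
    # the counts; this is exactly the blow-up described above.
    total = 1
    for nodes in nx.connected_components(motif):
        component = motif.subgraph(nodes)
        total *= len(find_motifs(component, host))
    return total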

@bheijden (Author) commented Mar 10, 2023

Thank you for the long and detailed response! Greatly appreciated :). I'll follow that with an apology for my late reply (I was on holiday the last two weeks)!

Below are my comments, inlined:

you could override this (i.e. _is_node_structural_match) with a function that always returns True, which will avoid the short-circuit problem. My suspicion is that this gets you closer to your desired behavior!

Smart! I hadn't thought of this yet. Even though a node may itself not lead to a perfect match (hence the short-circuit approach if you are looking for perfect matches), it may actually be part of the largest common subgraph! Thank you for this pointer! I'm going to test this. I only hope that it won't blow up the number of candidates to be evaluated.

Building a smarter queue feels to me like a very powerful aspect of grandiso that I've totally ignored :) Would love to chat more about this if you're feeling inspired to write.

Smart queuing is an optimization rabbit hole with so many possibilities, and it may be hard to find a smart queue strategy that generally works for any graph. One could sort the queue based on interestingness, but that sort may itself be computationally expensive enough to negate any performance it gains. Simple queue strategies where you can, for example, add at the front or back are computationally cheap and may be a simple way to favor promising candidates over unlikely ones.

I think that if you are checking the same candidates more than twice, this (i.e. lru_cache) would be a huge performance benefit.

From what I understand, lru_cache memoizes whether the callable has already been called with the same arguments. One challenge I've not looked into yet is how it checks whether an argument is the same. In this case, the argument would be the candidate mapping (a Dict[motif str, host str]), which is likely ordered differently every time (as the starting point of the search is different). Unpacking two similar but differently ordered candidate mappings into a callable decorated with @lru_cache will probably be viewed as two different calls, as per here:

"Distinct argument patterns may be considered to be distinct calls with separate cache entries. For example, f(a=1, b=2) and f(b=2, a=1) differ in their keyword argument order and may have two separate cache entries."

Perhaps not unpacking the candidates, but providing the mapping as a single candidates argument (similar to the current implementation) may work, as long as two differently ordered candidates result in the same hash value, as per here.
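E.g. a quick check that an order-insensitive key works (assuming a frozenset of the mapping's items is an acceptable cache key):

candidate_a = {"A": "x", "B": "y"}
candidate_b = {"B": "y", "A": "x"}  # same mapping, different insertion order

key_a = frozenset(candidate_a.items())
key_b = frozenset(candidate_b.items())
assert key_a == key_b and hash(key_a) == hash(key_b)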

is that there are some pretty clear performance benefits to searching many times on smaller motifs than once on a big (disconnected) motif;...

Hmm, but to find the largest subgraph match, you have to consider all these different disconnected motifs and combinations thereof simultaneously, right? If you consider them separately, you may use the same host-graph node in the mappings of two disconnected components of the motif. I do see your problem of the count explosion, though...

In my specific case, the host graph is always connected (a single graph), complete multi-partite, and directed acyclic. Perhaps there are some optimizations to be had by exploiting these characteristics of the host?

Also, I was wondering what your take is on using maximum-clique-finding algorithms for finding the largest subgraph isomorphisms (e.g. here)? Are (modifications of) these algorithms useful for finding monomorphic subgraphs as well? I can find quite some literature on largest subgraph isomorphism, but much less on the (less strict?) monomorphic variant. And finally, grandiso seems to take a different approach; why?

Btw, I'm probably ab/misusing all the graph-theoretic definitions. I'll apologize for that upfront (not my field :P).
