
Performance on large graphs #135

Open
KeremKurban opened this issue Dec 30, 2023 · 3 comments
@KeremKurban

Hello, I have a huge graph that I want to calculate motifs from, but for years I have been struggling to find a scalable library for motif counting. How does DotMotif scale? Is there a Big-O complexity analysis for motifs, with and without participation?

@j6k4m8 j6k4m8 pinned this issue Jan 2, 2024
@j6k4m8
Member

j6k4m8 commented Jan 2, 2024

Hi @KeremKurban! DotMotif uses GrandIso to run most graph queries, though it can also use Neo4j for even larger graphs, if you have a database running. (Or you can use our built-in graph importer to run a container locally!)

That means that our speed is mostly a function of the speed of GrandIso (which is very similar to the VF2 algorithm). We discuss GrandIso a bit more in this paper and have some benchmark figures comparing us to the built-in NetworkX subgraph monomorphism search:

[Benchmark figure: GrandIso vs. NetworkX subgraph monomorphism runtimes]

In general, we tend to be way faster than NetworkX, but still somewhere between O(V²) and O(V! V) (the time complexity for best- and worst-case subgraph isomorphism search).
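For reference, the NetworkX baseline mentioned above can be run directly (a minimal sketch on toy data; the host graph and motif here are made up for illustration):

```python
import networkx as nx
from networkx.algorithms.isomorphism import DiGraphMatcher

# Toy host graph: a directed triangle plus one extra edge.
host = nx.DiGraph([(0, 1), (1, 2), (2, 0), (0, 3)])

# Motif to search for: a directed 3-cycle (A -> B -> C -> A).
motif = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "A")])

# DiGraphMatcher(G1, G2) searches for copies of G2 inside G1;
# monomorphisms allow extra host edges among the matched nodes.
matcher = DiGraphMatcher(host, motif)
matches = list(matcher.subgraph_monomorphisms_iter())

# The directed triangle matches in its 3 rotations.
print(len(matches))  # 3
```

This exhaustive search is what the benchmark figures compare against; GrandIso returns equivalent matches but prunes the candidate space more aggressively.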

Interestingly, I've tried in the past (and failed!) to parallelize the GrandIso code; it always ends up running a little slower than the unparallelized version. I can't wrap my head around how to get past the memory-sharing barriers. Thoughts welcome here if you want to hack on it.


Somewhere I have some Neo4j benchmark figures as well, but for small graphs (1000s of vertices, 10,000s of edges) Python in-memory with GrandIso tends to be faster. For millions of edges, I've been able to get some motifs to run on Neo4j, but it's really a function of how much memory you allocate to the database. In some extreme cases we've allocated around 500GB of RAM to Neo4j and to GrandIso and still saw moderate performance benefits from GrandIso, but I would imagine Neo4j has improved since I ran these tests.


For enormous graphs, you probably want to use something like this, which uses serverless cloud infrastructure to scale way up and search for motifs in extreme parallel. (If you do this, I would LOVE to know how it goes; please get in touch so we can chat more! We've done this only in very limited cases and never for benchmarking due to the cost.)

I imagine that you could mock a lot of these technologies locally to get better parallelized performance; but in my experience, this really doesn't pay off until you get to supercomputer scale.


Finally, I will of course mention that speed completely depends on the motif you're searching for. Searching for something simple,

A -> B

will always be very fast. Searching for something more complex, like a triangle,

A -> B
B -> C
C -> A

will take much longer, but should still be fast. Adding constraints using DotMotif,

A -> B
B -> C
C -> A as LastEdge

LastEdge.weight < 10
A.mass > 1900

will make things way faster, because DotMotif+GrandIso is smart enough to downselect where it's searching to satisfy the most specific attributes first.
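One way to picture that downselection (a hand-rolled sketch in plain NetworkX; this is not DotMotif's actual implementation, just the idea): filter candidates by the most specific attributes first, and only run the expensive matching over the survivors.

```python
import networkx as nx

# Toy attributed graph, loosely mirroring the constrained motif above.
G = nx.DiGraph()
G.add_node("n1", mass=2000)
G.add_node("n2", mass=100)
G.add_node("n3", mass=100)
G.add_edge("n1", "n2", weight=5)
G.add_edge("n2", "n3", weight=5)
G.add_edge("n3", "n1", weight=5)

# Only nodes satisfying A.mass > 1900 can play the role of A...
candidates_for_A = [
    n for n, attrs in G.nodes(data=True) if attrs.get("mass", 0) > 1900
]

# ...and only edges with weight < 10 can play the role of LastEdge.
candidates_for_last_edge = [
    (u, v) for u, v, attrs in G.edges(data=True)
    if attrs.get("weight", float("inf")) < 10
]

print(candidates_for_A)  # ['n1']
```

Every attribute constraint shrinks one of these candidate pools before the combinatorial search even starts, which is why highly-attributed motifs run faster.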

But some motifs, like paths or stars,

# Path
A -> B
B -> C
C -> D
# Star
A -> S1
A -> S2
A -> S3
A -> S4

will always be very slow, purely due to the huge number of permutations possible.
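To see why, count the ways the four-leaf star above can land on a single hub (a back-of-the-envelope illustration, not a benchmark): a hub with out-degree d admits d·(d−1)·(d−2)·(d−3) ordered assignments of its neighbors to S1…S4.

```python
from math import perm

def star_matches(d: int, leaves: int = 4) -> int:
    """Ordered ways to assign `leaves` distinct neighbors of a
    hub with out-degree d to the star's leaf slots."""
    return perm(d, leaves)

print(star_matches(10))   # 5040
print(star_matches(100))  # 94109400
```

With no attributes to prune on, the search has to enumerate every one of those assignments, so dense hubs dominate the runtime.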

In general, a good rule of thumb is that compact, high-density, highly-attributed motifs run quickly just about everywhere, and low-density, attribute-free motifs run slowly just about everywhere (even in very small graphs).

@j6k4m8 j6k4m8 self-assigned this Jan 2, 2024
@j6k4m8 j6k4m8 unpinned this issue Jan 2, 2024
@KeremKurban
Author

Thanks for the detailed explanation @j6k4m8. I sampled 6k neurons from our in-silico circuit and ran triplet motifs within an hour. Once I've finished writing adapters for SONATA circuits and analyzed the results, I can check out grandiso-cloud to compare performance and see if I can help.

To use the node and edge features, is it OK to just add them while building the NetworkX object?

@j6k4m8
Member

j6k4m8 commented Jan 4, 2024

That's awesome! There might even be faster ways to get triplet results but 1 hour seems pretty reasonable.

It might be interesting (and perhaps even easier!) to add SONATA adapters to Grand, since that will give you a NetworkX-like interface if you write the IO to SONATA API calls. Happy to discuss more if you like; this also gets you Cypher support more broadly.

For GrandIso matches, yes that's right -- you can just add attributes directly to your motif!
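For instance (node names and attribute keys here are invented for illustration), attaching attributes while building the NetworkX graph looks like:

```python
import networkx as nx

G = nx.DiGraph()
# Node attributes ride along as keyword arguments to add_node...
G.add_node("cell_1", mass=2100.0, cell_type="pyramidal")
G.add_node("cell_2", mass=850.0, cell_type="interneuron")
# ...and edge attributes as keyword arguments to add_edge.
G.add_edge("cell_1", "cell_2", weight=4.2)

print(G.nodes["cell_1"]["mass"])        # 2100.0
print(G["cell_1"]["cell_2"]["weight"])  # 4.2
```

Any attribute stored this way can then be referenced by name in motif constraints (e.g. `A.mass > 1900`).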
