Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A faster way to add edges and build the network in Python API? #108

Open
ulfaslak opened this issue Mar 2, 2020 · 12 comments
Open

A faster way to add edges and build the network in Python API? #108

ulfaslak opened this issue Mar 2, 2020 · 12 comments

Comments

@ulfaslak
Copy link
Contributor

ulfaslak commented Mar 2, 2020

When clustering big networks using the Python API, I often find that the process of creating the network (adding links by for looping over my edge list and inserting them into the network using im.add_link or im.addLink) takes more time than actually running Infomap.

I assume it is slow because for every im.addLink method call, we send data to the _infomap C++ module. The im.add_links method doesn't help much, it basically just loops over add_link. There is a method of _infomap called StateNetwork_addLink which is called when addLink is called. Ignorant about the workings of the C++ core, could there not be a method called StateNetwork_addLinks which let us pass e.g. a list of lists or a numpy array with edges into the _infomap object in C++ such as to add a bulk of edges more efficiently?

For the data I'm working with right now this would give about a 2x speedup.

@danieledler
Copy link
Contributor

Great suggestion to support a list of lists or a numpy array directly to C++. It doesn't seem trivial so we will comeback to this later!

@antoneri antoneri added the python label Mar 3, 2020
@ulfaslak
Copy link
Contributor Author

No stress, but looking into your commit from 27 days ago, @antoneri, it seems you have made some progress on this. Me and colleagues are using Infomap as a component in infostop to do stop location detection on human mobility data (COVID-19 project, assessing mobility and social distancing). It involves clustering some pretty big dense networks with Infomap and this single step is a huge bottleneck (at least a 50% slowdown). So if there's any chance you have time to work on this in near future, I would be really grateful.

@antoneri
Copy link
Member

Thanks for reminding us @ulfaslak!

@antoneri
Copy link
Member

Can you give an example of your intended usage @ulfaslak?

@ulfaslak
Copy link
Contributor Author

Sure. I'm using Infomap for spatial clustering in my "Infostop" module. Repository here.

I have a function infomap_communities, that takes as input a list of arrays node_idx_neighbors which is essentially a sparse adjacency matrix (node_idx_neighbors[n] returns an array of indices of nodes that n is adjacent to). To cluster this network I have to run a double for loop first where I insert the links into an infomap.Infomap object. Simplified, it looks like this:

for node, neighbors in enumerate(node_idx_neighbors):
    for neighbor in neighbors[neighbors > node]:
        network.addLink(node, neighbor, 1)

Check the implementation here.

While it would be awesome if there was a way to bulk-add this sparse adjacency matrix object, I understand that it's not super general and therefore probably not a sensible thing to implement. BUT what I do think should exist is an addLinks method that takes as input a numpy array (or a list of lists) formatted like:

[[node_i, node_j, weight_{ij}],
 [node_k, node_l, weight_{kl}],
 ...
]

If that existed, I could probably do some numpy optimization magic to fast convert my sparse adjacency matrix into an edgelist like that and then bulk-add my edges to create an Infomap object in no time, saving me (according to my own profiling) roughly 50% runtime.

@antoneri antoneri added the stale label Jul 8, 2021
@antoneri antoneri closed this as completed Jul 8, 2021
@ulfaslak
Copy link
Contributor Author

:(

@danieledler
Copy link
Contributor

danieledler commented Jul 12, 2021

Sorry I have forgotten this, the add_links method was probably added after this issue, it's just a way to hide one loop from the user, running

for link in links:
    self.add_link(*link)

That would work on a 2d numpy array too, and maybe be good enough if you can export an edge list as you mentioned @ulfaslak?

@ulfaslak
Copy link
Contributor Author

Hi Daniel! add_links did in fact exist in this form when I raised this. The issue, specifically, is that links are added to the graph one at a time in a Python loop, which is much slower than if they were added one at a time in a C++ loop.

@antoneri antoneri reopened this Jun 6, 2022
@xiangyh9988
Copy link

xiangyh9988 commented Aug 8, 2022

Hi Daniel! add_links did in fact exist in this form when I raised this. The issue, specifically, is that links are added to the graph one at a time in a Python loop, which is much slower than if they were added one at a time in a C++ loop.

UPDATED:
Oh, I'm sorry. I just notice that there has been an official implementation in #113. Because I didn't find the official addLinks function in the repo I cloned (i.e. v1.9.0), I added the C++ loop manually. Well, since this has been implemented officially, just ignore what I write here.


Hi ulfaslak! I met the same question and try to implement adding link one at a time in a C++ loop by just simply adding a new interface in C++ source codes and recompile them. I test the modified function on the graph with 62627641 links. On my server, the default add_links in a Python loop costs around 278 seconds while the modified addLinks in a C++ loop costs around 129 seconds. I'm not sure whether this is a good solution. What I did are as follows:

  1. Define and implement addLinks function in Infomap.h, StateNetwork.h and StateNetwork.cpp.
    // In Infomap.h
    void addLinks(std::vector<unsigned int> sourceIds, std::vector<unsigned int> targetIds, std::vector<double> weights) { m_network.addLinks(sourceIds, targetIds, weights); }
    // In StateNetwork.h
    bool addLinks(std::vector<unsigned int> sourceIds, std::vector<unsigned int> targetIds, std::vector<double> weights);
    // In StateNetwork.cpp, the flag is not important, just keep up with the return value of addLink
    bool StateNetwork::addLinks(std::vector<unsigned int> sourceIds, std::vector<unsigned int> targetIds, std::vector<double> weights) 
    {
      bool flag = false;
      for (int i = 0; i < sourceIds.size(); ++ i) {
        flag = addLink(sourceIds[i], targetIds[i], weights[i]);
      }
      return flag;
    }
  2. Recompile the infomap (I'm not sure could I call that recompiling because I'm not very good at C++) in the infomap directory.
    make python
    
  3. Copy the new infomap.py and _infomap.cpython-37m-x86_64-linux-gnu.so to the site-package directory in the environment. (The name of .so file might be different due to the platform and python version). Then the im.addLinks can work on arguments of list or numpy array, e.g. im.addLinks(sourceIds, targetIds, weights).

That's what I do to accelerate add links. Correct errors, if any. :)
Because I'm not expert in C++, I can just use a simple C++ loop to accelerate this process. I think the better improvement could be achieved by improving the StateNetwork::addLink function, but I have no idea. Hope for more help here.

@ulfaslak
Copy link
Contributor Author

ulfaslak commented Aug 9, 2022

Hi @xiangyh9988! This is nice. Thanks for reporting the execution times! Looking closer at StateNetwork::addLink I think this is actually a very sensible implementation. You should make a pull request with these changes 🚀!

@xiangyh9988
Copy link

Hi @xiangyh9988! This is nice. Thanks for reporting the execution times! Looking closer at StateNetwork::addLink I think this is actually a very sensible implementation. You should make a pull request with these changes 🚀!

Thanks for your affirmation. At first, I didn't notice that there had been an official implementation with C++ loop in #103, so I try to implement these codes by myself. Since they have finished that, it seems unnecessary to make a pull request now.

@tianyunzqs
Copy link

According to this steps, 3127126 links cost time as follow:
add_links(origin): 253287 ms
addLinks(Recompile): 247731 ms

infomap: 2.6.1
OS: Ubuntu 16.04.4 LTS
python: 3.7.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Roadmap
  
To do
Development

Successfully merging a pull request may close this issue.

5 participants