Performance improvement to Edge Indexing and the Edge List Reader. #741

Relux-the-Relux · 2021-04-21T09:39:42Z

Addressing #735 and #736.

We now store the inserted edges in a hash table so that we can quickly check whether an edge is already in the graph. There is also a new option: passing the argument unique skips these checks entirely, which is even faster.
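A minimal sketch of the hash-table check described above, assuming a stand-alone edge list rather than the actual reader. The names `PairHash` and `collectEdges` and the `directed` parameter are hypothetical; only the `unique` argument is taken from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <utility>
#include <vector>

using node = std::uint64_t;
using Edge = std::pair<node, node>;

// Hypothetical hash for node pairs (the PR's own `pairhash` is not shown here).
struct PairHash {
    std::size_t operator()(const Edge &e) const noexcept {
        const std::size_t h1 = std::hash<node>{}(e.first);
        const std::size_t h2 = std::hash<node>{}(e.second);
        // Boost-style hash combine of the two endpoint hashes.
        return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
    }
};

// Collects edges and drops duplicates, unless the caller promises that the
// input is duplicate-free (`unique == true`); then no hashing is done at all.
std::vector<Edge> collectEdges(const std::vector<Edge> &input, bool directed, bool unique) {
    std::vector<Edge> result;
    result.reserve(input.size());
    std::unordered_set<Edge, PairHash> seen;
    for (Edge e : input) {
        if (!unique) {
            Edge key = e;
            // In the undirected case {u, v} and {v, u} are the same edge.
            if (!directed && key.first > key.second)
                std::swap(key.first, key.second);
            if (!seen.insert(key).second)
                continue; // edge was already inserted
        }
        result.push_back(e);
    }
    // Sanity check: deduplication can only shrink the edge list.
    assert(result.size() <= input.size());
    return result;
}
```

Each membership test is expected constant time, so reading m edges costs expected O(m) instead of scanning a neighbor list per insertion.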

Edges are now sorted before being indexed so that we can binary search instead of linear searching the neighbors later.

Relux-the-Relux · 2021-04-21T11:16:40Z

It seems that the new edge indexing interacts oddly with the compactEdges test in the undirected case; I am still unsure why.

Relux-the-Relux · 2021-04-21T13:48:03Z

Ok, it was just because indexEdges now also sorts the edges. I changed the test to take the edge order right after the graph is indexed instead of taking the original edge order.

fabratu · 2021-04-22T14:10:36Z

networkit/cpp/graph/test/GraphGTest.cpp

+    std::vector<std::pair<node, node>> outEdges;
+    outEdges.reserve(G.numberOfEdges());
+
+    G.forEdges([&](node u, node v) { outEdges.emplace_back(u, v); });


With this change, the copied graph is no longer compared with the original one (see line 2171 on the master branch).

fabratu · 2021-04-23T12:37:07Z

include/networkit/graph/Graph.hpp

@@ -149,6 +149,10 @@ class Graph final {
     */
    index indexInOutEdgeArray(node u, node v) const;

+    index indexSortedInInEdgeArray(node v, node u) const;


(For discussion) I would assume that several algorithms can benefit from searching neighbors in sub-linear time. Therefore, instead of implementing a separate function, it would be better to keep a bookkeeping variable that records whether the neighbor lists are sorted. sortEdges (or something similar, like #734) sets the variable to true; every other manipulation of edge-lists sets it to false. The already implemented index functions then check this variable and use either a linear or a binary search.

The problem is that adding a flag to track whether the graph is sorted would mean slower insertion, since each time we add an edge we would need to check whether the graph is sorted or not.

Basically, addEdge is a manipulation of one (or more) edge lists, so the affected list is no longer sorted afterwards and the variable is set to false. That is what I meant by "every other manipulation of edge-lists". But maybe there are other views on this, since this is a very naive (but rather cheap) approach.

I actually had the same idea as @fabratu mentioned here. Just set the (one, global) variable to false whenever addEdge, removeEdge or any other method that manipulates edge lists is called. I assume that a lot of use cases of NetworKit involve static graphs so this should be fine. We could introduce a small optimization, though: When adding an edge, you could quickly check if the new neighbor is larger than the last neighbor (or you are adding the first neighbor). If yes, the flag doesn't need to be set to false. This, in addition to setting the flag to true for empty graphs, could allow reading sorted graphs such that the flag is true at the end. For example, many METIS graph files are actually sorted and thus you could avoid the additional sorting step.

Some more remarks regarding this new feature of a "sorted" state in the graph:

It should be possible for users to query if the graph is sorted as this may affect the performance of methods and algorithms.

The effect of sortEdges on the performance of certain methods should be clearly mentioned in the documentation of sortEdges. sortEdges should not do anything if the graph has already been sorted such that users can simply always call sortEdges in a graph processing pipeline.

It should be documented that reading sorted graphs may give a sorted graph. This could be done in graph readers that actually sort the graph, in addEdge and in the class documentation of the graph class where this feature should be mentioned imho.

It would also be useful to identify algorithms that would benefit from sorted edges. These are, e.g., algorithms calling weight or hasEdge on a graph. Such algorithms could issue a warning (or info) message if the graph is not sorted and their documentation should be changed to mention that they benefit from a sorted graph.
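The bookkeeping discussed above could look roughly like the following sketch. It assumes a toy directed adjacency structure, not NetworKit's Graph class; `SortedFlagGraph` and `isSorted` are hypothetical names, and `addEdge`, `sortEdges` and `hasEdge` only mirror the method names used in the discussion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using node = std::uint64_t;

// Toy directed adjacency structure with one global "sorted" flag.
class SortedFlagGraph {
    std::vector<std::vector<node>> adj;
    bool sorted = true; // an empty graph is trivially sorted

public:
    explicit SortedFlagGraph(std::size_t n) : adj(n) {}

    void addEdge(node u, node v) {
        assert(u < adj.size() && v < adj.size());
        // Appending a neighbor that is not smaller than the current last one
        // keeps the list sorted, so the flag only drops in the other case.
        if (!adj[u].empty() && v < adj[u].back())
            sorted = false;
        adj[u].push_back(v);
    }

    void sortEdges() {
        if (sorted)
            return; // no-op, so pipelines can always call sortEdges()
        for (auto &list : adj)
            std::sort(list.begin(), list.end());
        sorted = true;
    }

    bool isSorted() const { return sorted; }

    bool hasEdge(node u, node v) const {
        assert(u < adj.size());
        const auto &list = adj[u];
        if (sorted) // logarithmic query on sorted lists
            return std::binary_search(list.begin(), list.end(), v);
        return std::find(list.begin(), list.end(), v) != list.end();
    }
};
```

With this scheme, a reader that inserts edges in increasing neighbor order ends up with the flag still true, so no extra sorting pass is needed.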

Finally querying whether the edges are sorted or not is not a good API either. What users should be interested in is hasSublinearEdgeQueries().

In general, I think that adding functions to the Graph class should require more rigorous review than other modifications to NetworKit.

I agree.

> In general, I think that adding functions to the Graph class should require more rigorous review than other modifications to NetworKit.

Agreed.

One global variable plus a check function, without adding indexSortedInInEdgeArray (and the like), would keep the changes to Graph minimal, if that is part of being rigorous. I think it is likely that this change results in a speed-up for several modules in the framework. If we want to test this thoroughly, though, it may be a good idea to first push for a benchmarking tool.

> Finally querying whether the edges are sorted or not is not a good API either. What users should be interested in is hasSublinearEdgeQueries().

I am not sure about this as the property becoming true is an immediate effect of either explicitly calling a sort function or reading a sorted graph, both of which have the user-visible effect that the iteration order of neighbors is now increasing by node id. If the user wants to know if the edges are sorted by node id, it seems like a good idea to have a simple method to query this property. Otherwise, users might start using hasSublinearEdgeQueries() to understand if the iterator will return edges in sorted order which is bad if at some point this method returns true even though edges are not sorted. We could of course additionally support checking if queries for edges are sub-linear, but then again sub-linear is a rather weak property. What if at some point we implement hash sets for neighbors and edge queries become constant time? What if those hash tables are again optional, and we now have three possibilities: 1) linear time, 2) logarithmic time due to sorting, and 3) constant time due to hash tables? Algorithms might have different trade-offs depending on the running time of edge queries. For example for triangle counting, we could reduce memory usage if there were constant-time edge queries but logarithmic time would probably be too slow to allow discarding the additional bit vector (per thread) for constant-time edge queries.

Note that Eugenio's PR adds a function to sort edge lists in an arbitrary way. How will that interact with a sorted flag?

It will set the flag to false. It could make sense though to change sortEdges() to use this new function and then set the flag to true.

We should also consider the case that edge indexing is used without sorted adjacency lists. We already export the order of the adjacency lists as public API. Hence, one would expect that it is possible to have indexed adjacency lists without re-ordering the edge lists.

I think indexEdges() should only optionally trigger sorting adjacency lists or just be faster if adjacency lists are sorted. This should be easily achievable if - as discussed here - simply indexInOutEdgeArray exploits the sorted state. Adding an additional flag to indexEdges() that would trigger a call to sortEdges() before indexing the edges would be completely optional then.

With respect to benchmarking, it would be interesting to see if always using binary search is beneficial or if we should use a linear scan for nodes of small degree. My intuition would be that for small degrees a scan should be faster as it won't have as many mispredicted branches but my intuition could also be completely wrong.
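One way to set up such a comparison is a lookup that switches strategy by degree, sketched below. The function name `findNeighbor` and the cutoff `linearScanThreshold` are assumptions for illustration; the value 16 is a placeholder, not a measured result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using node = std::uint64_t;
constexpr std::size_t none = std::numeric_limits<std::size_t>::max();

// Hypothetical cutoff; a sensible value would have to come from benchmarks.
constexpr std::size_t linearScanThreshold = 16;

// Returns the position of v in a sorted neighbor list, or `none` if absent.
std::size_t findNeighbor(const std::vector<node> &sortedNeighbors, node v) {
    // Precondition check (linear time, compiled out when NDEBUG is defined).
    assert(std::is_sorted(sortedNeighbors.begin(), sortedNeighbors.end()));

    const auto first = sortedNeighbors.begin();
    const auto last = sortedNeighbors.end();

    if (sortedNeighbors.size() <= linearScanThreshold) {
        // Small degree: plain scan over a short, contiguous range.
        const auto it = std::find(first, last, v);
        return it == last ? none : static_cast<std::size_t>(it - first);
    }

    // Large degree: binary search on the sorted list.
    const auto it = std::lower_bound(first, last, v);
    if (it == last || *it != v)
        return none;
    return static_cast<std::size_t>(it - first);
}
```

Sweeping the threshold in a benchmark (0 for always binary search, a very large value for always scanning) would show whether the intuition about small degrees holds.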

In #715, @avdgrinten said that there are plans for more efficient memory allocations. I do not know what you have planned but I think it is important to ensure that the changes in this PR do not conflict with your plans.

> What if at some point we implement hash sets for neighbors and edge queries become constant time?

That's exactly the reason why I am against adding an API that requires sorting (or that sorts implicitly). There might be faster methods to implement edge queries and demanding sorting constrains us to one specific implementation.

fabratu · 2021-04-23T12:38:17Z

networkit/cpp/graph/Graph.cpp

+        return indexSortedInOutEdgeArray(v, u);
+    }
+
+    index l = 0;


Instead of implementing this functionality (twice), use std::lower_bound().
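As a sketch of this suggestion, a lookup based on std::lower_bound could look as follows. It assumes the neighbor list is a sorted std::vector; `indexInSortedNeighbors` and the sentinel `none` are illustrative names, not the PR's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using node = std::uint64_t;
constexpr std::size_t none = std::numeric_limits<std::size_t>::max();

// Returns the position of v in the sorted list `neighbors`, or `none`.
std::size_t indexInSortedNeighbors(const std::vector<node> &neighbors, node v) {
    // Precondition check (linear time, compiled out when NDEBUG is defined).
    assert(std::is_sorted(neighbors.begin(), neighbors.end()));

    // lower_bound returns the first element that is not less than v.
    const auto it = std::lower_bound(neighbors.begin(), neighbors.end(), v);
    if (it == neighbors.end() || *it != v)
        return none; // v is not a neighbor
    return static_cast<std::size_t>(it - neighbors.begin());
}
```

The same helper can serve both the out-edge and the in-edge array, which avoids writing the binary search twice.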

fabratu · 2021-04-23T12:39:57Z

networkit/cpp/graph/Graph.cpp

 /** EDGE IDS **/

 void Graph::indexEdges(bool force) {
    if (edgesIndexed && !force)
        return;

+    // Sort outedges and inedges so that we can binary search for them


Instead of basically copying the functionality of sortEdges, set edgesIndexed to false here and then call sortEdges.

fabratu · 2021-04-23T13:06:55Z

networkit/cpp/io/EdgeListReader.cpp

 Graph EdgeListReader::read(const std::string &path) {
    this->mapNodeIds.clear();
    MemoryMappedFile mmfile(path);
    auto it = mmfile.cbegin();
    auto end = mmfile.cend();
+    std::unordered_set<std::pair<node, node>, pairhash> insertedEdges;


Instead of managing an additional set of edges, it would be better, as @michitux suggested, to insert the edges without checks. After everything is inserted, sort each list (with either the already existing functionality or what @angriman introduced in #734) and remove duplicates. For a comparison between hash sets and vector sort-unique, also see: https://stackoverflow.com/questions/1041620/whats-the-most-efficient-way-to-erase-duplicates-and-sort-a-vector
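The insert-then-deduplicate approach can be sketched with plain vectors as below. This assumes directed out-neighbor lists only; `buildAdjacency` and `removeDuplicateNeighbors` are hypothetical helpers, not functions from NetworKit or #734.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using node = std::uint64_t;

// Sorts every neighbor list and erases repeated neighbors in place.
void removeDuplicateNeighbors(std::vector<std::vector<node>> &adj) {
    for (auto &list : adj) {
        std::sort(list.begin(), list.end());
        // std::unique moves duplicates to the tail; erase drops that tail.
        list.erase(std::unique(list.begin(), list.end()), list.end());
    }
}

// Inserts all edges without any duplicate check, then cleans up once.
std::vector<std::vector<node>>
buildAdjacency(std::size_t n, const std::vector<std::pair<node, node>> &edges) {
    std::vector<std::vector<node>> adj(n);
    for (const auto &e : edges) {
        assert(e.first < n && e.second < n);
        adj[e.first].push_back(e.second); // no membership test here
    }
    removeDuplicateNeighbors(adj);
    return adj;
}
```

A side effect is that every list ends up sorted, so the result is also ready for binary-search lookups.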

hmeyerhenke · 2021-04-26T07:15:13Z

I don't want to interfere with technical details, but please use proper and systematic benchmarking to ensure that the changes do not hurt performance in an unacceptable way. I am more than willing to discuss what "(un)acceptable" means when benchmarking data from various use cases and dozens of graphs are available.

avdgrinten · 2021-06-17T10:12:25Z

We probably want to defer this PR until the next release is out. There are too many open questions to merge it now.

peterlqa added 3 commits April 15, 2021 13:03

binary search

87c740b

improved reader

ac4d8ee

fixed bug in the edge binary search

4affe39

Relux-the-Relux linked an issue Apr 21, 2021 that may be closed by this pull request

Graph::indexEdges() is quadratic in node degrees #736

Open

Relux-the-Relux changed the title ~~Perfemance improvement to Edge Indexing and The Edge List Reader.~~ Perfomance improvement to Edge Indexing and The Edge List Reader. Apr 21, 2021

Relux-the-Relux linked an issue Apr 21, 2021 that may be closed by this pull request

EdgeListReader is quadratic in node degrees #735

Open

Relux-the-Relux added the performance label Apr 21, 2021

Relux-the-Relux marked this pull request as draft April 21, 2021 09:53

fix bugs when reading directed edges

30f7a4d

Relux-the-Relux force-pushed the feature/quadratic-edges branch from 9a3b81b to 30f7a4d Compare April 21, 2021 11:06

Fix on the compactEdges test by taking the node order right after the…

2ef237b

… graph is indexed instead of before the indexing

Relux-the-Relux marked this pull request as ready for review April 21, 2021 13:48

Relux-the-Relux changed the title ~~Perfomance improvement to Edge Indexing and The Edge List Reader.~~ Perfomance improvement to Edge Indexing and the Edge List Reader. Apr 21, 2021

fabratu requested changes Apr 23, 2021

fabratu added the stagnant label Aug 9, 2022

Relux-the-Relux commented Apr 21, 2021

Relux-the-Relux commented Apr 21, 2021

Relux-the-Relux commented Apr 21, 2021

fabratu Apr 22, 2021

fabratu Apr 23, 2021

Relux-the-Relux Apr 23, 2021

fabratu Apr 23, 2021

michitux Apr 23, 2021

michitux Apr 23, 2021 •

edited

avdgrinten May 3, 2021

hmeyerhenke May 3, 2021

fabratu May 4, 2021 •

edited

michitux May 4, 2021

avdgrinten May 5, 2021 •

edited

fabratu Apr 23, 2021

fabratu Apr 23, 2021

fabratu Apr 23, 2021

hmeyerhenke commented Apr 26, 2021

avdgrinten commented Jun 17, 2021
