Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perfomance improvement to Edge Indexing and the Edge List Reader. #741

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Relux-the-Relux
Copy link

Addressing #735 and #736.

We now save the inserted edges in a Hash table to quickly check if the edge is already in the Graph. And now we have the option of passing the argument unique if we want to forego any checks to be even faster.

Edges are now sorted before being indexed so that we can binary search instead of linear searching the neighbors later.

@Relux-the-Relux Relux-the-Relux linked an issue Apr 21, 2021 that may be closed by this pull request
@Relux-the-Relux Relux-the-Relux changed the title Perfemance improvement to Edge Indexing and The Edge List Reader. Perfomance improvement to Edge Indexing and The Edge List Reader. Apr 21, 2021
@Relux-the-Relux Relux-the-Relux linked an issue Apr 21, 2021 that may be closed by this pull request
@Relux-the-Relux Relux-the-Relux marked this pull request as draft April 21, 2021 09:53
@Relux-the-Relux
Copy link
Author

It seems that the new EdgeIndexing has a weird interaction with the compact edge in the undirected case for some reason, still unsure why.

… graph is indexed instead of before the indexing
@Relux-the-Relux
Copy link
Author

Ok, it was just because IndexEdges now also sorts the edges. Changed the test to take the edge order tight after being indexed instead of taking the original edge order.

@Relux-the-Relux Relux-the-Relux marked this pull request as ready for review April 21, 2021 13:48
@Relux-the-Relux Relux-the-Relux changed the title Perfomance improvement to Edge Indexing and The Edge List Reader. Perfomance improvement to Edge Indexing and the Edge List Reader. Apr 21, 2021
std::vector<std::pair<node, node>> outEdges;
outEdges.reserve(G.numberOfEdges());

G.forEdges([&](node u, node v) { outEdges.emplace_back(u, v); });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, there is no comparison for the copied graph and the original one anymore (see line 2171 on master-branch).

@@ -149,6 +149,10 @@ class Graph final {
*/
index indexInOutEdgeArray(node u, node v) const;

index indexSortedInInEdgeArray(node v, node u) const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(For discussion) Searching neighbors in sub-linear time is something I would assume several algorithms can benefit from. Therefore instead of implementing a separate function, better have a bookkeeping-variable about whether the neighbor-lists are sorted or not. sortEdges (or similar like #734) sets the variable to true, every other manipulation of edge-lists sets it to false. The already implemented index-functions then check for this variable and either use normal or binary-search.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is adding a flag to check it is sorted would mean slower insertion since, each time we add an edge we would then need to check if the Graph is sorted or now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically addEdge is a manipulation of one (or more) edge-lists, therefore the affected list is not anymore sorted afterwards and the variable is set to false. That is what I meant with "every other manipulation of edge-lists". But maybe there are other views on this, since this a very naive (but rather cheap) approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually had the same idea as @fabratu mentioned here. Just set the (one, global) variable to false whenever addEdge, removeEdge or any other method that manipulates edge lists is called. I assume that a lot of use cases of NetworKit involve static graphs so this should be fine. We could introduce a small optimization, though: When adding an edge, you could quickly check if the new neighbor is larger than the last neighbor (or you are adding the first neighbor). If yes, the flag doesn't need to be set to false. This, in addition to setting the flag to true for empty graphs, could allow reading sorted graphs such that the flag is true at the end. For example, many METIS graph files are actually sorted and thus you could avoid the additional sorting step.

Copy link
Contributor

@michitux michitux Apr 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more remarks regarding this new feature of a "sorted" state in the graph:

  • It should be possible for users to query if the graph is sorted as this may affect the performance of methods and algorithms.
  • The effect of sortEdges on the performance of certain methods should be clearly mentioned in the documentation of sortEdges. sortEdges should not do anything if the graph has already been sorted such that users can simply always call sortEdges in a graph processing pipeline.
  • It should be documented that reading sorted graphs may give a sorted graph. This could be done in graph readers that actually sort the graph, in addEdge and in the class documentation of the graph class where this feature should be mentioned imho.

It would also be useful to identify algorithms that would benefit from sorted edges. These are, e.g., algorithms calling weight or hasEdge on a graph. Such algorithms could issue a warning (or info) message if the graph is not sorted and their documentation should be changed to mention that they benefit from a sorted graph.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally querying whether the edges are sorted or not is not a good API either. What users should be interested in is hasSublinearEdgeQueries().

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I think that adding functions to the Graph class should require more rigorous review than other modifications to NetworKit.

I agree.

Copy link
Member

@fabratu fabratu May 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I think that adding functions to the Graph class should require more rigorous review than other modifications to NetworKit.

Agreed.

One global variable + check-function and not adding indexSortedInInEdgeArray (and such) would keep the changes to Graphminimal, if that's part of being rigorous. I think it is likely, that this change results in speed-up for several modules in the framework. If we want to thoroughly test this, maybe it is a good idea to first push for a benchmarking-tool though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally querying whether the edges are sorted or not is not a good API either. What users should be interested in is hasSublinearEdgeQueries().

I am not sure about this as the property becoming true is an immediate effect of either explicitly calling a sort function or reading a sorted graph, both of which have the user-visible effect that the iteration order of neighbors is now increasing by node id. If the user wants to know if the edges are sorted by node id, it seems like a good idea to have a simple method to query this property. Otherwise, users might start using hasSublinearEdgeQueries() to understand if the iterator will return edges in sorted order which is bad if at some point this method returns true even though edges are not sorted. We could of course additionally support checking if queries for edges are sub-linear, but then again sub-linear is a rather weak property. What if at some point we implement hash sets for neighbors and edge queries become constant time? What if those hash tables are again optional, and we now have three possibilities: 1) linear time, 2) logarithmic time due to sorting, and 3) constant time due to hash tables? Algorithms might have different trade-offs depending on the running time of edge queries. For example for triangle counting, we could reduce memory usage if there were constant-time edge queries but logarithmic time would probably be too slow to allow discarding the additional bit vector (per thread) for constant-time edge queries.

Note that Eugenio's PR adds the a function to sort EdgeLists in an arbitrary way. How will that interact with a sorted flag?

It will set the flag to false. It could make sense though to change sortEdges() to use this new function and then set the flag to true.

We should also consider the case that edge indexing is used without sorted adjacency lists. We already export the order of the adjacency lists as public API. Hence, one would expect that it is possible to have indexed adjacency lists without re-ordering the edge lists.

I think indexEdges() should only optionally trigger sorting adjacency lists or just be faster if adjacency lists are sorted. This should be easily achievable if - as discussed here - simply indexInOutEdgeArray exploits the sorted state. Adding an additional flag to indexEdges() that would trigger a call to sortEdges() before indexing the edges would be completely optional then.

With respect to benchmarking, it would be interesting to see if always using binary search is beneficial or if we should use a linear scan for nodes of small degree. My intuition would be that for small degrees a scan should be faster as it won't have as many mispredicted branches but my intuition could also be completely wrong.

In #715, @avdgrinten said that there are plans for more efficient memory allocations. I do not know what you have planned but I think it is important to ensure that the changes in this PR do not conflict with your plans.

Copy link
Contributor

@avdgrinten avdgrinten May 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if at some point we implement hash sets for neighbors and edge queries become constant time?

That's exactly the reason why I am against adding an API that requires sorting (or that sorts implicitly). There might be faster methods to implement edge queries and demanding sorting constrains us to one specific implementation.

return indexSortedInOutEdgeArray(v, u);
}

index l = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of implementing this functionality (twice), use std::lower_bound().

/** EDGE IDS **/

void Graph::indexEdges(bool force) {
if (edgesIndexed && !force)
return;

// Sort outedges and inedges so that we can binary search for them
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of basically copy the functionality of sortEdges, set edgesIndexed here to false end then call sortEdges.

Graph EdgeListReader::read(const std::string &path) {
this->mapNodeIds.clear();
MemoryMappedFile mmfile(path);
auto it = mmfile.cbegin();
auto end = mmfile.cend();
std::unordered_set<std::pair<node, node>, pairhash> insertedEdges;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of managing an additional set of edges, better as @michitux suggested, insert the edges without checks. After everything is inserted, sort each list (with either the already existing functionality or what @angriman introduced in #734) and remove duplicates. Also see here for comparison between hash-sets and vector-sort-unique: https://stackoverflow.com/questions/1041620/whats-the-most-efficient-way-to-erase-duplicates-and-sort-a-vector

@hmeyerhenke
Copy link

I don't want to interfere with technical details, but please use proper and systematic benchmarking to ensure that the changes do not hurt performance in an unacceptable way. I am more than willing to discuss what "(un)acceptable" means when benchmarking data from various use cases and dozens of graphs are available.

@avdgrinten
Copy link
Contributor

We probably want to defer this PR until the next release is out. There are too many open questions to merge it now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Graph::indexEdges() is quadratic in node degrees EdgeListReader is quadratic in node degrees
5 participants