Super Node Series One: Introduce proxy node type #4420

li-boxuan · 2024-04-29T07:12:37Z

This is a first attempt at solving a well-known problem in the graph database world: the super node. A brief introduction to the super node problem, and the status quo of JanusGraph, is documented in #2717. In a nutshell, JanusGraph has already partially addressed the traversal performance issue, but not the memory/storage issue.

Without native support for super nodes in JanusGraph (and many other graph databases), a lot of users end up building their own workarounds. Probably the most popular approach is to create "meta" vertices, or "proxy" vertices, and let the application layer redistribute the traversals and data, which is cumbersome and error-prone.

This PR aims to introduce proxy nodes, which work like the existing partition node, except that proxy vertices are created only on demand. There are two basic requirements:

1. User intervention should be as minimal as possible.
2. This feature shall bring zero or minimal overhead when there's no super node.

Note that the existing (but discouraged) partition node (a.k.a. the vertex cut feature) satisfies the 1st requirement well but performs very badly on the 2nd requirement.

Fully addressing the 1st requirement takes a few steps, and this PR only addresses it partially. This PR requires users to explicitly create proxy nodes, connect them with the canonical node, and EXPLICITLY connect edges to the proxy nodes. This mostly fulfills the 2nd requirement: when there's no need, no new overhead is introduced. The drawback is that the write path is pretty cumbersome, but the good news is that the read path offers a seamless experience: users can run normal traversal queries as if proxy nodes didn't exist.

The brief design is as follows: let's say A is a super node, stored as the canonical node Va, and A connects to a number of vertices with different labels. Let's say we have a proxy node for A, Vpa. We store proxies, an array of proxy IDs, as a vertex property on the canonical node Va. Conversely, we store canonicalId, the ID of Va, as a vertex property on Vpa. Every time we need to traverse from Va, we always fetch Vpa and do the traversal from there. Let's say Va -.-.-.-> Vpa -------> Vb; then when we traverse from Vb, we find Vpa first and then retrieve Va, because Vpa is just a proxy for Va.
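The layout described above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not the code in this PR: `ToyGraph`, `Vertex`, `add_proxy`, `out`, and `in_` are hypothetical names, and only the `proxies` and `canonicalId` property names come from the description. Python is used purely for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: int
    props: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)  # list of (label, target vertex id)


class ToyGraph:
    """Hypothetical toy model of the proxy layout; not the JanusGraph API."""

    def __init__(self):
        self.vertices = {}
        self._next_id = 1

    def add_vertex(self):
        v = Vertex(self._next_id)
        self.vertices[v.vid] = v
        self._next_id += 1
        return v

    def add_proxy(self, canonical):
        # Write path: link proxy and canonical node via properties only, no edge.
        proxy = self.add_vertex()
        proxy.props["canonicalId"] = canonical.vid
        canonical.props.setdefault("proxies", []).append(proxy.vid)
        return proxy

    def add_edge(self, src, label, dst):
        # Write path: the caller explicitly attaches the edge to a proxy.
        src.out_edges.append((label, dst.vid))

    def canonical_id(self, vid):
        # A proxy resolves to its canonical node; any other vertex resolves to itself.
        return self.vertices[vid].props.get("canonicalId", vid)

    def out(self, v):
        # Read path from Va: also scan the edges stored on each of its proxies.
        sources = [v.vid] + v.props.get("proxies", [])
        return [self.canonical_id(dst)
                for s in sources
                for (_, dst) in self.vertices[s].out_edges]

    def in_(self, v):
        # Read path from Vb: find Vpa first, then report Va in its place.
        return [self.canonical_id(src.vid)
                for src in self.vertices.values()
                for (_, dst) in src.out_edges
                if dst == v.vid]


g = ToyGraph()
va = g.add_vertex()        # canonical node Va
vpa = g.add_proxy(va)      # proxy node Vpa
vb = g.add_vertex()        # ordinary neighbour Vb
g.add_edge(vpa, "knows", vb)

print(g.out(va))   # neighbours of Va, reached through Vpa
print(g.in_(vb))   # Vb sees Va, never Vpa
```

In this sketch the caller never sees Vpa on the read path, while on the write path it still has to pick the proxy itself, which mirrors the trade-off described for this PR.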

TODOs in this PR:

TODOs in subsequent PRs:

Add a traversal strategy so that users don't need to explicitly connect edges to proxy nodes
Automatically create proxy nodes (experimental)

C.C. @dxtr-1-0 @rngcntr, who expressed interest in this project

Thank you for contributing to JanusGraph!

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

Is there an issue associated with this PR? Is it referenced in the commit message?
Does your PR body contain #xyz where xyz is the issue number you are trying to resolve?
Has your PR been rebased against the latest commit within the target branch (typically master)?
Is your initial contribution a single, squashed commit?

For code changes:

Have you written and/or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE.txt file, including the main LICENSE.txt file in the root of this repository?
If applicable, have you updated the NOTICE.txt file, including the main NOTICE.txt file found in the root of this repository?

For documentation related changes:

Have you ensured that format looks appropriate for the output in which it is rendered?

li-boxuan · 2024-05-04T02:09:31Z

I created this work a long time ago (before I joined my current company). It looks like I need to register before making any non-trivial contribution to open-source projects. Thus, I'll close this PR for now and get back to it once I get approval.

li-boxuan · 2024-05-10T04:28:03Z

Yeah it looks like I can continue contributing to JanusGraph.

This aims to fix the super node problem by introducing proxy nodes (which work like the existing partition node), except that we introduce proxy nodes in the application layer only when needed. This commit focuses on giving users a seamless experience, as if the proxy nodes did not exist.

What is not clear right now is the following: if A is a super node, and A connects to a number of vertices with different labels (and possibly different properties like timestamp), and we have a proxy node for A, Vpa, then how many edges should we create between Va and Vpa?

1) Create 0 edges between Va and its proxies. We store id(Vpa) as a vertex property in Va. Every time we need to traverse from Va, we always fetch Vpa and do the traversal from there. Let's say Va -.-.-.-> Vpa -------> Vb; then when we traverse from Vb, we find Vpa first and then retrieve Va, because Vpa is just a proxy for Va. This is very similar, if not identical, to vertex-cut partitioning in JanusGraph.

2) Create 1 edge between Va and each Vpa. There is no specific benefit; the drawback is that this edge will be queried, which means we need to remove it.

3) Create 1 edge per label between Va and each Vpa. The benefit is that for queries with a label constraint that does not apply to Vpa, we don't even need to traverse Vpa, because it won't be found with that label. The drawback is also obvious: if Vpa ---label1---> Vb1 and Vpa ---label2---> Vb2, then when we traverse from Vb1 to Va we will see two edges, unless we somehow track that we used label1 to go from Vb1 to Vpa and therefore use only label1 to go from Vpa to Va. This is very cumbersome, so we shall not use it. This commit actually uses 3), which makes it very difficult to pass all tests.

4) Create one proxy node for a particular edge type (label + props). Suppose edges are all very similar except for a few properties. Then for each unique edge type, we create a proxy node. The benefit is that we can fully utilize VCI, because these edges actually represent all physical edges. The drawback is that this is very application-specific. For example, if we only need rundate in traversal, then one or more proxy nodes will represent a particular rundate. If the original query is g.V(Va).outE().has("rundate", "20210512").inV(), then we can utilize VCI for rundate on Va. Note that there is only 1 edge between Va and each proxy node.

Another big challenge of 2, 3, and 4 is how to control edge traversal. If Vpa connects to Vb1 and Vb2, we must avoid a wrong traversal from Vb1 to Vb2 via Vpa (when the user aims to traverse from Vb1 to Va), which is difficult in practice.

This commit uses option (1): Create proxy nodes but don't draw a connection between the canonical node and the proxy node

Signed-off-by: Boxuan Li <liboxuan@connect.hku.hk>

janusgraph-bot added the cla: external Externally-managed CLA label Apr 30, 2024

li-boxuan closed this May 4, 2024

li-boxuan reopened this May 10, 2024

li-boxuan force-pushed the super-node-proxy branch from b7f9ef8 to aa9791f Compare May 14, 2024 07:26

li-boxuan marked this pull request as draft May 14, 2024 07:26
