Super Node Series One: Introduce proxy node type #4420

Draft
wants to merge 1 commit into
base: master

li-boxuan
Member

@li-boxuan li-boxuan commented Apr 29, 2024

This is a first attempt at solving a well-known problem in the graph database world: the super node. A brief introduction to the super node problem, and the status quo in JanusGraph, is documented in #2717. In a nutshell, JanusGraph has already partially addressed the traversal performance issue, but not the memory/storage issue.

Without native support for super nodes in JanusGraph (and many other graph databases), many users end up building their own workarounds. Probably the most popular approach is to create "meta" vertices, or "proxy" vertices, and let the application layer redistribute the traversals and data, which is cumbersome and error-prone.

This PR aims to introduce proxy nodes, which work like the existing partition nodes, except that proxy vertices are created only on demand. There are two basic requirements:

  1. User intervention should be as minimal as possible.
  2. This feature shall bring zero or minimal overhead when there's no super node.

Note that the existing (but discouraged) partition node (a.k.a. the vertex cut feature) addresses the 1st requirement well but performs very badly on the 2nd requirement.

It takes a few steps to fully address the 1st requirement, and this PR only addresses it partially. This PR requires users to explicitly create proxy nodes, connect them with the canonical node, and EXPLICITLY connect edges to the proxy nodes. This mostly fulfills the 2nd requirement: when there is no need, no new overhead is introduced. The drawback is that the write path is pretty cumbersome, but the good news is that the read path offers a seamless experience. Users can run normal traversal queries as if proxy nodes did not exist.

The brief design is as follows. Let's say A is a super node stored as the canonical node Va, and A connects to a number of vertices with different labels. Let's say we have a proxy node for A, Vpa. We store proxies, an array of proxy IDs, as a vertex property on the canonical node Va. Conversely, we store canonicalId, the ID of Va, as a vertex property on Vpa. Every time we need to traverse from Va, we always fetch Vpa and do the traversal from there. Let's say Va -.-.-.-> Vpa -------> Vb. When we traverse from Vb, we will find Vpa first, and then we retrieve Va because Vpa is just a proxy for Va.
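The design above can be illustrated with a small in-memory sketch. This is not the code in this PR and does not use any JanusGraph API; `ToyGraph`, `Vertex`, `add_proxy`, `out`, and `in_` are hypothetical names made up for illustration. It only shows how a `proxies` property on the canonical vertex and a `canonical_id` property on the proxy let reads in both directions hide the proxy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Vertex:
    vid: int
    # On a canonical vertex: IDs of its proxy vertices (the "proxies" property).
    proxies: List[int] = field(default_factory=list)
    # On a proxy vertex: ID of the canonical vertex (the "canonicalId" property).
    canonical_id: Optional[int] = None
    out_edges: List[Tuple[str, int]] = field(default_factory=list)  # (label, target id)
    in_edges: List[Tuple[str, int]] = field(default_factory=list)   # (label, source id)


class ToyGraph:
    """Hypothetical in-memory stand-in; not a JanusGraph API."""

    def __init__(self) -> None:
        self.vertices: Dict[int, Vertex] = {}

    def add_vertex(self, vid: int) -> Vertex:
        vertex = Vertex(vid)
        self.vertices[vid] = vertex
        return vertex

    def add_proxy(self, canonical_id: int, proxy_id: int) -> Vertex:
        # No edge is created between canonical and proxy; only properties link them.
        proxy = self.add_vertex(proxy_id)
        proxy.canonical_id = canonical_id
        self.vertices[canonical_id].proxies.append(proxy_id)
        return proxy

    def add_edge(self, src: int, label: str, dst: int) -> None:
        self.vertices[src].out_edges.append((label, dst))
        self.vertices[dst].in_edges.append((label, src))

    def canonical(self, vid: int) -> int:
        # A proxy resolves to its canonical vertex; any other vertex resolves to itself.
        vertex = self.vertices[vid]
        return vid if vertex.canonical_id is None else vertex.canonical_id

    def out(self, vid: int) -> List[int]:
        # Traverse from the vertex itself and from every proxy it owns.
        sources = [vid] + self.vertices[vid].proxies
        return [self.canonical(dst) for s in sources for _, dst in self.vertices[s].out_edges]

    def in_(self, vid: int) -> List[int]:
        sources = [vid] + self.vertices[vid].proxies
        return [self.canonical(src) for s in sources for _, src in self.vertices[s].in_edges]


graph = ToyGraph()
VA, VB, VPA = 1, 2, 100          # arbitrary example IDs
graph.add_vertex(VA)
graph.add_vertex(VB)
graph.add_proxy(VA, VPA)
graph.add_edge(VPA, "knows", VB)  # write path: the edge is attached to the proxy explicitly
```

In this sketch, `graph.out(VA)` reaches Vb through Vpa, and `graph.in_(VB)` reports Va rather than Vpa, which mirrors the read-path behaviour described above.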

TODOs in this PR:

  • Fix test
  • Support custom-type vertex ID
  • Make it configurable
  • Benchmark
  • Add doc

TODOs in subsequent PRs:

  • Add traversal strategy such that user doesn't need to explicitly connect edges to proxy nodes
  • Automatically create proxy nodes (experimental)

C.C. @dxtr-1-0 @rngcntr, who expressed interest in this project


Thank you for contributing to JanusGraph!

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there an issue associated with this PR? Is it referenced in the commit message?
  • Does your PR body contain #xyz where xyz is the issue number you are trying to resolve?
  • Has your PR been rebased against the latest commit within the target branch (typically master)?
  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you written and/or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE.txt file, including the main LICENSE.txt file in the root of this repository?
  • If applicable, have you updated the NOTICE.txt file, including the main NOTICE.txt file found in the root of this repository?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

@janusgraph-bot janusgraph-bot added the cla: external Externally-managed CLA label Apr 30, 2024
@li-boxuan
Member Author

I created this work a long time ago (before I joined my current company). It looks like I need to register before making any non-trivial contribution to open-source projects. Thus, I'll close this PR for now and get back to it once I get approval.

@li-boxuan li-boxuan closed this May 4, 2024
@li-boxuan
Member Author

Yeah, it looks like I can continue contributing to JanusGraph.

@li-boxuan li-boxuan reopened this May 10, 2024
This aims to fix the super node problem by introducing proxy
nodes (which work like the existing partition nodes), except that
we introduce proxy nodes in the application layer only when needed.
This commit focuses on giving users a seamless experience,
as if the proxy nodes did not exist.

What is not clear right now is the following. Suppose A is a super node
that connects to a number of vertices with different labels (and possibly
different properties such as timestamps), and we have a proxy node for A, Vpa.
How many edges should we create between Va and Vpa?

1) Create 0 edges between Va and its proxies. We store id(Vpa) as a vertex property
on Va. Every time we need to traverse from Va, we always fetch Vpa and
do the traversal from there. Let's say Va -.-.-.-> Vpa -------> Vb. When
we traverse from Vb, we will find Vpa first, and then we retrieve Va
because Vpa is just a proxy for Va. This is very similar, if not identical,
to the vertex-cut partition in JanusGraph.

2) Create 1 edge between Va and each Vpa. No specific benefit; the drawback is that
this edge will be returned by queries, which means we need to remove it.

3) Create 1 edge per label between Va and each Vpa. The benefit is that for queries
with a label constraint that no edge on Vpa satisfies, we don't even need to traverse
Vpa because it won't be found with this label. The drawback is also very obvious:
if Vpa ---label1---> Vb1, and Vpa ---label2---> Vb2, then when we traverse from
Vb1 to Va, we will see two edges, unless we somehow track that we used label1 to
go from Vb1 to Vpa and thus must only use label1 to go from Vpa to Va. This is
very cumbersome, so we shall not use it. This commit actually uses 3) which makes
it very difficult to pass all tests.

4) Create one proxy node for a particular edge type (label + props). Suppose edges
are all very similar except for a few properties. Then for each unique edge type,
we create a proxy node. The benefit is that we can fully utilize vertex-centric
indexes (VCI) because these edges actually represent all physical edges. The
drawback is that this is very application-specific. For example, if we only need
rundate in traversal, then one or more proxy nodes will represent a particular
rundate. If the original query is
g.V(Va).outE().has("rundate", "20210512").inV(), then we can utilize the VCI for
rundate on Va. Note that there is only 1 edge between Va and each proxy node.
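Option 4 can be sketched as follows. This is a toy model, not JanusGraph code: `CanonicalVertex`, `add_edge`, `out_with_rundate`, and the proxy naming scheme are all hypothetical, and a plain dictionary stands in for the vertex-centric index on the single edge between Va and each proxy.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class CanonicalVertex:
    """Hypothetical canonical vertex holding one proxy per (label, rundate)."""

    def __init__(self, name: str) -> None:
        self.name = name
        # (label, rundate) -> proxy id. Each entry stands in for the single
        # edge between Va and that proxy, carrying the shared edge properties.
        self.proxy_by_type: Dict[Tuple[str, str], str] = {}


# proxy id -> neighbours physically attached to that proxy
proxy_edges: Dict[str, List[str]] = defaultdict(list)


def add_edge(canonical: CanonicalVertex, label: str, rundate: str, target: str) -> None:
    # Reuse the proxy for this edge type, or create one on demand.
    key = (label, rundate)
    default_id = f"{canonical.name}-proxy-{label}-{rundate}"
    proxy = canonical.proxy_by_type.setdefault(key, default_id)
    proxy_edges[proxy].append(target)


def out_with_rundate(canonical: CanonicalVertex, rundate: str) -> List[str]:
    # Only proxies whose edge type matches the filter are visited. A real
    # index would seek to the matching entries instead of scanning them all.
    result: List[str] = []
    for (_, edge_rundate), proxy in canonical.proxy_by_type.items():
        if edge_rundate == rundate:
            result.extend(proxy_edges[proxy])
    return result


va = CanonicalVertex("Va")
add_edge(va, "ran", "20210512", "Vb1")
add_edge(va, "ran", "20210513", "Vb2")
add_edge(va, "ran", "20210512", "Vb3")
```

Here `out_with_rundate(va, "20210512")` only touches the proxy for that rundate, which is the effect the filter on Va is meant to achieve. The sketch also makes the drawback visible: the key `(label, rundate)` is hard-wired to one application's query pattern.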

Another big challenge of options 2, 3, and 4 is how to control edge traversal. If Vpa
connects to Vb1 and Vb2, we must avoid a wrong traversal from Vb1 to Vb2 via Vpa (when
the user aims to traverse from Vb1 to Va), which is difficult in practice.
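The wrong-traversal hazard can be shown with a tiny sketch of the option 2 layout. The function names and the edge list are hypothetical; nothing here is JanusGraph code.

```python
from typing import Dict, List, Tuple

# Option 2 layout: one real edge Va -> Vpa, and Vpa -> Vb1, Vpa -> Vb2.
edges: List[Tuple[str, str]] = [("Va", "Vpa"), ("Vpa", "Vb1"), ("Vpa", "Vb2")]
proxy_of: Dict[str, str] = {"Vpa": "Va"}  # proxy -> canonical


def both(vertex: str) -> List[str]:
    # Neighbours in either direction, out-neighbours first.
    outgoing = [dst for src, dst in edges if src == vertex]
    incoming = [src for src, dst in edges if dst == vertex]
    return outgoing + incoming


def naive_both(vertex: str) -> List[str]:
    # Expands a proxy like an ordinary vertex, so siblings leak into the result.
    result: List[str] = []
    for neighbour in both(vertex):
        if neighbour in proxy_of:
            result.extend(n for n in both(neighbour) if n != vertex)
        else:
            result.append(neighbour)
    return result


def proxy_aware_both(vertex: str) -> List[str]:
    # From a neighbour's side, a proxy resolves only to its canonical vertex.
    # (Traversing from the canonical vertex itself needs separate handling.)
    return [proxy_of.get(neighbour, neighbour) for neighbour in both(vertex)]
```

Starting from Vb1, `naive_both` returns both Vb2 and Va, while `proxy_aware_both` returns only Va. The difficulty noted above is that the engine has to know, at every hop, which of the two behaviours applies.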

This commit uses option (1): create proxy nodes but don't draw a connection between the canonical node and the proxy nodes.

Signed-off-by: Boxuan Li <liboxuan@connect.hku.hk>
@li-boxuan li-boxuan marked this pull request as draft May 14, 2024 07:26