Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Nebula Graph #48

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Support Nebula Graph #48

wants to merge 1 commit into from

Conversation

wey-gu
Copy link

@wey-gu wey-gu commented May 13, 2022

Add backend support of Nebula Graph, an Open Source, distributed Graph Database.

Signed-off-by: wey-gu <weyl.gu@gmail.com>
@mgorsk1
Copy link
Contributor

mgorsk1 commented May 16, 2022

Thanks for this RFC @wey-gu

After reading through this I got impression that this proposal is similar to adding RDS or Atlas integration. These have their pluses but serious downside is keeping feature coverage aligned across different backends which has not been great so far.

From what I've got from this + reading through Amundsen PR of actual implementation maybe there is opportunity here that would just mean refactor of neo4j code which would enable us to use same components (but differently configured) with both neo4j and nebula (as they both speak opencypher)? This way nebula integration would always mean exact feature coverage as neo4j and be a nice oss alternative to it.

@wey-gu
Copy link
Author

wey-gu commented May 17, 2022

Thanks @mgorsk1 for your time to look into the proposal!

Indeed, I also had seen yet another backend storage increases the burden introducing new features during the implementation of the reference PR for the proposal, and I just told myself to keep eye on all PRs after it's merged and lift it from my own efforts then.

While, as you pointed, it doesn't scale at all, and it in big chance is a good opportunity to make cypher-based backend with some level of abstractions to share codes when possible.

I will take this context and purpose in mind and see what could be done on the refactor.

There are some challenges that nebula only support OpenCypher as a dialect and reusing query string itself isn't directly possible(see here), while the mindset to per each read functions are similar, thus, find a way to decouple cypher-speaking DB implementation from code to configurations looks possible(and worth it).

Thanks.

@wey-gu
Copy link
Author

wey-gu commented May 24, 2022

Dear @mgorsk1,

After I revisit everything, I think the refactor to best reuse cypher based GDB code is to break the functions in a CypherAbstract proxy class into (one or more pairs)steps of: "querying GDB" and "postprocessing results", with the query_string/template being placed in a separate python file(default as the current neo4j) as variables of the function.

I put some details as follow. What do you think, please? If it's good to go, I could then prepare for a PoC implementation.

Thanks!

Also a appendix in the end on the main differences between nebula and neo4j regarding cypher read queries and driver

CypherAbstract Proxy

  • Neo4j Proxy and Nebula Proxy will inheritance it, with their own drivers, and other unique things

Something will be different compared to those in the current Neo4j proxy:

  • Read Query String/Template will be a variable of functions to enable more cypher-dialect-speaking backend proxy

    • we will put cypher query templates in separate files
    • we will put current neo4j cypher queries as default, with some format changes to make it easier to be lifted to nebula(with a translator ideally), when the translator isn't enough, override of the query string could be done, too
  • Break ReadQuery functions into multiple steps to enable more reuse/override of different proxies

    • step of executinge query
      • could be simply one or more self._execute_cypher_query() or some reusable self._exec_col_query() ones
    • step of postprocessing results
      • this new function enables the most of the execution query parts to be reused

appendix


Major diffs for Query String:

  • key in properties vs. only in WHERE id() clause
    key: $resource_key --> where id(n) == "foo"
- MATCH (f:foo {key: "foo_100"}) RETURN f
+ MATCH (f:foo) WHERE id(f) == "foo_100" RETURN f
  • RETURN/WITH key vs id()
    foo.key --> id(foo)
- MATCH (f:foo) RETURN f.key
+ MATCH (f:foo) RETURN id(f)
  • The equal sign in WHERE Clause
    a = b --> a == b
- MATCH (f:foo) WHERE 1 = 1 RETURN f
+ MATCH (f:foo) WHERE 1 == 1 RETURN f
  • Prop for vertex need to explicitly provide tag/label name
    table.name --> table.Table.name
- MATCH (f:foo) RETURN f.name
+ MATCH (f:foo) RETURN f.foo.name
  • There are keywords to be escaped with "`"
- MATCH (user:User) RETURN user
+ MATCH (`user`:`User`) RETURN `user`

Major diffs in Result

  • single() is not supported in Nebula
  • record['key_name'] vs recordget_value_by_key('key_name')
  • record['col']['col_type'] vs record.get_value_by_key('col').as_node().properties("Column")['col_type'] or record['row'][i_col]['Column.col_type'] *
    - * note nebula supports execute() and execute_json(), in PoC PR, execute_json() was used, which may not be a good idea

@wey-gu
Copy link
Author

wey-gu commented Jun 7, 2022

Dear @mgorsk1, what do you think, please?

PS. in case the refactoring is not feasible, I will try my best to maintain the nebula backend in my free time, I really like Amundsen and am willing to contribute more and more.

@mgorsk1
Copy link
Contributor

mgorsk1 commented Jun 8, 2022

Hey @wey-gu apologies for late reply & thanks for very detailed and thorough explanation. It's very much appreciated to have such eager community member 🙇

I do think it'd be really useful to have Nebula as an alternative especially given that lot's of goodies in neo4j is only available in commercial version.

As for generalizing more it makes sense to have abstract cypher proxy with separate implementations taking into account these subtle differences. Maintenance of metadata proxy indeed should not be that much of a hassle - what's much more important in my view is we can reuse GraphSerializable models for Nebula. Extractors, Publishers & Loaders are usually one-time effort to prepare (unless you see a way to also reuse those). The pain starts when you need to catch up to GraphSerializable.

I would still love some input from @feng-tao @dkunitsk or anyone who was involved in original neo4j implementation to see if this level of refactoring would be acceptable.

@Golodhros
Copy link
Member

Hey @wey-gu, what is the cost of using this database?
If we agree to have this, would you be open to implement this?

@wey-gu
Copy link
Author

wey-gu commented Dec 9, 2022

Dear @Golodhros

Thanks❤!

what is the cost of using this database?

I think the costs(to users, if we are talking about this) are:

  • NebulaGraph is schema-ful(trade-off between flexibility & performance), so the mindset should be changed for using switching from neo4j, while this could be covered by the data-builder of NebulaGraph(I had implemented that on checking schema and making needed changes in the PoC, which could be refined though)
  • NebulaGraph is distributed, and there are costs here, as well, it's not as lightweight as neo4j(single process), but we've got HA, better perf/throughput, etc.
  • There is no cypher APOC equivalent implementation in NebulaGraph for now(there is a downstream user willing to upstream similar capabilities in near future though)
  • NebulaGraph is a relatively young project, but the community is quite active and it's getting gradually maturer. And it's already trusted by quite some of the teams

If we agree to have this, would you be open to implement this?

Thanks so much, I'll be happy/honored to implement this.

BR//Wey

@Golodhros
Copy link
Member

What do you think @feng-tao, @kristenarmes and @allisonsuarez ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants