How it is different with Datafuse? #207

Open
alexey-milovidov opened this issue Jul 19, 2021 · 8 comments

Comments

@alexey-milovidov

alexey-milovidov commented Jul 19, 2021

In terms of general product strategy, direction?
These projects look very similar; I'm wondering how they differ.

PS. Best wishes from ClickHouse team :)

@jinmingjian
Contributor

jinmingjian commented Jul 20, 2021

@alexey-milovidov Alex! I really appreciate the good wishes from the ClickHouse team!

> In terms of general product strategy, direction?

In terms of general product strategy, much information has indeed not been released yet. In terms of detailed technical implementation, ClickHouse (CH for short) is, of course, our great teacher. Beyond that, TensorBase (TB for short) also wants to systematically correct the problems seen in ClickHouse, drawing on my seven years of experience developing and operating big data systems.

> These projects look very similar; I'm wondering how they differ.

Datafuse (DFe for short) is a sad topic. I invited the DFe team several months ago, but they had already got good money from local ventures, and the invitation was not accepted.

Before this invitation, the project was licensed under the AGPL, a commercially unfriendly license. Personally, I do not look at the source code of any project under a commercially unfriendly license. Also before this invitation, I had already built a prototype of a pure Rust engine based on Arrow and DataFusion. It works very well and has become the new initial base of TB, as you have seen.

After the invitation, I evaluated DFe, which by then had been changed to the APL (Apache) license. My conclusion is that there is basically nothing new in DFe except for a good engineering structure (but a nice engineering structure is low-hanging fruit, in my humble opinion):

  1. A toy engine with misleading benchmarks on top of a toy dataset.
  2. A misleading name: DataFuse vs. DataFusion?
  3. Its route: Arrow + its own engine. But its engine, in fact, reinvents the wheel of DataFusion, which is now the base of TensorBase.
  4. DFe also repeats old stories from history in other respects, IMHO.

In TensorBase, we have a heart for change. This is the fundamental difference between TB and DFe.

  1. Repeating makes no sense.
    DataFusion is already a repetition of the existing engine story: Spark, Presto, Impala ... and our ClickHouse (orz, CH is not only an engine). Using a Raft library is repeating. Building a new hash-join operator on a repeated hashmap library is repeating. Cloning CH's MergeTree engine is repeating.

    Have we asked ourselves: is Raft great for the progress of this era? Is hash join great for the progress of this era? Is MergeTree great for the progress of this era?

    Do you think a Rust clone of CH makes sense for the world?

    TensorBase's answer is: No.

  2. Help the community rather than reinvent the wheel, unless we have something new.
    TB chooses to help the whole existing community, while DFe chooses to repeat an old thing in new code under a somewhat misleading name 😄


Back to the first question: TensorBase has, in fact, set out on a very different path from ClickHouse and, of course, from CH's clones:

  • TB will not use LSM or MergeTree. The DFe you mentioned does not even have storage yet. TB has Partition-Tree based storage that is more efficient than CH's (we will demonstrate more soon).
  • TB will not use the Raft protocol for distribution; CH and the DFe you mentioned do. But we want to show that we still do not compromise on consistency or top performance (soon).
  • TB will soon show the fruits of our engine optimization on top of DataFusion, with which we aim to challenge CH's performance further on real-world datasets.
  • TB has proposed a data-control subsystem from day one to address the security and privacy of big data.
  • TB has a revolutionary engine, as you have seen in the frontier edition. That engine has a different scheduler and runtime, which may not exist elsewhere in the world, at least not in the open-source world. I hope to bring it back to the community to serve this era, as CH currently does.
  • ...

Several local ventures have asked me: if DFe and similar projects copy your code, what will you do? I have no idea. CH has been copied everywhere, but it is backed by Yandex. This is not the case for TB.

So, for your first question, my current answer is: we will reveal things when they are ready. If they are copied, we hope the world will know that they were copied from TB.

Finally, I am not sure whether there is an opportunity to work with the Yandex/CH team on a next-generation data warehouse based on the new engineering of TB. TB is now a good friend in the CH ecosystem, and TB is open to any possibility.

@BohuTANG

@alexey-milovidov
Thanks for the question.
I like the ClickHouse style:

  1. the product must solve an actual problem
  2. and do it better than others

@jinmingjian
Hi, thanks for the comments.
But your comments are unfair and you don't even understand what we are doing.
Datafuse is a Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture.

Comments are cheap; show me the product and solve an actual problem!

@drmingdrmer

> Have we asked ourselves: is Raft great for the progress of this era?

@jinmingjian
I've been working on the storage layer of DataFuse and am so glad that some other people have found that Raft is an old-fashioned toy and have been looking for something new that is adapted to the latest cloud-oriented environment.

Raft is the choice right now only because it provides a well-defined engineering architecture and is easy to use for building a prototype, not because it has any advantage over other consensus protocols.

We've been working on something new that adapts to large-scale cross-DC and cross-cloud deployment.
The new protocol takes exactly 1 RTT to commit a message in a cluster of 3 DCs (and at most 5 DCs).

I can't agree more that repeating something is quite boring. Creating is the only thing that interests me.

And we can't wait to share with the community what we created, what we did right and what we did wrong. :DDD

@jinmingjian
Contributor

> But your comments are unfair and you don't even understand what we are doing.
> Datafuse is a Modern Real-Time Data Processing & Analytics DBMS with Cloud-Native Architecture.
>
> Comments are cheap; show me the product and solve an actual problem!

Hi, @BohuTANG, take it easy :) My comment is indeed biased by my own understanding. Alex and others ask this question because they see things in common between the projects. As the author, I explained the differences from my point of view. So, this is just my own opinion.

I may not understand what you are doing. But modern, real-time, data, analytics and cloud are also what TB is pursuing, and we certainly have our own understanding of them. There is no problem with having two or more open-source projects going in one direction, and the authors have the right to decide how to build their projects. I am not saying you are wrong; I am just saying I do not do things the way you do.

@jinmingjian
Contributor

> We've been working on something new that adapts to large-scale cross-DC and cross-cloud deployment.
> The new protocol takes exactly 1 RTT to commit a message in a cluster of 3 DCs (and at most 5 DCs).

Thanks for sharing. Raft is still good, and your new protocol is welcome :) I was just showing TB's different thinking. But "comments are cheap", so I will leave the answer to time.

> I can't agree more that repeating something is quite boring. Creating is the only thing that interests me.

Good wishes to you :)

@alexey-milovidov
Author

Ok. I will just keep an eye on both of these projects.
I will look for the ideas and share my knowledge. I hope it will benefit all the community.

@alexey-milovidov
Author

@jinmingjian BTW, I have tested Datafuse,
you may find this interesting: ClickHouse/ClickHouse#27510

@jinmingjian
Contributor

@alexey-milovidov Alex, thanks for the feedback and for sharing the benchmark! I have seen too many such things in the local community; unfair comparison is just one of them. Such practices are only destructive to the open-source community. As a member of the whole open-source community, I also hope everyone in it will respect innovation and help the community rather than damage it.
