Refactor to support Apache Arrow #307

Open
rupurt opened this issue Nov 30, 2022 · 3 comments

Comments

@rupurt

rupurt commented Nov 30, 2022

Howdy,

I read your thesis on the rust-rewrite branch. It's very informative, and you have obviously put a lot of thought into the tool. As you noted in Possible Improvements and Lessons Learned, a columnar format and Apache Arrow seem to be the state-of-the-art way forward.

Do you have any plans to support Arrow, or have you thought about what changes your engine would need to make it work? I see a tonne of potential in making OctoSQL a general-purpose Go CLI + embedded serving layer in something like the Kappa architecture. I've already read through much of the source and will start contributing PRs for general improvements, but I'd like to know whether I should fork and create a separate project or try to refactor your current codebase over time to support Arrow.

@cube2222
Owner

cube2222 commented Dec 3, 2022

Hey @rupurt!

I'm not planning to support Apache Arrow nor port OctoSQL to a vectorised execution engine. It's not worth the effort. Re the thesis - it's old and OctoSQL has been rewritten from scratch since. It's around 100x faster now, due to the static typing and how the execution phase now works.

In general, if you're working with data where the speedup of a columnar engine would be worth it, just use https://github.com/apache/arrow-datafusion or a project built around it. It has much more manpower behind it. Arrow is a PITA to code around, especially when you want union types, repetition, and deeply nested data structures. Additionally, the Go Arrow library is way behind the Rust one (or others).

To answer your last question, if you'd like to port OctoSQL to Arrow, please fork. As far as improvements go, they're welcome! However, please first create issues to discuss the details of the contributions.

If you'd like to attempt this redesign on your own, your best bet is to keep the physical phase but rewrite the execution phase almost completely. Here's an experiment you can use as inspiration: https://github.com/cube2222/octosql/tree/vectorization-experiment2

@rupurt
Author

rupurt commented Dec 3, 2022

Thank you for the info and leads, @cube2222. I definitely noticed that you added a dataflow engine, which is one of the reasons I'm stoked to understand and work with your project!

I'm totally with you regarding DataFusion having a larger community and more progress. But my personal belief is that no one has really nailed the serving layer for small/medium/big data, besides maybe Presto, which is JVM-based. I also believe that once we see the right tool, every language will implement a version, and Go is a sleeping giant with a huge and growing fan base.

Also, sometimes it's just fun to hack on interesting stuff 😄

@gedw99

gedw99 commented Apr 29, 2023

DataFusion and Arrow are certainly cool and have momentum.

OctoSQL is a nice, low-ceremony alternative. I prefer OctoSQL. I'll try to make PRs for things.

@cube2222, it would be good to have a roadmap and to triage issues into agreed pieces of work that people can then pick up.

3 participants