Refactor to support Apache Arrow #307

rupurt · 2022-11-30T18:01:42Z

Howdy,

I read your thesis on the rust-rewrite branch. It is very informative, and you have obviously put a lot of thought into the tool. As you noted in Possible Improvements and Lessons Learned, a columnar format and Apache Arrow seem to be the state-of-the-art way forward.

Do you have any plans to support Arrow, or have you thought about which changes your engine would need to make it work? I see a tonne of potential in making OctoSQL a general-purpose Go CLI + embedded serving layer in something like the Kappa architecture. I've already read through much of the source and will start contributing PRs for general improvements, but I would like to know whether I should fork and create a separate project or try to refactor your current code base over time to support Arrow.


cube2222 · 2022-12-03T00:56:08Z

Hey @rupurt!

I'm not planning to support Apache Arrow nor port OctoSQL to a vectorised execution engine. It's not worth the effort. Re the thesis - it's old and OctoSQL has been rewritten from scratch since. It's around 100x faster now, due to the static typing and how the execution phase now works.

In general, if you're working with data where the speedup of a columnar engine would be worth it, just use https://github.com/apache/arrow-datafusion or a project built around it. It has much more manpower behind it. Arrow is a PITA to code around, especially when you want union types, repetition, and deeply nested data structures. Additionally, the Go Arrow library is way behind the Rust one (or others).

To answer your last question, if you'd like to port OctoSQL to Arrow, please fork. As far as improvements go, they're welcome! However, please first create issues to discuss the details of the contributions.

If you'd like to attempt this redesign on your own, your best bet is to keep the physical phase but rewrite the execution phase almost completely. Here's an experiment you can use as inspiration: https://github.com/cube2222/octosql/tree/vectorization-experiment2

rupurt · 2022-12-03T22:26:20Z

Thank you for the info and leads, @cube2222. I definitely noticed that you added a dataflow engine, which is one of the reasons I'm stoked to comprehend and work with your project!

I'm totally with you with regard to DataFusion having a larger community and more progress. But my personal belief is that no one has really nailed the serving layer for small/medium/big data, besides maybe Presto, which is JVM-based. I also believe that once we see the right tool, every language will implement a version, and Go is a sleeping giant with a huge and growing fan base.

Also, sometimes it's just fun to hack on interesting stuff 😄

gedw99 · 2023-04-29T15:41:54Z

DataFusion and Arrow are certainly cool and have momentum.

OctoSQL is a nice, low-ceremony alternative. I prefer OctoSQL. I will try to make PRs for things.

@cube2222, it would be good to have a roadmap and to triage issues into agreed bits of work that people can then pick up.
