CHDB is significantly slower on Arrow tables (in-memory) than with CSV / Parquet #195

Open
ilyanoskov opened this issue Feb 5, 2024 · 1 comment
Labels: Arrow Apache Arrow support

@ilyanoskov

I recently had a case where I had to process a Pandas dataframe with 70M rows and 5 simple columns, using window functions and GROUP BY operations.

After saving the data to CSV / Parquet and querying the file, CHDB computed the results in 4-5 seconds; running the same queries directly over the in-memory Arrow table took close to 30 seconds.

Steps to reproduce are simple: create a dataframe with random data in 5 columns (id, time, val1, val2, val3) and 70M rows, then run complex GROUP BY / WINDOW queries over it in memory. Then save the dataframe to a file and run the same queries over the file. The file-based queries are significantly faster; a sketch of the repro follows.
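
For concreteness, here is roughly what that repro looks like. This is a sketch only: `chdb.query()` and ClickHouse's `file()` table function are stable, but the in-memory API has changed across chdb versions. The `Python(df)` table function shown for the in-memory half comes from more recent releases and is an assumption here; at the time of this issue the in-memory path went through `chdb.dataframe` instead, so adapt to the version under test.

```python
import time

import numpy as np
import pandas as pd
import chdb

N = 70_000_000  # 70M rows as in the report; shrink for a quick local run

df = pd.DataFrame({
    "id": np.random.randint(0, 1_000, N),
    "time": np.arange(N),
    "val1": np.random.rand(N),
    "val2": np.random.rand(N),
    "val3": np.random.rand(N),
})

# GROUP BY plus a window function over the aggregated result.
sql = """
SELECT id,
       sum(val1) AS s1,
       avg(val2) AS a2,
       avg(sum(val1)) OVER (ORDER BY id) AS running_avg
FROM {src}
GROUP BY id
ORDER BY id
"""

# File-based path: Parquet read via ClickHouse's file() table function.
df.to_parquet("data.parquet")
t0 = time.time()
chdb.query(sql.format(src="file('data.parquet', Parquet)"), "CSV")
print("parquet:", time.time() - t0)

# In-memory path. NOTE: Python(df) is the table function from recent chdb
# releases and is an assumption here; older versions used chdb.dataframe.
t0 = time.time()
chdb.query(sql.format(src="Python(df)"), "CSV")
print("in-memory:", time.time() - t0)
```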

I would have expected the in-memory Arrow path to be faster, since accessing memory is faster than accessing disk?

@auxten
Member

auxten commented Feb 6, 2024

It's discussed in #187. I'm working on it.

@auxten auxten added the Arrow Apache Arrow support label Mar 11, 2024