CHDB is significantly slower on Arrow tables (in-memory) than with CSV / Parquet #195

Open
ilyanoskov opened this issue Feb 5, 2024 · 1 comment
Labels: Arrow Apache Arrow support

@ilyanoskov

I recently had a case where I had to process a Pandas dataframe with 70M rows and 5 simple columns, using window functions and GROUP BY operations.

After saving the data to CSV / Parquet and querying the file, CHDB computed the results in 4-5 seconds; running the same queries directly over the in-memory Arrow table took close to 30 seconds.

Steps to reproduce are simple: create a dataframe with random data in 5 columns (id, time, val1, val2, val3) and 70M rows, then run complex GROUP BY / WINDOW queries over it in memory. Then save the dataframe to a file and run the same queries over the file. The file-based queries are significantly faster; a sketch of the repro follows.
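
For concreteness, here is roughly what that repro looks like. This is a sketch only: `chdb.query()` and ClickHouse's `file()` table function are stable, but the in-memory API has changed across chdb versions. The `Python(df)` table function shown for the in-memory half comes from more recent releases and is an assumption here; at the time of this issue the in-memory path went through `chdb.dataframe` instead, so adapt to the version under test.

```python
import time

import numpy as np
import pandas as pd
import chdb

N = 70_000_000  # 70M rows as in the report; shrink for a quick local run

df = pd.DataFrame({
    "id": np.random.randint(0, 1_000, N),
    "time": np.arange(N),
    "val1": np.random.rand(N),
    "val2": np.random.rand(N),
    "val3": np.random.rand(N),
})

# GROUP BY plus a window function over the aggregated result.
sql = """
SELECT id,
       sum(val1) AS s1,
       avg(val2) AS a2,
       avg(sum(val1)) OVER (ORDER BY id) AS running_avg
FROM {src}
GROUP BY id
ORDER BY id
"""

# File-based path: Parquet read via ClickHouse's file() table function.
df.to_parquet("data.parquet")
t0 = time.time()
chdb.query(sql.format(src="file('data.parquet', Parquet)"), "CSV")
print("parquet:", time.time() - t0)

# In-memory path. NOTE: Python(df) is the table function from recent chdb
# releases and is an assumption here; older versions used chdb.dataframe.
t0 = time.time()
chdb.query(sql.format(src="Python(df)"), "CSV")
print("in-memory:", time.time() - t0)
```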

I would have expected the in-memory Arrow path to be faster, since accessing memory is faster than accessing disk?

@auxten
Member

auxten commented Feb 6, 2024

It's discussed in #187. I'm working on it.

@auxten auxten added the Arrow Apache Arrow support label Mar 11, 2024