Metadata time in queries with Qbeast Datasource is higher than expected #320

osopardo1 · 2024-04-23T09:38:20Z

While investigating simple queries in the Spark UI, we found that the Metadata time for the Qbeast datasource is much higher than expected.

Here's a comparison of reading a small (10-element) dataset with Parquet, Delta, and Qbeast:

Parquet

Delta

Qbeast

While Delta and Parquet spent only 2 ms on Metadata time, Qbeast spent 593 ms. This is for a small dataset, and the situation could get worse, especially in high-append scenarios.
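To cross-check the Spark UI numbers from a script, a small timing harness can help. The following is a minimal sketch, not part of the original investigation: the helper names `time_runs` and `compare` are hypothetical, and it measures end-to-end wall-clock time, which is only a rough proxy for the Metadata time metric reported on the scan node.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(action: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run `action` `repeats` times and return each wall-clock duration in ms."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def compare(actions: Dict[str, Callable[[], object]], repeats: int = 5) -> Dict[str, float]:
    """Return the median duration in ms for each named action."""
    return {
        name: statistics.median(time_runs(fn, repeats))
        for name, fn in actions.items()
    }


if __name__ == "__main__":
    # Stand-in workloads so the sketch runs without Spark (assumption:
    # in a real check, each callable would load the same data in one format).
    results = compare({
        "fast": lambda: None,
        "slow": lambda: time.sleep(0.01),
    })
    for name, ms in results.items():
        print(f"{name}: {ms:.2f} ms")
```

With PySpark, each callable could be something like `lambda: spark.read.format("qbeast").load(path).count()`, with `"parquet"` and `"delta"` variants pointing at equivalent copies of the data (`spark` and `path` are assumed to exist here). Spark may cache file listings between runs, so the first measurement is likely the most representative of a cold metadata load.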

I've checked the execution plan and the configuration, and there does not seem to be much difference aside from the file index used:

For Parquet, an InMemoryFileIndex is initialized.
For Delta, a PreparedDeltaFileIndex is initialized.
For Qbeast, a DefaultFileIndex is initialized.

Further investigation is needed. I'll keep the conversation going on this issue.

osopardo1 added the bug label Apr 23, 2024

osopardo1 self-assigned this Apr 25, 2024
