Metadata time in queries with Qbeast Datasource is higher than expected #320

Open
osopardo1 opened this issue Apr 23, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@osopardo1
Member

While investigating simple queries in the Spark UI, we detected that the Metadata time for the Qbeast datasource is higher than expected.

Here's a comparison of a small (10 element) dataset read with Parquet, Delta, and Qbeast:

Parquet

image

Delta

image

Qbeast

image

While Delta and Parquet each spent only 2 ms on Metadata time, Qbeast spent 593 ms. This is for a small dataset, and the situation could get worse, especially in high-append scenarios.
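
To reproduce the comparison outside the Spark UI, one rough option is to time the same driver-side read for each format. The sketch below is only a proxy: it measures wall-clock time for `load()` plus `inputFiles()`, which is not the same counter as the "metadata time" shown in the UI. The helper names (`time_ms`, `compare_formats`, `spark_list_files`) are hypothetical, and the session is assumed to already have the Delta and Qbeast datasources configured.

```python
import time
from typing import Callable, Dict


def time_ms(action: Callable[[], object]) -> float:
    """Return the wall-clock milliseconds taken by one call to `action`."""
    start = time.perf_counter()
    action()
    return (time.perf_counter() - start) * 1000.0


def compare_formats(
    paths: Dict[str, str],
    list_files: Callable[[str, str], object],
) -> Dict[str, float]:
    """Time `list_files(fmt, path)` once per format and return ms per format."""
    return {
        fmt: time_ms(lambda f=fmt, p=path: list_files(f, p))
        for fmt, path in paths.items()
    }


def spark_list_files(spark) -> Callable[[str, str], object]:
    """Build a lister backed by a live SparkSession (assumed, not created here)."""
    def _list(fmt: str, path: str):
        # load() resolves the relation; inputFiles() asks it for its data files.
        return spark.read.format(fmt).load(path).inputFiles()
    return _list
```

With a running session, something like `compare_formats({"parquet": p1, "delta": p2, "qbeast": p3}, spark_list_files(spark))` would return one timing per format, where `p1`, `p2`, and `p3` are placeholder paths to the same 10-element dataset written in each format. A single run includes warm-up effects, so repeating the call a few times is advisable.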

I've checked the Execution Plan and the configuration, and there does not seem to be much difference aside from the file index used:

  • For Parquet, an InMemoryFileIndex is initialized.
  • For Delta, a PreparedDeltaFileIndex is initialized.
  • For Qbeast a DefaultFileIndex is initialized.

Further investigation is needed. Will keep the conversation going on this issue.

@osopardo1 osopardo1 added the bug Something isn't working label Apr 23, 2024
@osopardo1 osopardo1 self-assigned this Apr 25, 2024