Make Blocks addressable from the file reader #322

Open
4 tasks
osopardo1 opened this issue Apr 25, 2024 · 0 comments
Labels
enhancement New feature or request

From v0.6.0 onwards, the Table is composed of files that contain multiple blocks, each of which may belong to the same cube or to different cubes. This is part of the Multiblock format, which allows Qbeast to balance the file layout without losing indexing benefits.

Blocks now help us locate a particular cube within a file, but a single block is not addressable or retrievable from the Spark reader. Although we use Delta File Skipping to discard data based on min/max, we do not support such fine-grained search when Sampling is applied.

This change requires some work regarding #175. Datasource V2 is more extensible and would allow us to implement our own reader. In this case, the reader should be designed to skip entire groups of rows based on the block number.
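To make the idea of skipping groups of rows by block more concrete, here is a minimal sketch in Python. It is not Qbeast code: the `Block` class, its field names, and the `row_ranges_for_sample` helper are hypothetical. It also assumes that blocks are stored contiguously in the file in the listed order, and that a block can be skipped for a sample when its minimum weight (normalized to [0, 1] here) exceeds the sample fraction. A real reader would have to verify both assumptions against the actual file layout.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical per-block metadata; field names are illustrative only."""
    cube_id: str
    min_weight: float   # assumed normalized to [0, 1] for this sketch
    element_count: int  # number of rows stored in the block


def row_ranges_for_sample(blocks: List[Block], fraction: float) -> List[Tuple[int, int]]:
    """Return half-open [start, end) row ranges that a reader would scan.

    Assumes blocks are laid out contiguously in the file, in list order.
    A block is skipped when its minimum weight exceeds the sample fraction.
    """
    ranges: List[Tuple[int, int]] = []
    offset = 0
    for block in blocks:
        start, end = offset, offset + block.element_count
        offset = end
        if block.min_weight > fraction:
            continue  # under the stated assumption, no row here is in the sample
        if ranges and ranges[-1][1] == start:
            # Merge with the previous range when the two are adjacent.
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
    return ranges


blocks = [
    Block("root", 0.0, 100),
    Block("A", 0.3, 50),
    Block("AA", 0.6, 25),
    Block("B", 0.2, 10),
]
print(row_ranges_for_sample(blocks, 0.25))
```

With ranges like these, the reader could seek to the selected rows instead of scanning the whole file. How those ranges map onto Parquet row groups or pages is exactly what the first TODO needs to analyze.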

PS: This is something that @alexeiakimov tried in previous issues, but other priorities took precedence.

TODOs:

  • Analyze how to make blocks addressable from a Parquet file.
  • Implement Datasource V2 for Qbeast.
  • Make a PoC.
  • Develop the feature and test it.
@osopardo1 osopardo1 added the enhancement New feature or request label Apr 25, 2024