
Support for DatasourceV2: Sampling Pushdown and Limit Pushdown [Spark] #175

Open
1 of 6 tasks
osopardo1 opened this issue Mar 16, 2023 · 10 comments

Comments

@osopardo1
Member

osopardo1 commented Mar 16, 2023

Related to #166.

Qbeast-Spark should be compatible with the latest versions of Delta Lake and Apache Spark, to benefit from any new features and major upgrades.
The upgrade to Delta 2.1.0 and Spark 3.3.0 reveals a set of interesting pushdown operations that could be empowered by the Qbeast metadata.

We should:

Thoughts on #68

  • Ideally, we would no longer need to compute the hash of the indexed columns in order to filter the records in memory. But then we need to ensure that the sample returned from OTreeIndex (or from any other class involved, such as ParquetFileFormat) is correct.
  • Since we know the minWeight and maxWeight of each file, we can determine how many rows to read from it. The only thing we need to find out is where those records get filtered (see the sketch after this list).
  • Here's a detailed blog post that explains how Spark processes Parquet files: https://animeshtrivedi.github.io/spark-parquet-reading
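
Below is a minimal sketch of that file-level pruning idea, assuming (hypothetically) that weights are normalized to [0, 1] and that the sample fraction maps directly onto that range; `FileWeights` and `classify` are illustrative names, not project code.

```scala
// Hypothetical sketch: classify index files for a sample fraction using their
// minWeight/maxWeight, so that only boundary files need in-memory filtering.
final case class FileWeights(path: String, minWeight: Double, maxWeight: Double)

object SamplePruning {

  /** Returns (filesToRead, filesNeedingRowFilter).
    *  - Files with minWeight > fraction contribute nothing and are skipped.
    *  - Files with maxWeight <= fraction can be read fully, no hashing needed.
    *  - Files straddling the cut are read but still filtered row by row.
    */
  def classify(
      files: Seq[FileWeights],
      fraction: Double): (Seq[FileWeights], Seq[FileWeights]) = {
    val toRead = files.filter(_.minWeight <= fraction)
    val needRowFilter = toRead.filter(_.maxWeight > fraction)
    (toRead, needRowFilter)
  }
}
```

For example, with a 10% sample (fraction = 0.1), a file with weights in [0.0, 0.05] is read entirely, one with [0.2, 0.9] is skipped, and one with [0.05, 0.3] is the only case that still needs per-row filtering.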
@osopardo1 osopardo1 added the enhancement New feature or request label Mar 16, 2023
@osopardo1
Member Author

In DataSource V2 there's also the possibility to build your own scan of the table, with more options than DataSource V1 (which we are currently using).

Maybe it's worth exploring the DSv2 API.

@alexeiakimov
Contributor

IMHO we need to explore the DataSource V2 API, and we will possibly end up dropping V1. Supporting both could require too much conditional logic.

@osopardo1
Member Author

Yes, I agree. Do you think this can be done in the same PR #167, or is it better to do a workaround first for Sampling and Limit Pushdown and migrate everything to V2 in a separate issue?

@alexeiakimov
Contributor

alexeiakimov commented Mar 17, 2023

Well, I prefer to separate the migration to the new versions of Spark/Delta from the rework of QbeastTable on top of DataSource V2. Migration would mean that everything compiles and runs without new problems. The rework is a complex task, because the DataSource SPI changed a lot even though it is still labeled V2. A good overview of the Spark 3.0 SPI can be found here: https://blog.madhukaraphatak.com/categories/datasource-v2-spark-three/

@osopardo1 osopardo1 added status: in-progress This issue is in progress priority: normal This issue has normal priority labels Mar 17, 2023
@alexeiakimov
Contributor

I would like to share some thoughts on the Spark 3.x.x DataSource API V2.

  1. Surprisingly, DataSource API V2 in Spark 2.x and in Spark 3.x are different. A good general overview can be found at https://blog.madhukaraphatak.com/categories/datasource-v2-spark-three/
  2. The Read API seems straightforward: the mixins for filter, sampling and limit pushdown should be implemented on the ScanBuilder, which passes the filters, sampling and limit to the Scan and Batch. The Batch can compute the necessary files and pass them to the PartitionReaderFactory, which in turn creates a PartitionReader for each partition (we have just one). The PartitionReader returns the table rows one by one, like an iterator (see the sketch at the end of this comment).
  3. The Write API is much more challenging. An implementation does not have direct access to the original DataFrame; instead, the rows are written by the DataWriter one by one. In other words, the DataWriter is a callback, so there is no explicit notification when the writing starts. This means that we cannot assign weights by transforming the original DataFrame as we do now. It also means that we have to start the transaction lazily, when we read the current index.

@osopardo1, @cugni, @Jiaweihu08 Could it make sense to create a temporary DataFrame that copies the data being written, and then apply the algorithm we use now?
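
A minimal sketch of the Read API wiring described in point 2, against the Spark 3.3 connector interfaces; `QbeastScanBuilder` and `QbeastScan` are hypothetical names, and the bodies only record the pushed-down information rather than implementing the real pruning.

```scala
import org.apache.spark.sql.connector.read._
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Hypothetical ScanBuilder mixing in the three pushdown interfaces.
class QbeastScanBuilder(schema: StructType)
    extends ScanBuilder
    with SupportsPushDownFilters
    with SupportsPushDownTableSample
    with SupportsPushDownLimit {

  private var filters: Array[Filter] = Array.empty
  private var sample: Option[(Double, Double)] = None
  private var limit: Option[Int] = None

  // Record the filters; returning them all means Spark still re-evaluates
  // them after the scan, which is the safe default.
  override def pushFilters(pushed: Array[Filter]): Array[Filter] = {
    filters = pushed
    pushed
  }
  override def pushedFilters(): Array[Filter] = filters

  // Returning true signals that the source will apply the sampling itself.
  override def pushTableSample(
      lowerBound: Double,
      upperBound: Double,
      withReplacement: Boolean,
      seed: Long): Boolean = {
    sample = Some((lowerBound, upperBound))
    true
  }

  // Returning true signals that the limit has been pushed to the source.
  override def pushLimit(l: Int): Boolean = {
    limit = Some(l)
    true
  }

  override def build(): Scan = new QbeastScan(schema, filters, sample, limit)
}

// The Scan doubles as the Batch: it would plan the pruned files as input
// partitions and hand them to a PartitionReaderFactory.
class QbeastScan(
    schema: StructType,
    filters: Array[Filter],
    sample: Option[(Double, Double)],
    limit: Option[Int])
    extends Scan
    with Batch {
  override def readSchema(): StructType = schema
  override def toBatch: Batch = this
  override def planInputPartitions(): Array[InputPartition] = Array.empty // file listing goes here
  override def createReaderFactory(): PartitionReaderFactory = ???        // one PartitionReader per split
}
```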

@osopardo1
Member Author

osopardo1 commented Mar 20, 2023

Thank you for the overview!

  1. Very nice, a lot of code can be reused from OTreeIndex once the filters and everything else are pushed down.

  2. One solution for the Write API is to keep a fallback to Version 1, which is what we have implemented for the moment. The WriteBuilder returns a V1Write, which creates an InsertableRelation that calls our methods in IndexedTable for indexing and writing the DataFrame (see the sketch below). I think we can migrate just the Read features for now, while we consider moving everything else in the future.
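
A minimal sketch of that fallback, using Spark's V1Write bridge; `QbeastWriteBuilder` and `indexAndWrite` are hypothetical placeholders for the project's IndexedTable write path, not the actual classes.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.connector.write.{LogicalWriteInfo, V1Write, Write, WriteBuilder}
import org.apache.spark.sql.sources.InsertableRelation

// Hypothetical WriteBuilder that falls back to the V1 write path.
class QbeastWriteBuilder(info: LogicalWriteInfo) extends WriteBuilder {

  override def build(): Write = new V1Write {
    // Spark hands the whole DataFrame to this relation, so the existing
    // pipeline (weight assignment, OTree indexing, file writing) can run as-is.
    override def toInsertableRelation: InsertableRelation = new InsertableRelation {
      override def insert(data: DataFrame, overwrite: Boolean): Unit =
        indexAndWrite(data, overwrite)
    }
  }

  // Placeholder for the call into IndexedTable's indexing and writing logic.
  private def indexAndWrite(data: DataFrame, overwrite: Boolean): Unit = ()
}
```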

@osopardo1 osopardo1 mentioned this issue Mar 20, 2023
3 tasks
@osopardo1
Member Author

Well, I prefer to separate the migration to the new versions of Spark/Delta from the rework of QbeastTable on top of DataSource V2. Migration would mean that everything compiles and runs without new problems. The rework is a complex task, because the DataSource SPI changed a lot even though it is still labeled V2. A good overview of the Spark 3.0 SPI can be found here: https://blog.madhukaraphatak.com/categories/datasource-v2-spark-three/

Noted. We are going to merge #167 first and then migrate to V2. We can also split the development of the migration in two:

  1. Migrate the Read features (plus add Sampling and Limit Pushdown)
  2. Migrate the Writer (if needed)

@alexeiakimov
Contributor

alexeiakimov commented Mar 20, 2023

Technically I prefer 4 steps:

  1. Implement the Read API V2 to have a working pipeline.
  2. Add sampling pushdown.
  3. Add limit pushdown.
  4. Implement the Write API V2, falling back to V1Write.

Small changes will probably be easier to review and to demonstrate.

@osopardo1
Member Author

Plan looks good to me. 👍

@osopardo1 osopardo1 self-assigned this Mar 20, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Mar 24, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Mar 27, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Mar 27, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Mar 27, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Mar 29, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Mar 29, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Apr 3, 2023
@osopardo1 osopardo1 changed the title Support for Spark 3.3.x Sampling Pushdown and Limit Pushdown Support for DatasourceV2: Sampling Pushdown and Limit Pushdown Sep 21, 2023
@osopardo1 osopardo1 changed the title Support for DatasourceV2: Sampling Pushdown and Limit Pushdown Support for DatasourceV2: Sampling Pushdown and Limit Pushdown [Spark] Sep 21, 2023
@osopardo1 osopardo1 added status: on-hold This issue is on hold and removed status: in-progress This issue is in progress priority: normal This issue has normal priority labels Sep 21, 2023
@osopardo1 osopardo1 removed the status: on-hold This issue is on hold label Oct 23, 2023
@osopardo1
Member Author

I am keeping this issue open for future development plans. We need to rethink the design, the utility, and the properties involved.
