Support for DatasourceV2: Sampling Pushdown and Limit Pushdown [Spark] #175
In DataSource V2 there's also the possibility to build your own scan of the table, with more options than DataSource V1 (which we are currently using). It may be worth exploring the DSv2 API.
IMHO we need to explore the DataSource V2 API; we may well end up dropping V1. Supporting both could require too much conditional logic.
Yes, I agree. Do you think this can be done in the same PR (#167), or is it better to do a workaround for Sampling and Limit Pushdown first and migrate everything to V2 in a separate issue?
Well, I prefer to separate the migration to the new versions of Spark/Delta from the rework of QbeastTable on top of DataSource V2. Migration would mean that everything compiles and runs without new problems. The rework is a complex task, because they changed the DataSource SPI a lot, although it still carries the V2 name. A good overview of the Spark 3.0 SPI can be found here: https://blog.madhukaraphatak.com/categories/datasource-v2-spark-three/
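For context, a minimal sketch of what "building your own scan" means in the Spark 3.x DSv2 SPI: a `Table` hands Spark a `ScanBuilder`, which produces the `Scan` that gets planned. The `QbeastTableImpl` below is a simplified stand-in, not the actual qbeast-spark code:

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability}
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Skeleton of the DSv2 read path: Table -> ScanBuilder -> Scan.
class QbeastTableImpl(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "qbeast"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ)
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new ScanBuilder {
      override def build(): Scan = new Scan {
        override def readSchema(): StructType = tableSchema
        // a real implementation returns a Batch of input partitions here
      }
    }
}
```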
I would like to share some thoughts on the Spark 3.x.x DataSource API V2.
@osopardo1, @cugni, @Jiaweihu08 Could it make sense to create a temporary DataFrame to copy the data being written, and then apply the algorithm we use now?
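If it helps, a rough sketch of that idea, assuming the current write path can be invoked on an arbitrary DataFrame (all names here are hypothetical):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch: buffer the incoming batch once, then run the
// existing indexing/write algorithm over the materialised copy.
def bufferAndWrite(incoming: DataFrame)(writeWithCurrentAlgorithm: DataFrame => Unit): Unit = {
  val buffered = incoming.cache()
  buffered.count() // force materialisation so later passes reuse the copy
  try writeWithCurrentAlgorithm(buffered)
  finally buffered.unpersist()
}
```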
Thank you for the overview!
Noted. We are going to merge #167 first and then migrate to V2. We can also split the development of the migration in two:
Technically I prefer 4 steps:
Small changes will probably be easier to review and to demonstrate.
Plan looks good to me. 👍
I am keeping this issue open for future development plans. We need to rethink the design, the utility, and the properties involved.
Related to #166.
Qbeast-Spark should be compatible with the latest versions of Delta Lake and Apache Spark to benefit from any new features and major upgrades.
The upgrade to Delta 2.1.0 and Spark 3.3.0 reveals a set of interesting pushdown operations that could be empowered with the Qbeast metadata.
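For reference, the dependency bump in sbt terms (a sketch; the exact module layout of the build may differ):

```scala
// build.sbt sketch with the versions mentioned above
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.3.0" % "provided",
  "io.delta" %% "delta-core" % "2.1.0"
)
```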
We should:

- Implement Limit Pushdown with the index metadata (in `QbeastTableImpl`).
- Implement Sample Pushdown (in `QbeastTableImpl`). See the sketch after this list.
- Delete the Sample rewriting rule from `QbeastSparkSessionExtension`. The deletion of the Sample optimisation would also affect "Overhead of qbeast_hash filtering when doing a Sample" (#68). This requires some more insight.
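A sketch of what the two pushdowns could look like on the DSv2 side, using the `SupportsPushDownLimit` and `SupportsPushDownTableSample` mixins that Spark 3.3.0 provides. The `QbeastScanBuilder` shown here is hypothetical, not the actual implementation:

```scala
import org.apache.spark.sql.connector.read.{
  Scan,
  ScanBuilder,
  SupportsPushDownLimit,
  SupportsPushDownTableSample
}
import org.apache.spark.sql.types.StructType

// Returning true from a push* method tells Spark that the source
// takes responsibility for applying that operator itself.
class QbeastScanBuilder(tableSchema: StructType)
    extends ScanBuilder
    with SupportsPushDownLimit
    with SupportsPushDownTableSample {

  private var pushedLimit: Option[Int] = None
  private var pushedFraction: Option[Double] = None

  override def pushLimit(limit: Int): Boolean = {
    // The Qbeast metadata lets us stop listing blocks once enough
    // rows are covered, so the limit can be accepted.
    pushedLimit = Some(limit)
    true
  }

  override def pushTableSample(
      lowerBound: Double,
      upperBound: Double,
      withReplacement: Boolean,
      seed: Long): Boolean = {
    if (withReplacement) return false // not expressible as a weight range
    // A sample fraction maps to a weight interval over qbeast_hash.
    pushedFraction = Some(upperBound - lowerBound)
    true
  }

  override def build(): Scan = new Scan {
    override def readSchema(): StructType = tableSchema
    // a real Scan would select files using pushedLimit / pushedFraction
  }
}
```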
*Thoughts on #68: we should verify that the filtering applied from `OTreeIndex` (or from any other class that is involved, such as `ParquetFileFormat`) is correct. Since each block has a `minWeight` and a `maxWeight`, we can determine how many rows we can read from it. The only thing we need to find out is where those records are filtered.
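To make the row-count reasoning concrete, a small sketch. The `BlockStats` type, the normalisation of weights to [0.0, 1.0], and the uniform-weight assumption are all illustrative, not the actual qbeast-spark model:

```scala
// Estimate how many rows of a block survive the sampling filter
// qbeast_hash(cols) < targetWeight, given the block's weight range.
final case class BlockStats(minWeight: Double, maxWeight: Double, elementCount: Long)

def estimateRowsToRead(block: BlockStats, targetWeight: Double): Long = {
  if (targetWeight <= block.minWeight) 0L                      // block fully skipped
  else if (targetWeight >= block.maxWeight) block.elementCount // read whole block
  else {
    // Assumes weights are uniformly distributed inside the block.
    val kept = (targetWeight - block.minWeight) / (block.maxWeight - block.minWeight)
    (kept * block.elementCount).toLong
  }
}

// e.g. a 10% sample over a block spanning the full weight range:
// estimateRowsToRead(BlockStats(0.0, 1.0, 1000000L), 0.1) == 100000
```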