[#3264] feat(spark-connector): Support Iceberg time travel in SQL queries #3265

caican00 · 2024-05-04T06:46:34Z

What changes were proposed in this pull request?

Support Iceberg time travel in SQL queries

Why are the changes needed?

This adds support for time travel in SQL queries through the TIMESTAMP AS OF, FOR SYSTEM_TIME AS OF, VERSION AS OF, and FOR SYSTEM_VERSION AS OF clauses.
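For illustration, the four clause forms can be written out as query strings. This is a minimal sketch and not code from this PR: the `TimeTravelQueries` class, the table name `iceberg_catalog.db.events`, and the snapshot ID are hypothetical, and the helper does no escaping or validation.

```java
// Hypothetical helper (not part of Gravitino): builds one query per time-travel clause.
// Inputs are concatenated as-is, so this is for illustration only.
final class TimeTravelQueries {
  private TimeTravelQueries() {}

  // Time travel by timestamp, passed as a SQL string literal.
  static String timestampAsOf(String table, String timestamp) {
    return "SELECT * FROM " + table + " TIMESTAMP AS OF '" + timestamp + "'";
  }

  // Alternative spelling of the timestamp clause.
  static String forSystemTimeAsOf(String table, String timestamp) {
    return "SELECT * FROM " + table + " FOR SYSTEM_TIME AS OF '" + timestamp + "'";
  }

  // Time travel by snapshot ID.
  static String versionAsOf(String table, long snapshotId) {
    return "SELECT * FROM " + table + " VERSION AS OF " + snapshotId;
  }

  // Alternative spelling of the version clause.
  static String forSystemVersionAsOf(String table, long snapshotId) {
    return "SELECT * FROM " + table + " FOR SYSTEM_VERSION AS OF " + snapshotId;
  }

  public static void main(String[] args) {
    String table = "iceberg_catalog.db.events"; // assumed table name
    System.out.println(timestampAsOf(table, "2024-05-01 00:00:00"));
    System.out.println(forSystemTimeAsOf(table, "2024-05-01 00:00:00"));
    System.out.println(versionAsOf(table, 1234567890L));
    System.out.println(forSystemVersionAsOf(table, 1234567890L));
  }
}
```

Each string could then be passed to `SparkSession.sql(...)` in an integration test.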

Fix: #3264

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New ITs.


caican00 · 2024-05-15T07:42:27Z

In the end, I still chose to implement Iceberg time travel by overriding newScanBuilder, for the following reasons:

Although SparkIcebergTable extends SparkTable, it would still need to initialize inherited member variables such as snapshotId or branch before it could directly reuse SparkTable's newScanBuilder implementation.
However, initializing snapshotId or branch is harder than initializing refreshEagerly, because it is difficult to determine which of the two has been set in the real SparkTable instance. It is therefore difficult to selectively initialize snapshotId or branch through the super method.

In this case, users specify a version for time travel, but that version may be either a snapshot ID or a branch name:
https://github.com/apache/iceberg/blob/2058053b0c6e5b1c7e91fa029162f22d109aafb1/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L195-L199

Therefore, overriding newScanBuilder is more convenient and does not add much maintenance burden.
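The trade-off described above can be sketched with a simplified model. This is not the actual Gravitino or Iceberg code: `RealTable` and `WrapperTable` are hypothetical stand-ins for SparkTable and SparkIcebergTable, and the scan builder is reduced to a descriptive string. The point is that the wrapper never copies the time-travel state into its superclass; it forwards newScanBuilder to the instance that already holds that state.

```java
import java.util.Collections;
import java.util.Map;

// Stand-in for a table whose time-travel state is fixed at construction (assumption).
class RealTable {
  private final String name;
  private final Long snapshotId; // null when no snapshot was requested
  private final String branch;   // null when no branch was requested

  RealTable(String name, Long snapshotId, String branch) {
    this.name = name;
    this.snapshotId = snapshotId;
    this.branch = branch;
  }

  RealTable(String name) {
    this(name, null, null);
  }

  String name() {
    return name;
  }

  // Returns a description of the scan instead of a real ScanBuilder.
  String newScanBuilder(Map<String, String> options) {
    if (snapshotId != null) {
      return "scan " + name + " at snapshot " + snapshotId;
    }
    if (branch != null) {
      return "scan " + name + " on branch " + branch;
    }
    return "scan " + name + " at current snapshot";
  }
}

// Stand-in for the connector's wrapper: super only receives the name,
// so the inherited snapshotId/branch stay unset.
class WrapperTable extends RealTable {
  private final RealTable delegate;

  WrapperTable(RealTable delegate) {
    super(delegate.name());
    this.delegate = delegate;
  }

  // Forward to the delegate, which already carries the time-travel state.
  @Override
  String newScanBuilder(Map<String, String> options) {
    return delegate.newScanBuilder(options);
  }

  public static void main(String[] args) {
    RealTable pinned = new RealTable("events", 42L, null);
    System.out.println(new WrapperTable(pinned).newScanBuilder(Collections.emptyMap()));
  }
}
```

Without the override, the wrapper in this model would always scan the current snapshot, because its own inherited fields are never populated.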

caican00 · 2024-05-15T07:43:02Z

In the end, I still chose to implement Iceberg time travel by overriding newScanBuilder, for the following reasons:

Although SparkIcebergTable extends SparkTable, it would still need to initialize inherited member variables such as snapshotId or branch before it could directly reuse SparkTable's newScanBuilder implementation.

However, initializing snapshotId or branch is harder than initializing refreshEagerly, because it is difficult to determine which of the two has been set in the real SparkTable instance. It is therefore difficult to selectively initialize snapshotId or branch through the super method.

In this case, users specify a version for time travel, but that version may be either a snapshot ID or a branch name. https://github.com/apache/iceberg/blob/2058053b0c6e5b1c7e91fa029162f22d109aafb1/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L195-L199

Therefore, overriding newScanBuilder is more convenient and does not add much maintenance burden.
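The version ambiguity mentioned above (snapshot ID versus branch name) can be illustrated with a small resolver. This is a sketch under stated assumptions, not the Iceberg implementation: `VersionResolver`, its `Resolved` type, and the `knownBranches` set are hypothetical, the numeric-first ordering is an assumption of this example, and tags are left out.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical resolver: decides whether a user-supplied version string
// names a snapshot ID or a branch.
final class VersionResolver {
  enum Kind { SNAPSHOT_ID, BRANCH }

  static final class Resolved {
    final Kind kind;
    final long snapshotId; // meaningful only when kind == SNAPSHOT_ID
    final String branch;   // non-null only when kind == BRANCH

    private Resolved(Kind kind, long snapshotId, String branch) {
      this.kind = kind;
      this.snapshotId = snapshotId;
      this.branch = branch;
    }
  }

  private VersionResolver() {}

  // Assumption: a numeric version is treated as a snapshot ID first;
  // anything else must match a known branch name.
  static Resolved resolve(String version, Set<String> knownBranches) {
    try {
      return new Resolved(Kind.SNAPSHOT_ID, Long.parseLong(version), null);
    } catch (NumberFormatException notANumber) {
      if (!knownBranches.contains(version)) {
        throw new IllegalArgumentException(
            "Cannot find a snapshot ID or branch for version " + version);
      }
      return new Resolved(Kind.BRANCH, -1L, version);
    }
  }

  public static void main(String[] args) {
    Set<String> branches = Collections.singleton("audit");
    System.out.println(resolve("123", branches).kind);
    System.out.println(resolve("audit", branches).kind);
  }
}
```

Because the two cases set different fields, a subclass cannot know in advance which one to pass to its superclass, which is the difficulty described in this comment.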

cc @FANNG1

FANNG1 · 2024-05-16T00:32:07Z

How about overriding loadTable(Identifier ident, String version) and loadTable(Identifier ident, long timestamp) in GravitinoIcebergCatalog?

caican00 · 2024-05-16T02:39:31Z

How about overriding loadTable(Identifier ident, String version) and loadTable(Identifier ident, long timestamp) in GravitinoIcebergCatalog?

With that approach, we would still have the problem of initializing snapshotId or branch when invoking the super method of SparkTable.

FANNG1 · 2024-05-16T04:04:17Z

How about overriding loadTable(Identifier ident, String version) and loadTable(Identifier ident, long timestamp) in GravitinoIcebergCatalog?

With that approach, we would still have the problem of initializing snapshotId or branch when invoking the super method of SparkTable.

Let me think about it for a while.

...rk-connector/src/main/java/com/datastrato/gravitino/spark/connector/hive/SparkHiveTable.java

.../src/main/java/com/datastrato/gravitino/spark/connector/iceberg/GravitinoIcebergCatalog.java

...rk-connector/src/main/java/com/datastrato/gravitino/spark/connector/catalog/BaseCatalog.java

...nector/src/main/java/com/datastrato/gravitino/spark/connector/iceberg/SparkIcebergTable.java

...test/java/com/datastrato/gravitino/integration/test/spark/iceberg/SparkIcebergCatalogIT.java

caican00 · 2024-05-17T10:23:34Z

I fixed the merge conflict. Could you please review again when you have time? Thank you, @FANNG1.

.../src/main/java/com/datastrato/gravitino/spark/connector/iceberg/GravitinoIcebergCatalog.java

...rk-connector/src/main/java/com/datastrato/gravitino/spark/connector/catalog/BaseCatalog.java

FANNG1 · 2024-05-20T02:05:58Z

LGTM, except minor comments

caican00 · 2024-05-20T03:41:22Z

LGTM, except minor comments

The comments have been addressed. Could you please review again? @FANNG1

…ries (#3265)

### What changes were proposed in this pull request?

Support Iceberg time travel in SQL queries

### Why are the changes needed?

supports time travel in SQL queries using `TIMESTAMP AS OF`, `FOR SYSTEM_TIME AS OF` or `VERSION AS OF`, `FOR SYSTEM_VERSION AS OF` clauses.

Fix: #3264

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New ITs.

FANNG1 · 2024-05-20T05:54:04Z

@caican00, thanks for your contribution!

[datastrato#2543] feat(spark-connector): support row-level operations…

4d334aa

… to iceberg Table

caican00 marked this pull request as draft May 4, 2024 06:58

[datastrato#3264] feat(spark-connector): Support Iceberg time travel …

90b7be8

…in SQL queries

caican00 force-pushed the iceberg-asof branch from 1503864 to 90b7be8 on May 4, 2024 15:31

update

65ef2a4

caican00 force-pushed the iceberg-asof branch from 38be2ae to 65ef2a4 on May 5, 2024 15:14

Merge branch 'main' of github.com:datastrato/gravitino into iceberg-asof

302244b

caican00 force-pushed the iceberg-asof branch from 84a345a to 0de7298 on May 15, 2024 06:47

caican00 marked this pull request as ready for review May 15, 2024 06:48

update

90b8d14

caican00 force-pushed the iceberg-asof branch from 0de7298 to 90b8d14 on May 15, 2024 07:03

Merge branch 'main' into iceberg-asof

86f51cd

caican00 marked this pull request as draft May 15, 2024 08:02

caican00 marked this pull request as ready for review May 15, 2024 08:02