Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#3264] feat(spark-connector): Support Iceberg time travel in SQL queries #3265

Merged
merged 15 commits into from May 20, 2024

Conversation

caican00
Copy link
Contributor

@caican00 caican00 commented May 4, 2024

What changes were proposed in this pull request?

Support Iceberg time travel in SQL queries

Why are the changes needed?

supports time travel in SQL queries using TIMESTAMP AS OF, FOR SYSTEM_TIME AS OF or VERSION AS OF, FOR SYSTEM_VERSION AS OF clauses.

Fix: #3264

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New ITs.

@caican00 caican00 marked this pull request as draft May 4, 2024 06:58
@caican00 caican00 marked this pull request as ready for review May 15, 2024 06:48
@caican00
Copy link
Contributor Author

Finally, I still choose to implement Iceberg time travel by overriding newScanBuilder, for the following reasons:

  1. Although SparkIcebergTable extended SparkTable, it still needs to initialize its member variables, such as snapshotId or branch, before it can directly reuse the newScanBuilder implementation of SparkTable.

  2. However, initializing snapshotId or branch is difficult, not as easy as initializing refreshEagerly, because it is difficult to determine snapshotId or branch is initialized in the real sparkTable. Therefore, it is difficult to selectively initialize snapshotId or branch through the super method.

In this case, users specify the version for time travel, but the version may be snapshotId or branch name.
https://github.com/apache/iceberg/blob/2058053b0c6e5b1c7e91fa029162f22d109aafb1/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L195-L199

  1. Therefore, overriding newScanBuilder is more convenient and does not introduce too much maintenance burden.

@caican00
Copy link
Contributor Author

Finally, I still choose to implement Iceberg time travel by overriding newScanBuilder, for the following reasons:

  1. Although SparkIcebergTable extended SparkTable, it still needs to initialize its member variables, such as snapshotId or branch, before it can directly reuse the newScanBuilder implementation of SparkTable.
  2. However, initializing snapshotId or branch is difficult, not as easy as initializing refreshEagerly, because it is difficult to determine snapshotId or branch is initialized in the real sparkTable. Therefore, it is difficult to selectively initialize snapshotId or branch through the super method.

In this case, users specify the version for time travel, but the version may be snapshotId or branch name. https://github.com/apache/iceberg/blob/2058053b0c6e5b1c7e91fa029162f22d109aafb1/spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L195-L199

  1. Therefore, overriding newScanBuilder is more convenient and does not introduce too much maintenance burden.

cc @FANNG1

@caican00 caican00 marked this pull request as draft May 15, 2024 08:02
@caican00 caican00 marked this pull request as ready for review May 15, 2024 08:02
@FANNG1
Copy link
Contributor

FANNG1 commented May 16, 2024

How about Override loadTable(Identifier ident, String version) and loadTable(Identifier ident, String version) for GravitinoIcebergCatalog?

@caican00
Copy link
Contributor Author

How about Override loadTable(Identifier ident, String version) and loadTable(Identifier ident, String version) for GravitinoIcebergCatalog?

in this way, we still have the problem of initializing the snapshotId or branch when invoking super method of SparkTable.

@FANNG1
Copy link
Contributor

FANNG1 commented May 16, 2024

How about Override loadTable(Identifier ident, String version) and loadTable(Identifier ident, String version) for GravitinoIcebergCatalog?

in this way, we still have the problem of initializing the snapshotId or branch when invoking super method of SparkTable.

Let me think a while

@caican00
Copy link
Contributor Author

caican00 commented May 17, 2024

fixed conflict, and could you please help review again if you are free? Thank you @FANNG1

@caican00 caican00 requested a review from FANNG1 May 20, 2024 01:38
@FANNG1
Copy link
Contributor

FANNG1 commented May 20, 2024

LGTM, except minor comments

@caican00
Copy link
Contributor Author

LGTM, except minor comments

comments have been addressed, and could you please help review again? @FANNG1

@FANNG1 FANNG1 merged commit 9929a99 into datastrato:main May 20, 2024
22 checks passed
github-actions bot pushed a commit that referenced this pull request May 20, 2024
…ries (#3265)

### What changes were proposed in this pull request?

Support Iceberg time travel in SQL queries

### Why are the changes needed?
supports time travel in SQL queries using `TIMESTAMP AS OF`, `FOR
SYSTEM_TIME AS OF` or `VERSION AS OF`, `FOR SYSTEM_VERSION AS OF`
clauses.

Fix: #3264

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New ITs.
@FANNG1
Copy link
Contributor

FANNG1 commented May 20, 2024

@caican00 , thanks for your contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Subtask] Support Iceberg time travel in SQL queries
2 participants