
feat(test): integration test for S3 log store with pyspark #1988

Draft
wants to merge 5 commits into base: main

Conversation

dispanser
Copy link
Contributor

Description

This adds an integration test to make sure that delta-rs and pyspark can write to a shared delta table concurrently, using the same lock table in a compatible way.

Due to the way pyspark instantiates the SparkSession, there is only one session for the entire test run, so we must set up the S3-relevant configs globally, for all tests. This doesn't seem to interfere with the other tests, as they write locally and ignore any S3-related configuration.
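To make the session-wide setup concrete, here is a minimal sketch of the config entries such a shared SparkSession would carry. This is illustrative only: the lock-table name is a placeholder, and the exact keys assume Delta's `S3DynamoDBLogStore` is on the classpath (e.g. via `delta-storage-s3-dynamodb` and `hadoop-aws`).

```python
# Sketch (not the PR's actual fixture): config entries for routing pyspark's
# s3a:// Delta commits through the DynamoDB-backed log store. The lock-table
# name is a hypothetical default.
def s3_spark_conf(lock_table: str = "delta_rs_lock_table") -> dict:
    """Return Spark config entries for the S3/DynamoDB log store."""
    return {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        # Route s3a:// commits through the DynamoDB-backed log store.
        "spark.delta.logStore.s3a.impl": "io.delta.storage.S3DynamoDBLogStore",
        "spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName": lock_table,
    }
```

Because the session is created once, these would be applied to the builder before any test runs, which is exactly why they have to hold globally.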

@github-actions github-actions bot added the binding/python Issues for the Python package label Dec 21, 2023
@dispanser
Copy link
Contributor Author

@rtyler : the new test doesn't work out of the box, as it requires S3 credentials in the form of environment variables:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_REGION

AWS_PROFILE might work as well instead, but I haven't tested that.
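A small guard like the following could skip (or fail fast in) the test with a clear message when credentials are absent; the helper name is made up for illustration.

```python
import os

# The three variables the test currently requires.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")

def missing_aws_env() -> list:
    """Return the required AWS variables that are unset or empty."""
    return [name for name in REQUIRED_AWS_VARS if not os.environ.get(name)]
```

The test setup could then call `pytest.skip(f"missing: {missing_aws_env()}")` instead of failing deep inside the S3 client.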

@dispanser dispanser force-pushed the spark-integration-test branch 3 times, most recently from 29f3afb to 5f1e278 Compare December 21, 2023 13:48
@dispanser
Copy link
Contributor Author

It may be necessary to mark this test with a separate tag so it can be excluded from a normal CI run, unless we have some credentials we want to put into GitHub.
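With pytest this could look like the sketch below; the marker name `pyspark` is a suggestion, and it would also need to be registered under `markers` in the pytest config to avoid warnings.

```python
import pytest

# Hypothetical marker name; register it in the pytest config, e.g.
# [tool.pytest.ini_options] markers = ["pyspark: needs S3 credentials"].
@pytest.mark.pyspark
def test_concurrent_pyspark_and_delta_rs_writes():
    ...  # body elided; the real test lives in this PR
```

A normal CI run would then invoke `pytest -m "not pyspark"`, while a credentialed job runs `pytest -m pyspark`.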

@rtyler rtyler self-assigned this Dec 22, 2023
@rtyler
Copy link
Member

rtyler commented Dec 22, 2023

I think I can add some credentials to an AWS account I can control and structure specifically for this purpose. That said, it looks like I'm going to have to create an IAM user with a broad swath of permissions. Have you been testing with a basically admin-level IAM user, or do you have a specific set of IAM permissions in mind?

@rtyler rtyler marked this pull request as draft December 22, 2023 17:44
@dispanser
Copy link
Contributor Author

I completely forgot about permissions and their implications while writing this. 😫

I've been using a very unrestricted user, indeed. The way it currently works is by creating a bucket and a table for each run, but that's purely for the sake of setup-free reproducibility. To reduce the required permissions, we could modify the tests to accept an existing bucket + table via environment variables, which would allow us to get away with very limited permissions - basically reading from + writing to a specific bucket and table. That table could also be pre-configured with a proper TTL and very small provisioned throughput so it fits neatly into any free AWS tier.
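The env-driven fallback described above could be sketched like this; the variable names `DELTA_RS_TEST_BUCKET` and `DELTA_RS_TEST_LOCK_TABLE` are hypothetical, not ones the PR defines.

```python
import os
import uuid

def resolve_test_resources() -> tuple:
    """Prefer a pre-provisioned bucket and lock table from the environment
    (hypothetical variable names); otherwise fall back to per-run names,
    which keeps today's setup-free behavior but needs broader permissions."""
    bucket = os.environ.get("DELTA_RS_TEST_BUCKET")
    lock_table = os.environ.get("DELTA_RS_TEST_LOCK_TABLE")
    if bucket and lock_table:
        return bucket, lock_table
    suffix = uuid.uuid4().hex[:8]
    return f"delta-rs-it-{suffix}", f"delta-rs-lock-{suffix}"
```

When both variables are set, the test only needs read/write access to those two resources, matching the restricted-IAM proposal.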

@dispanser dispanser force-pushed the spark-integration-test branch 3 times, most recently from f602332 to 2a23837 Compare December 30, 2023 10:59
@rtyler
Copy link
Member

rtyler commented Jan 3, 2024

@dispanser The big refactor just landed, can you rebase and rework this pull request accordingly? 😄

@rtyler rtyler added this to the Rust v0.17 milestone Jan 3, 2024
@dispanser
Copy link
Contributor Author

@rtyler : rebase done. What do you think about my proposal in the previous comment?

@rtyler rtyler force-pushed the spark-integration-test branch 2 times, most recently from 88984a6 to 6cfc616 Compare January 24, 2024 08:00
dispanser and others added 4 commits January 30, 2024 13:56
This still is not working, but it's not totally failing I guess
# Description
This PR upgrades `delta-rs` to using DataFusion 35.0, which was recently
released. In order to do this, I had to fix a few breaking changes, and
also upgrade Arrow to 50 and `sqlparser` to 0.41.

# Related Issue(s)
N/A

# Documentation
See here for the list of PRs which required code change:
- apache/datafusion#8703
-
https://github.com/apache/arrow-datafusion/blob/ec6abece2dcfa68007b87c69eefa6b0d7333f628/dev/changelog/35.0.0.md?plain=1#L227

---------

Co-authored-by: Ming Ying <ming.ying.nyc@gmail.com>
…ad (delta-io#2120)

# Description

Make sure the read path for delta table commit entries passes through
the log store, enabling it to ensure the invariants and potentially
repair a broken commit in the context of the S3 / DynamoDB log store
implementation.

This also adds another test in the context of S3 log store: repairing a
log store on load was not implemented previously.
  
Note that this is a stopgap, not a complete solution: it comes with a
performance penalty as we're triggering a redundant object store list
operation just for the purpose of "triggering" the log store
functionality.


fixes delta-io#2109

---------

Co-authored-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Jan 31, 2024
@github-actions github-actions bot removed the binding/rust Issues for the Rust crate label Feb 1, 2024
@rtyler rtyler removed this from the Rust v0.17 milestone Feb 6, 2024
Labels: binding/python Issues for the Python package
3 participants