
Issue with loading terminology seeds from S3 on Databricks #448

Open
sarah-tuva opened this issue Apr 25, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@sarah-tuva
Member

Describe the bug
Bug reported by someone on Slack. They are not able to use the post-hook macro to load seeds from S3 into their Databricks warehouse.

Struggling a bit to load terminologies to DBX w/ Unity (deets here). If I were to just load the terminologies manually, what should I comment out in the DBT project to make sure the terminology tables are not overwritten?

This turned out to be simple. For posterity:
If using a "shared access" cluster or a SQL Serverless warehouse (same idea), the S3 copies fail because it's not possible to set the environment variables with the Tuva bucket keys.
Everything works fine on a "single user cluster", where the user can set environment variables.

I think any config changes would likely have to occur on the DBX side, i.e. registering the S3 keys up front in the cluster configuration or registering an "external volume".
For me, the confusion originated from the dbt-databricks docs, which recommend running on the SQL Warehouse product because it's easy to monitor and debug the queries dbt generates. But I'm doubtful you can connect directly to S3 on the warehouse product without mediating via "Unity Catalog" (by design).

https://www.databricks.com/blog/admin-isolation-shared-clusters
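For posterity as well, here's a rough sketch of what the single-user-cluster workaround can look like at session level. The environment variable names are placeholders, and this is just the standard Hadoop s3a credential pattern, not necessarily what the Tuva post-hook macro itself does:

```python
# Minimal sketch, assuming a notebook on a single-user cluster where `spark`
# is predefined. Hands the session the bucket credentials via the standard
# Hadoop s3a options; the environment variable names are placeholders.
import os

spark.conf.set("fs.s3a.access.key", os.environ["TUVA_S3_ACCESS_KEY_ID"])      # placeholder env var
spark.conf.set("fs.s3a.secret.key", os.environ["TUVA_S3_SECRET_ACCESS_KEY"])  # placeholder env var

# On shared-access clusters and SQL warehouses these session-level overrides
# are restricted, which is consistent with the failure described above.
```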

@sarah-tuva sarah-tuva added the bug Something isn't working label Apr 25, 2024
@dr00b

dr00b commented May 11, 2024

Still no luck here, but it's fine. We can load the seeds on a single-user cluster, then schedule all the SQL against the warehouse (you can run tasks on separate clusters in a Databricks job).
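Roughly what that wiring can look like with the databricks-sdk (cluster IDs and notebook paths below are placeholders, not our actual setup):

```python
# Hypothetical sketch: one Databricks job, two tasks on different clusters.
# The seed load is pinned to a single-user cluster; the dbt build runs on
# another cluster, with dbt itself still pointed at the SQL warehouse via
# its profile.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

w.jobs.create(
    name="tuva-load-seeds-then-build",
    tasks=[
        jobs.Task(
            task_key="load_seeds",
            existing_cluster_id="<single-user-cluster-id>",   # placeholder
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/tuva/load_seeds"),
        ),
        jobs.Task(
            task_key="dbt_build",
            depends_on=[jobs.TaskDependency(task_key="load_seeds")],
            existing_cluster_id="<other-cluster-id>",         # placeholder
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/tuva/run_dbt"),
        ),
    ],
)
```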

But it did occur to me...

The common denominator for everyone running this project is Python. You need it to set up dbt, so you need Python to run Tuva.

Perhaps there is an approach to getting the S3 seeds up via boto3, then streaming the insert statements into whatever warehouse dbt is connected to? I'm eyeing, but haven't yet explored, the docs below...
https://docs.getdbt.com/docs/build/python-models
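For example, a rough sketch of a dbt Python model that pulls one seed with boto3 and hands the DataFrame back to dbt (bucket and key names are placeholders, not the actual Tuva bucket):

```python
# models/load_terminology_seed.py -- hypothetical dbt Python model.
import io

import boto3
import pandas as pd


def model(dbt, session):
    dbt.config(materialized="table")

    # boto3 resolves credentials from the environment / instance profile,
    # sidestepping the cluster env var restriction discussed above.
    s3 = boto3.client("s3")
    obj = s3.get_object(
        Bucket="example-terminology-bucket",   # placeholder bucket
        Key="terminology/example_seed.csv",    # placeholder key
    )
    pdf = pd.read_csv(io.BytesIO(obj["Body"].read()))

    # On Databricks, `session` is a SparkSession; returning a Spark DataFrame
    # lets dbt materialize it as a table in the target schema.
    return session.createDataFrame(pdf)
```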

cc: @sarah-tuva
