
Issue with loading terminology seeds from S3 on Databricks #448

Open
sarah-tuva opened this issue Apr 25, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@sarah-tuva
Member

Describe the bug
Bug reported by someone on Slack. They are not able to use the post-hook macro to load seeds from S3 into their Databricks warehouse.

Struggling a bit to load terminologies to DBX w/ Unity (deets here). If I were to just load the terminologies manually, what should I comment out in the DBT project to make sure the terminology tables are not overwritten?

This turned out to be simple. For posterity:
If using a "shared access" cluster or a SQL Serverless warehouse (same idea), the S3 copies fail because it's not possible to set the environment variables with the Tuva bucket keys.
Everything works fine on a "single user cluster", where the user can set environment variables.

I think any config changes would likely have to occur on the DBX side, i.e. registering the S3 keys up front in the cluster configuration or registering an "external volume".
For me, the confusion originated from the dbt-databricks docs, which recommend running on the SQL Warehouse product because it's easy to monitor and debug the queries dbt generates. But I'm doubtful you can connect directly to S3 on the warehouse product without mediating via "Unity Catalog" (by design).

https://www.databricks.com/blog/admin-isolation-shared-clusters
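For posterity as well, here's a rough sketch of what the single-user-cluster workaround can look like at session level. The environment variable names are placeholders, and this is just the standard Hadoop s3a credential pattern, not necessarily what the Tuva post-hook macro itself does:

```python
# Minimal sketch, assuming a notebook on a single-user cluster where `spark`
# is predefined. Hands the session the bucket credentials via the standard
# Hadoop s3a options; the environment variable names are placeholders.
import os

spark.conf.set("fs.s3a.access.key", os.environ["TUVA_S3_ACCESS_KEY_ID"])      # placeholder env var
spark.conf.set("fs.s3a.secret.key", os.environ["TUVA_S3_SECRET_ACCESS_KEY"])  # placeholder env var

# On shared-access clusters and SQL warehouses these session-level overrides
# are restricted, which is consistent with the failure described above.
```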

@sarah-tuva sarah-tuva added the bug Something isn't working label Apr 25, 2024
@dr00b

dr00b commented May 11, 2024

Still no luck here, but it's fine. We can load the seeds on a single-user cluster, then schedule all the SQL against the warehouse (you can run tasks on separate clusters in a Databricks job).
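Roughly what that wiring can look like with the databricks-sdk (cluster IDs and notebook paths below are placeholders, not our actual setup):

```python
# Hypothetical sketch: one Databricks job, two tasks on different clusters.
# The seed load is pinned to a single-user cluster; the dbt build runs on
# another cluster, with dbt itself still pointed at the SQL warehouse via
# its profile.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

w.jobs.create(
    name="tuva-load-seeds-then-build",
    tasks=[
        jobs.Task(
            task_key="load_seeds",
            existing_cluster_id="<single-user-cluster-id>",   # placeholder
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/tuva/load_seeds"),
        ),
        jobs.Task(
            task_key="dbt_build",
            depends_on=[jobs.TaskDependency(task_key="load_seeds")],
            existing_cluster_id="<other-cluster-id>",         # placeholder
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/tuva/run_dbt"),
        ),
    ],
)
```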

But it did occur to me...

The common denominator for everyone running this project is Python. You need it to set up dbt, so you need Python to run Tuva.

Perhaps there is an approach to getting the S3 seeds up via boto3, then streaming the insert statements into whatever warehouse dbt is connected to? I'm eyeing, but haven't yet explored, the docs below...
https://docs.getdbt.com/docs/build/python-models
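For example, a rough sketch of a dbt Python model that pulls one seed with boto3 and hands the DataFrame back to dbt (bucket and key names are placeholders, not the actual Tuva bucket):

```python
# models/load_terminology_seed.py -- hypothetical dbt Python model.
import io

import boto3
import pandas as pd


def model(dbt, session):
    dbt.config(materialized="table")

    # boto3 resolves credentials from the environment / instance profile,
    # sidestepping the cluster env var restriction discussed above.
    s3 = boto3.client("s3")
    obj = s3.get_object(
        Bucket="example-terminology-bucket",   # placeholder bucket
        Key="terminology/example_seed.csv",    # placeholder key
    )
    pdf = pd.read_csv(io.BytesIO(obj["Body"].read()))

    # On Databricks, `session` is a SparkSession; returning a Spark DataFrame
    # lets dbt materialize it as a table in the target schema.
    return session.createDataFrame(pdf)
```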

cc: @sarah-tuva
