
Possible to extend to data vault 2.0 with Spark and/or BigQuery as target? #31

Open
dkapitan opened this issue Jan 22, 2022 · 3 comments

dkapitan commented Jan 22, 2022

Seems like I am the first one to post an issue here 😄. Very interested in diepvries; I am looking into it for one of my Dutch clients. We are considering the data vault 2.0 pattern for the data lake.

As described in this article, there are a couple of issues that need solving before data vault can be used on data lake technologies like Databricks/Delta Lake or Google BigQuery. The most fundamental one is the limited or non-existent support for constraints and foreign keys.

The article proposes using deterministic hash keys, which I can see working. So my question is: do you think it is doable to extend diepvries so that it can generate DDL for the target schema of the Raw Vault (that's our current scope, so not full DV2.0) for different target engines, most notably Databricks / Delta Lake and BigQuery?
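
For what it's worth, here is a minimal sketch of the deterministic hash key idea, assuming the common DV2.0 convention of hashing delimiter-joined business keys with MD5; diepvries or the article may differ in details such as delimiter, casing, or hash function:

```python
# Hypothetical sketch of a deterministic hash key, per common DV2.0 practice.
import hashlib


def hashkey(*business_keys: str, delimiter: str = "|") -> str:
    """Derive a deterministic hash key from one or more business keys."""
    normalized = delimiter.join(key.strip().upper() for key in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# The same business key always yields the same hash key, so hubs, links and
# satellites can be loaded independently, without surrogate key lookups or
# database-enforced foreign keys.
assert hashkey("customer-42") == hashkey("customer-42")
```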

Any suggestions are welcome, and if this is a workable solution we would be willing and able to contribute to diepvries.

matthieucan (Collaborator) commented

Hey @dkapitan, thanks for your interest in diepvries!

Correct me if I'm wrong, but what you'd like to have is the ability for diepvries to generate SQL loading statements for other dialects than Snowflake, e.g. BigQuery and Databricks? Do you also have other architectural changes in mind, e.g. some structures in the raw vault that can't be loaded by diepvries?

To keep you up to date with our vision for the future of this framework, we are currently experimenting with SQLAlchemy for all generated SQL statements, in order to support other SQL dialects. Using SQLAlchemy doesn't automatically solve all problems, as some SQL statements differ between dialects and are not always 100% supported (e.g. MERGE statements), but we believe it would be a good first step in that direction.
Once we're there, generating DDL statements (on top of the DML statements for loading data) for target structures should be possible, as SQLAlchemy allows that.
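
To make that concrete, here is a minimal sketch of dialect-aware DDL generation with SQLAlchemy Core; the table and column names are invented for illustration and are not taken from diepvries:

```python
# Hypothetical sketch: one table definition, dialect-specific DDL output.
from sqlalchemy import Column, MetaData, String, Table
from sqlalchemy.dialects import mssql, postgresql
from sqlalchemy.schema import CreateTable

metadata = MetaData()
hub_customer = Table(
    "hub_customer",
    metadata,
    Column("customer_hashkey", String(32), primary_key=True),
    Column("customer_id", String(50)),
)

# The same Table object compiles to DDL in whatever dialect is requested;
# a Snowflake or BigQuery dialect would slot in the same way, provided the
# constructs used are supported there.
print(CreateTable(hub_customer).compile(dialect=postgresql.dialect()))
print(CreateTable(hub_customer).compile(dialect=mssql.dialect()))
```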

dkapitan commented Feb 1, 2022

Hi @matthieucan, thanks for your response. Regarding your questions:

> Correct me if I'm wrong, but what you'd like to have is the ability for diepvries to generate SQL loading statements for other dialects than Snowflake, e.g. BigQuery and Databricks?

That is indeed what I am looking for.

> Do you also have other architectural changes in mind, e.g. some structures in the raw vault that can't be loaded by diepvries?

For now, I just want to follow along with diepvries's existing architecture. As far as I understand, it follows the current Data Vault 2.0 specification. Once we have a generic SQL back-end interface, we can start thinking about other features.

Going with SQLAlchemy makes perfect sense to me. Please let me know if there is any way I could help. For example, generating an overview of the key differences between dialects seems useful for refining our needs. I have worked a lot with BigQuery and am getting into Apache Spark more and more, so I could help with those.

matthieucan (Collaborator) commented

> Correct me if I'm wrong, but what you'd like to have is the ability for diepvries to generate SQL loading statements for other dialects than Snowflake, e.g. BigQuery and Databricks?
>
> That is indeed what I am looking for.

Understood!

> Going with SQLAlchemy makes perfect sense to me. Please let me know if there is any way I could help. For example, generating an overview of the key differences between dialects seems useful for refining our needs. I have worked a lot with BigQuery and am getting into Apache Spark more and more, so I could help with those.

While we can't commit to any timeline for this implementation, you might take inspiration from this file on a proof-of-concept branch: https://github.com/PicnicSupermarket/diepvries/blob/matthieucan/sqlalchemy/src/diepvries/test_sqlalchemy.py
We intend to implement this for Snowflake as a first step. I think playing with those queries to estimate the effort of making them compatible with BigQuery and Spark would be very valuable!
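
In that spirit, one hypothetical way to start probing compatibility, assuming the third-party sqlalchemy-bigquery package (pip install sqlalchemy-bigquery) and an invented table:

```python
# Hypothetical sketch: compile a statement against the BigQuery dialect
# without connecting, to spot unsupported constructs early.
from sqlalchemy import Column, MetaData, String, Table, select
from sqlalchemy_bigquery import BigQueryDialect

metadata = MetaData()
hub_customer = Table(
    "hub_customer",
    metadata,
    Column("customer_hashkey", String),
)

print(
    select(hub_customer.c.customer_hashkey).compile(dialect=BigQueryDialect())
)
```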
