
Feature request: Support for Delta Lake 'CHANGE DATA FEED' in Incremental Models within dbt Databricks Adapter #630

Open
hwspuijbroek opened this issue Apr 2, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@hwspuijbroek

Describe the feature

I propose enhancing the dbt Databricks adapter to support incremental models utilizing Databricks' Delta Lake CHANGE DATA FEED feature, specifically by enabling the configuration option delta.enableChangeDataFeed=true and integrating the table_changes function within dbt's incremental strategy (merge process). This integration should facilitate more efficient and robust data transformation pipelines, especially in multi-layered data architectures like bronze-to-silver and silver-to-gold processes. The goal is to leverage Delta Lake's ability to track changes at a row level, thereby enabling more granular and efficient data processing in dbt.

Describe alternatives you've considered

Currently, the most common approach for handling incremental data transformations in dbt involves filtering data based on a specific timestamp or date column. This method, while effective in certain scenarios, can be less efficient or reliable, particularly in cases where the timestamp data may not accurately capture all necessary changes, or in complex data architectures where changes need to be tracked across multiple layers or tables.
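For context, the timestamp-based approach usually looks roughly like this in a dbt incremental model (source, table, and column names here are illustrative):

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'id'
) }}

select *
from {{ source('bronze', 'events') }}

{% if is_incremental() %}
  -- only pick up rows newer than what the target already contains;
  -- this silently misses changes that don't bump updated_at
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```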

Another alternative could be manually implementing logic to simulate the CHANGE DATA FEED feature within dbt models, but this approach would likely be cumbersome, error-prone, and not as performant as a native integration.

Additional context

The motivation for this feature request is inspired by the capabilities demonstrated in the Databricks SQL context, as outlined in the article "Incremental, Parameterized, and Dynamic SQL all on Databricks SQL". The article showcases the power of Delta Lake's CHANGE DATA FEED and the table_changes function in tracking and processing data changes efficiently. A native integration of these features within dbt's Databricks adapter would significantly enhance the tool's capabilities for handling incremental data transformations in a more granular and efficient manner.
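For reference, once delta.enableChangeDataFeed is set on a Delta table, the table_changes function described in that article can be queried directly in Databricks SQL. A minimal sketch, with table name and start version chosen for illustration:

```sql
-- Enable the change data feed on an existing Delta table
ALTER TABLE bronze.events
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Read row-level changes since a given commit version; each row carries the
-- metadata columns _change_type ('insert', 'update_preimage',
-- 'update_postimage', 'delete'), _commit_version, and _commit_timestamp
SELECT *
FROM table_changes('bronze.events', 5);
```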

Who will this benefit?

This feature would benefit data engineers and data architects who utilize dbt in conjunction with Databricks for complex data transformation and ETL pipelines, particularly in environments where data is organized into multi-tiered storage layers (e.g., bronze, silver, gold). Examples include:

Data teams managing large and rapidly changing datasets, where efficient incremental processing can lead to significant performance gains and cost savings.
Organizations implementing complex data governance and lineage tracking requirements, where the ability to capture and process granular changes at the row level can enhance data auditability and quality.
Teams migrating from traditional ETL tools to modern data stack technologies, seeking to leverage advanced features of Databricks Delta Lake within dbt workflows.
Implementing this feature would align dbt more closely with the advanced capabilities of Databricks, thereby enhancing its value proposition for data teams working in dynamic and data-intensive environments.

@hwspuijbroek hwspuijbroek added the enhancement New feature or request label Apr 2, 2024
@benc-db
Collaborator

benc-db commented Apr 2, 2024

I don't think you need any changes in the adapter to facilitate this. You can set tblproperties on your bronze tables to turn on CDC, and then in your is_incremental block, look for max _commit_version. What is missing?

@hwspuijbroek
Author

Hi @benc-db, you are correct. But you really need to know how to work with variables to retrieve the correct (latest) commit version, for example; it doesn't just work based on specific parameters. I didn't see a lot of information about this feature and how you should implement it in dbt (with variables, for example). So maybe it is a lack of knowledge on my part about this feature and its dbt implementation.

@benc-db
Collaborator

benc-db commented Apr 9, 2024

{{ config(
    materialized = 'incremental',
    tblproperties = {'delta.enableChangeDataFeed': 'true'}
)}}
...

{% if is_incremental() %}

  where _commit_version > (select max(_commit_version) from {{ this }})

{% endif %}

Let me know if this sort of approach works.
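One way to put the two pieces together is to read from the source table's change feed rather than the source table itself, so the CDF property lives on the source and the target model just filters on the commit version it has already processed. A sketch only, with source/column names and the `start_version` var assumed for illustration, not a confirmed adapter pattern:

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'id',
    incremental_strategy = 'merge'
) }}

with changes as (

    select *
    from table_changes('bronze.events', {{ var('start_version', 0) }})
    -- keep final row states only; drop pre-update images
    where _change_type in ('insert', 'update_postimage')

)

select
    id,
    payload,
    _commit_version,
    _commit_timestamp
from changes

{% if is_incremental() %}
  -- only commits newer than the last one merged into the target
  where _commit_version > (select max(_commit_version) from {{ this }})
{% endif %}
```

Because the model selects _commit_version into the target, the target itself does not need the change data feed enabled; it just stores the high-water mark as an ordinary column.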

@benc-db
Collaborator

benc-db commented Apr 9, 2024

err, the changeDataFeed should be set on the source table, not this one, I think.

@hwspuijbroek
Author

hwspuijbroek commented Apr 22, 2024

Thanks! But in this case, does Change Data Feed need to be enabled on both the source and the target to make it possible to compare the _commit_version values?
