
Feature request: Support for Delta Lake 'CHANGE DATA FEED' in Incremental Models within dbt Databricks Adapter #630

Open
hwspuijbroek opened this issue Apr 2, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@hwspuijbroek

Describe the feature

I propose enhancing the dbt Databricks adapter to support incremental models utilizing Databricks' Delta Lake CHANGE DATA FEED feature, specifically by enabling the configuration option delta.enableChangeDataFeed=true and integrating the table_changes function within dbt's incremental strategy (merge process). This integration should facilitate more efficient and robust data transformation pipelines, especially in multi-layered data architectures like bronze-to-silver and silver-to-gold processes. The goal is to leverage Delta Lake's ability to track changes at a row level, thereby enabling more granular and efficient data processing in dbt.

Describe alternatives you've considered

Currently, the most common approach for handling incremental data transformations in dbt involves filtering data based on a specific timestamp or date column. This method, while effective in certain scenarios, can be less efficient or reliable, particularly in cases where the timestamp data may not accurately capture all necessary changes, or in complex data architectures where changes need to be tracked across multiple layers or tables.
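For context, the timestamp-based approach usually looks roughly like this in a dbt incremental model (source, table, and column names here are illustrative):

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'id'
) }}

select *
from {{ source('bronze', 'events') }}

{% if is_incremental() %}
  -- only pick up rows newer than what the target already contains;
  -- this silently misses changes that don't bump updated_at
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```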

Another alternative could be manually implementing logic to simulate the CHANGE DATA FEED feature within dbt models, but this approach would likely be cumbersome, error-prone, and not as performant as a native integration.

Additional context

The motivation for this feature request is inspired by the capabilities demonstrated in the Databricks SQL context, as outlined in the article "Incremental, Parameterized, and Dynamic SQL all on Databricks SQL". The article showcases the power of Delta Lake's CHANGE DATA FEED and the table_changes function in tracking and processing data changes efficiently. A native integration of these features within dbt's Databricks adapter would significantly enhance the tool's capabilities for handling incremental data transformations in a more granular and efficient manner.
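For reference, once delta.enableChangeDataFeed is set on a Delta table, the table_changes function described in that article can be queried directly in Databricks SQL. A minimal sketch, with table name and start version chosen for illustration:

```sql
-- Enable the change data feed on an existing Delta table
ALTER TABLE bronze.events
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Read row-level changes since a given commit version; each row carries the
-- metadata columns _change_type ('insert', 'update_preimage',
-- 'update_postimage', 'delete'), _commit_version, and _commit_timestamp
SELECT *
FROM table_changes('bronze.events', 5);
```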

Who will this benefit?

This feature would benefit data engineers and data architects who utilize dbt in conjunction with Databricks for complex data transformation and ETL pipelines, particularly in environments where data is organized into multi-tiered storage layers (e.g., bronze, silver, gold). Examples include:

Data teams managing large and rapidly changing datasets, where efficient incremental processing can lead to significant performance gains and cost savings.
Organizations implementing complex data governance and lineage tracking requirements, where the ability to capture and process granular changes at the row level can enhance data auditability and quality.
Teams migrating from traditional ETL tools to modern data stack technologies, seeking to leverage advanced features of Databricks Delta Lake within dbt workflows.
Implementing this feature would align dbt more closely with the advanced capabilities of Databricks, thereby enhancing its value proposition for data teams working in dynamic and data-intensive environments.

@hwspuijbroek hwspuijbroek added the enhancement New feature or request label Apr 2, 2024
@benc-db
Collaborator

benc-db commented Apr 2, 2024

I don't think you need any changes in the adapter to facilitate this. You can set tblproperties on your bronze tables to turn on CDC, and then in your is_incremental block, look for max _commit_version. What is missing?

@hwspuijbroek
Author

Hi @benc-db, you are correct. But you really need to know how to work with variables to retrieve the correct (latest) commit version, for example; it doesn't just work based on specific parameters. I didn't see a lot of information about this feature and how you should implement it in dbt (with variables, for example). So maybe it is a lack of knowledge on my part about this feature and its dbt implementation.

@benc-db
Collaborator

benc-db commented Apr 9, 2024

{{ config(
    materialized = 'incremental',
    tblproperties = {'delta.enableChangeDataFeed': 'true'}
)}}
...

{% if is_incremental() %}

  where _commit_version > (select max(_commit_version) from {{ this }})

{% endif %}

Let me know if this sort of approach works.
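One way to put the two pieces together is to read from the source table's change feed rather than the source table itself, so the CDF property lives on the source and the target model just filters on the commit version it has already processed. A sketch only, with source/column names and the `start_version` var assumed for illustration, not a confirmed adapter pattern:

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'id',
    incremental_strategy = 'merge'
) }}

with changes as (

    select *
    from table_changes('bronze.events', {{ var('start_version', 0) }})
    -- keep final row states only; drop pre-update images
    where _change_type in ('insert', 'update_postimage')

)

select
    id,
    payload,
    _commit_version,
    _commit_timestamp
from changes

{% if is_incremental() %}
  -- only commits newer than the last one merged into the target
  where _commit_version > (select max(_commit_version) from {{ this }})
{% endif %}
```

Because the model selects _commit_version into the target, the target itself does not need the change data feed enabled; it just stores the high-water mark as an ordinary column.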

@benc-db
Collaborator

benc-db commented Apr 9, 2024

err, the changeDataFeed should be set on the source table, not this one, I think.

@hwspuijbroek
Author

hwspuijbroek commented Apr 22, 2024

Thanks! But in this case, does Change Data Feed need to be enabled on both the source and the target to make it possible to compare the _commit_version values?
