Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: introduce CDC write-side support for the Update operations #2486

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rtyler
Copy link
Member

@rtyler rtyler commented May 7, 2024

This change introduces a CDCTracker which helps collect changes during merges and update. This is admittedly rather inefficient, but my hope is that this provides a place to start iterating and improving upon the writer code

There is still additional work which needs to be done to handle table features properly for other code paths (see the middleware discussion we have had in Slack) but this produces CDC files for Update operations

Fixes #604
Fixes #2095

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label May 7, 2024
@rtyler rtyler added this to the Rust v0.18 milestone May 7, 2024
@rtyler rtyler marked this pull request as ready for review May 7, 2024 16:21
@rtyler rtyler enabled auto-merge (rebase) May 7, 2024 16:21
@rtyler rtyler requested a review from ion-elgreco May 8, 2024 22:24
@ion-elgreco
Copy link
Collaborator

I think it's better to disable it until all operations (delete and merge) are supported, otherwise we cannot push any python releases until those are added

// Add the AddCDCFile actions that exist to the commit
actions.extend(writer.close().await?.into_iter().map(|add| {
Action::Cdc(AddCDCFile {
// This is a gnarly hack, but the action needs the nested path, not the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we pass the write_action_type to the writer_config and then use some generic return types, so that based on the writer_config we either return Action::Cdc or Action::Add?

Then you don't have to add this hack into every operation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writer_config is already a mess, I do not want to add more of anything there. Since I started this pull request we added another field there 🙃

I wouldn't call this hack a hack though! That makes it sounds so ugly 😆 The concern I have with making the DeltaWriter generic with regards to the Action generated is that the Add has some more expansive requirements with file stats.

One option I considered was adding an From<Add> on AddCDCFile to at least relegate this code into a better location. I'm not happy with our writer structure at the moment anyways, and it's due for some surgery but I don't want to do that surgery here 😦 `

let input_schema = snapshot.input_schema()?;
// Wrap the scan with a CDCObserver if CDC has been abled so that the tracker can
// later be used to produce the CDC files
let scan: Arc<dyn ExecutionPlan> = match snapshot.table_config().enable_change_data_feed() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the table is at v7 writer, then we should only look at the writer_feature whether it exists or not.

In spark it's possible to disable change data feed through the config, even though the writer feature exists.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by this comment. Are you stating that our enable_change_data_feed() function is incorrect? I understand what portion of the protocol you're referring to here, but I'm not clear what change is being suggested

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That function only grabs it from the config, which is valid for writer versions 1..6, but for writer version >=7 we need to look at the writer's features in the protocol if ChangeDateFeed is enabled there.

@rtyler
Copy link
Member Author

rtyler commented May 12, 2024

I think it's better to disable it until all operations (delete and merge) are supported, otherwise we cannot push any python releases until those are added

How would you disable this? It doesn't make sense to me to include short-term configuration or feature flags to me. The protocol states that when the enable change data feed table-feature is enabled, that writers can optionally produce CDC files. Our writers just optionally will only create them on updates for now 😆

#[cfg(feature = "cdf")]
{
writer_features.insert(WriterFeatures::ChangeDataFeed);
writer_features.insert(WriterFeatures::GeneratedColumns);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? I don't remember generated columns being required

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ion-elgreco writer versions 4-6 require this 😦 [see here](If the table is on a Writer Version starting from 4 up to 6, Generated Columns are always supported.)

Writer versions before 7 are annoying , they're annoying after 7 too 🤣

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we are not adding the support here, right?

Protocol states this: "The value of delta.generationExpression SHOULD be parsed as a SQL expression.
Writers MUST enforce that any data writing to the table satisfy the condition ( <=> ) IS TRUE. <=> is the NULL-safe equal operator which performs an equality comparison like the = operator but returns TRUE rather than NULL if both operands are NULL"

We might only support v7 with CDF at this stage

This change introduces a `CDCTracker` which helps collect changes during
merges and update. This is admittedly rather inefficient, but my hope is
that this provides a place to start iterating and improving upon the
writer code

There is still additional work which needs to be done to handle table
features properly for other code paths (see the middleware discussion we
have had in Slack) but this produces CDC files for Update operations

Fixes delta-io#604
Fixes delta-io#2095
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change Data Feed in Delta Support CDC
2 participants