MERGE should raise when multiple source rows match target row #2407

Open
mrjsj opened this issue Apr 10, 2024 · 0 comments
Labels: binding/python (Issues for the Python package), binding/rust (Issues for the Rust crate), bug (Something isn't working)


mrjsj commented Apr 10, 2024

Environment

Delta-rs version: 0.16.4

Binding: Python 0.16.4

Environment:

  • Cloud provider: None
  • OS: MacOS Sonoma 14.4
  • Other:

Bug

What happened:
TableMerger.when_matched_update_all() inserts new records when a target row is matched by multiple source rows
TableMerger.when_matched_update() inserts new records when a target row is matched by multiple source rows

What you expected to happen:
TableMerger.when_matched_update_all() should throw an error if a target record matches multiple source records
TableMerger.when_matched_update() should throw an error if a target record matches multiple source records

They should never insert new records.
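
The expected behaviour can be sketched in plain Python, independent of delta-rs. This is only an illustration of the check being requested, not how deltalake implements or would implement it; `MultipleSourceMatchError` and `assert_single_source_match` are hypothetical names made up for this sketch, and rows are modelled as simple dicts joined on a single key.

```python
from collections import Counter


class MultipleSourceMatchError(ValueError):
    """Hypothetical error type for this sketch; not a deltalake class."""


def assert_single_source_match(target_rows, source_rows, key):
    """Raise if any target row is matched by more than one source row on `key`."""
    target_keys = {row[key] for row in target_rows}
    # Count source rows per key, considering only keys present in the target.
    matches = Counter(row[key] for row in source_rows if row[key] in target_keys)
    ambiguous = sorted(k for k, n in matches.items() if n > 1)
    if ambiguous:
        raise MultipleSourceMatchError(
            f"multiple source rows match target rows with {key} in {ambiguous}"
        )


target = [{"id": 1, "attr": "x"}, {"id": 2, "attr": "y"}]
source = [{"id": 1, "attr": "a"}, {"id": 1, "attr": "b"}, {"id": 2, "attr": "d"}]

try:
    assert_single_source_match(target, source, "id")
    raised = False
except MultipleSourceMatchError:
    raised = True
```

Here `id` 1 appears twice in the source, so the check raises instead of letting an ambiguous update go through. Duplicate source keys that do not exist in the target are ignored by this sketch, since they cannot produce an ambiguous update.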

How to reproduce it:
Using polars 0.20.7

import polars as pl
from deltalake import DeltaTable

base_df = pl.DataFrame(
    {
        "id": [1, 2],
        "attr": ["x", "y"]
    }
)

base_df.write_delta("./test_cdc", mode="overwrite")

dt = DeltaTable("./test_cdc")

print(pl.DataFrame(dt.to_pyarrow_table()))

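# Source with duplicate keys: three rows for id=1 and two rows for id=2,
# so each target row is matched by more than one source row.
cdc_df = pl.DataFrame(
    {
        "id": [1, 1, 1, 2, 2],
        "attr": ["a", "b", "c", "d", "e"],
        "op": ["U", "U", "U", "U", "U"]
    }
)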


(
    dt.merge(
        cdc_df.to_arrow(),
        "s.id = t.id",
        source_alias="s",
        target_alias="t",
    )
    .when_matched_update(
        updates={"t.attr": "s.attr"},
        predicate="s.op = 'U'",
    )
    .execute()
)

print(pl.DataFrame(dt.to_pyarrow_table()))

This gives the following output:

Base table:

shape: (2, 2)
┌─────┬──────┐
│ id  ┆ attr │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ x    │
│ 2   ┆ y    │
└─────┴──────┘

After TableMerger is executed:

shape: (5, 2)
┌─────┬──────┐
│ id  ┆ attr │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 2   ┆ d    │
│ 2   ┆ e    │
│ 1   ┆ a    │
│ 1   ┆ b    │
│ 1   ┆ c    │
└─────┴──────┘
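
The five rows are consistent with the update emitting one output row per matching (target, source) pair: three pairs for `id` 1 plus two pairs for `id` 2. The plain-Python sketch below reproduces that count; it is an assumption about why the row count grows, not a description of the delta-rs internals, and `matched_update_rows` is a hypothetical helper written only for this illustration.

```python
def matched_update_rows(target_rows, source_rows, key, column):
    """Emit one updated copy of a target row per matching source row."""
    out = []
    for t in target_rows:
        for s in source_rows:
            if s[key] == t[key]:
                # Each match yields its own row, so duplicate source keys fan out.
                out.append({**t, column: s[column]})
    return out


target = [{"id": 1, "attr": "x"}, {"id": 2, "attr": "y"}]
source = [{"id": i, "attr": a} for i, a in zip([1, 1, 1, 2, 2], "abcde")]

rows = matched_update_rows(target, source, "id", "attr")
print(len(rows))
```

With the data from the reproduction, the two target rows turn into five rows, matching the shape reported above. The sketch only covers matched rows; target rows without a match are not handled here.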

More details:
Info on the specific case in slack: https://delta-users.slack.com/archives/C013LCAEB98/p1712673309723829

@mrjsj added the bug label on Apr 10, 2024
@ion-elgreco changed the title from "TableMerger.when_matched_update() wrongly inserts rows" to "MERGE should raise when multiple source rows match target row" on Apr 10, 2024
@ion-elgreco added the binding/python and binding/rust labels on Apr 10, 2024